A/B Testing

What's being tested

Interviewers are probing whether you can design, diagnose, and interpret online experiments for ranking, feed, video, shopping, and homepage surfaces where user behavior is noisy and product decisions are high-stakes. For Pinterest, a Data Scientist must connect statistical rigor to user value: a feed-ranking change may increase CTR while hurting long-term saves, creator diversity, or shopping intent. You are expected to define metrics, choose the right randomization unit, estimate power, detect experiment validity issues, and explain causal impact clearly. You may also be asked what to do when clean randomization is unavailable, which tests your ability to reason about causal identification rather than just run a t-test.

Core knowledge

Randomized controlled trials estimate causal effects by making treatment independent of potential outcomes: $T \perp (Y(1), Y(0))$. For Pinterest surfaces, randomize at the user_id level when exposure persists across sessions; pageview-level randomization can create contamination if users see mixed feed experiences.
Primary metrics should map to the product hypothesis. For a feed-ranking algorithm, candidate primary metrics include engaged_sessions_per_user, saves_per_user, closeups_per_user, or repins_per_user; for video pins, video_starts, watch_time, completion rate, and downstream saves may matter more than raw impressions.
Guardrail metrics protect against local wins that harm the ecosystem. Common Pinterest-style guardrails include session_length, hide_rate, report_rate, creator_distribution, search_return_rate, notification_unsub_rate, latency-sensitive engagement, and shopping funnel health such as product_clicks or checkout_intent.
Hypothesis framing should be explicit: null $H_0: \Delta = 0$ versus alternative $H_A: \Delta \ne 0$ or $\Delta > 0$. Define the minimum detectable effect (MDE) before launch, e.g., “detect a 0.5% relative lift in saves per user at 80% power and $\alpha=0.05$.”
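Power and sample size depend on variance, baseline rate, significance level, and MDE. For a mean metric with equal-sized arms and a two-sided test, the approximate per-arm sample size is $n \approx \frac{2(z_{1-\alpha/2}+z_{1-\beta})^2\sigma^2}{\delta^2}$, where $\sigma^2$ is the per-user variance of the metric, $1-\beta$ is the target power, and $\delta$ is the absolute MDE. Because $n$ scales with $1/\delta^2$, halving the MDE roughly quadruples the users needed per arm.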
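Ratio metrics like CTR = clicks / impressions require care for two reasons: numerator and denominator are both affected by ranking, and impressions from the same user are correlated, so the unit of analysis is finer than the unit of randomization. Prefer user-level aggregation, the delta method, the bootstrap, or cluster-robust standard errors rather than treating every impression as independent.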
CUPED can reduce variance by adjusting for pre-experiment behavior: $Y_{adj}=Y-\theta(X-\bar X)$, where $X$ is a pre-period covariate and $\theta=\text{Cov}(Y,X)/\text{Var}(X)$. It works well for stable user-level metrics like historical saves or prior engagement.
Sample ratio mismatch is a validity red flag: if traffic is intended 50/50 but observed assignment is 49/51 with large $N$, run a chi-square goodness-of-fit test on the assignment counts. Causes may include targeting filters, logging gaps, eligibility rules, or assignment bugs; do not interpret treatment effects until the mismatch is explained.
Multiple testing inflates false positives when slicing by many metrics or segments. Pre-register one primary metric, treat secondary metrics as diagnostic, and use methods such as Bonferroni correction, Benjamini-Hochberg FDR, or hierarchical decision rules when many comparisons drive launch decisions.
Heterogeneous treatment effects matter for Pinterest because new users, power users, creators, shoppers, and video-heavy users may respond differently. Segment analysis should be hypothesis-driven; avoid “segment fishing” unless clearly labeled exploratory and validated in a follow-up experiment.
Novelty and learning effects can distort short-term results. A new shopping module may initially attract curiosity clicks but not durable purchase intent; a ranking change may need days for users and models to adapt. Consider ramp duration, burn-in windows, and cohort-based readouts.
Non-randomized causal methods require identification assumptions. Without a randomized control group, use difference-in-differences, synthetic control, interrupted time series, propensity score weighting, or matched controls, but clearly state assumptions like parallel trends, no concurrent shocks, and stable treatment exposure.
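
The per-arm sample-size approximation from the power item above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the baseline mean, standard deviation, and function name are hypothetical values chosen for the example, not Pinterest figures.

```python
import math
from statistics import NormalDist


def per_arm_sample_size(sigma, delta, alpha=0.05, power=0.80):
    """Approximate users per arm for a two-sided test on a difference in means.

    sigma: per-user standard deviation of the metric
    delta: absolute minimum detectable effect (MDE)
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # z_{1 - alpha/2}
    z_beta = NormalDist().inv_cdf(power)           # z_{1 - beta}
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2)


# Hypothetical inputs: 4.0 saves per user on average, std dev 10.0,
# and a 0.5% relative MDE converted to an absolute difference.
baseline_mean = 4.0
sigma = 10.0
delta = 0.005 * baseline_mean

n = per_arm_sample_size(sigma, delta)
n_half = per_arm_sample_size(sigma, delta / 2)  # halve the MDE
print(f"users per arm: {n:,}")
print(f"users per arm at half the MDE: {n_half:,}")
```

With a heavy-tailed metric and a small relative MDE, the required sample runs into the millions per arm, which is why variance reduction such as CUPED is usually worth discussing alongside power.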

Worked example

For “Evaluate New Feed-Ranking Algorithm with A/B Testing”, a strong candidate would start by clarifying the product goal: is the new ranker optimizing engagement, long-term retention, content relevance, shopping intent, or creator ecosystem quality? They would ask about eligible users, exposure surface, rollout constraints, and whether the model changes only ranking or also candidate generation. The answer can be organized around four pillars: experiment design, metric framework, statistical plan, and diagnostics.

For design, randomize at the user_id level to avoid within-user contamination and keep treatment sticky across sessions. For metrics, choose one primary metric such as saves_per_user or engaged_sessions_per_user, plus guardrails like hide_rate, report_rate, session_length, and content diversity. For the statistical plan, define the MDE, estimate sample size using historical variance, and specify the inference method, likely user-level difference in means with CUPED if pre-period engagement is predictive. For diagnostics, check sample ratio mismatch, event logging sanity, exposure rates, pre-period balance, and whether treatment actually changed ranking outputs such as average rank position or content mix.
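
Two pieces of that plan, the sample ratio mismatch check and the CUPED adjustment, can be sketched as follows. This is a simplified illustration on simulated data: the function names, the simulated effect size, and the traffic counts are assumptions made for the example, and the chi-square p-value uses the closed form for one degree of freedom rather than a statistics library.

```python
import math
import random
from statistics import mean, variance


def srm_p_value(n_control, n_treatment, expected_treatment_share=0.5):
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for assignment counts."""
    total = n_control + n_treatment
    exp_t = total * expected_treatment_share
    exp_c = total - exp_t
    chi2 = (n_control - exp_c) ** 2 / exp_c + (n_treatment - exp_t) ** 2 / exp_t
    # For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


def cuped_adjust(y, x):
    """Return Y - theta * (X - mean(X)), with theta estimated on pooled data."""
    x_bar, y_bar = mean(x), mean(y)
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)
    theta = cov_xy / variance(x)
    return [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)]


# SRM: a 49/51 split on one million users versus a near-even split.
p_bad = srm_p_value(490_000, 510_000)
p_ok = srm_p_value(500_300, 499_700)

# CUPED on simulated users: pre-period saves predict in-experiment saves.
rng = random.Random(0)
pre = [rng.gauss(5.0, 2.0) for _ in range(2000)]
arm = [i % 2 for i in range(2000)]  # 0 = control, 1 = treatment
post = [0.8 * x + rng.gauss(0.0, 1.0) + 0.1 * t for x, t in zip(pre, arm)]
adjusted = cuped_adjust(post, pre)

lift = (mean(a for a, t in zip(adjusted, arm) if t == 1)
        - mean(a for a, t in zip(adjusted, arm) if t == 0))
print(f"SRM p-values: {p_bad:.3g} (49/51), {p_ok:.3g} (near 50/50)")
print(f"variance before/after CUPED: {variance(post):.2f} / {variance(adjusted):.2f}")
print(f"CUPED-adjusted lift estimate: {lift:.3f}")
```

The adjustment leaves the overall mean unchanged and only removes variance explained by the pre-period covariate, so the covariate must be measured before assignment to stay unaffected by treatment.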

One tradeoff to flag is that CTR may improve if the ranker promotes clickbait-like pins, while saves or long-term return behavior may worsen; therefore, launch criteria should not rely on a single shallow engagement metric. A polished close would be: “If I had more time, I’d inspect heterogeneous effects for new versus retained users and run a longer holdout or follow-up to measure durability.”

A second angle

For “Recover causal effect without a control group”, the same experimentation mindset applies, but the central issue shifts from randomization to identification. Instead of saying “just compare before and after,” a strong candidate would ask whether there is a plausible untreated comparison: unaffected geographies, ineligible users, similar surfaces, or historical time periods. If no direct control exists, they might propose interrupted time series with seasonality controls, synthetic control built from comparable cohorts, or difference-in-differences if a credible comparison group can be found. The answer should emphasize assumptions and validation: pre-trend checks, placebo intervention dates, negative-control metrics, and sensitivity to concurrent launches. The conclusion should be probabilistic rather than overconfident, because observational estimates are usually less defensible than a clean randomized test.
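
As a rough illustration of the difference-in-differences option and the placebo-date check mentioned above, the sketch below uses a made-up weekly series for an affected group and a comparison group. The numbers and the function name are hypothetical, and a real analysis would also need standard errors and seasonality controls.

```python
from statistics import mean


def did_estimate(treated, comparison, launch_index):
    """(treated post - pre) minus (comparison post - pre), using period means."""
    t_pre, t_post = treated[:launch_index], treated[launch_index:]
    c_pre, c_post = comparison[:launch_index], comparison[launch_index:]
    return (mean(t_post) - mean(t_pre)) - (mean(c_post) - mean(c_pre))


# Hypothetical weekly saves per user; the launch happens before week index 4.
treated = [10.0, 10.2, 10.4, 10.6, 11.3, 11.5, 11.7, 11.9]
comparison = [8.0, 8.2, 8.4, 8.6, 8.8, 9.0, 9.2, 9.4]
launch_index = 4

effect = did_estimate(treated, comparison, launch_index)

# Placebo check: pretend the launch happened at index 2, using pre-period data only.
# An estimate far from zero here would cast doubt on parallel trends.
placebo = did_estimate(treated[:launch_index], comparison[:launch_index], 2)

print(f"DiD estimate: {effect:.2f}")
print(f"placebo estimate on pre-period: {placebo:.2f}")
```

Both groups share the same upward trend in this toy series, so the placebo estimate is near zero and the post-launch gap is attributed to the change; with real data the pre-trends rarely line up this cleanly.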

Common pitfalls

Pitfall: Treating impressions as independent observations.

A tempting but wrong answer is to run a two-proportion z-test on all pin impressions and declare significance from millions of rows. The better approach is to aggregate to the randomization unit, usually user_id, because repeated impressions from the same user are correlated and the treatment is assigned at the user level.
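
To see how much the impression-level shortcut can understate uncertainty, the sketch below compares a naive binomial standard error for CTR with a delta-method (linearization) standard error computed at the user level. The data are simulated under assumed per-user click propensities and impression counts; none of the numbers come from a real experiment.

```python
import math
import random
from statistics import mean


def ctr_with_delta_se(clicks, impressions):
    """CTR = sum(clicks) / sum(impressions), with a user-level delta-method SE."""
    n = len(clicks)
    mean_impr = mean(impressions)
    ctr = mean(clicks) / mean_impr
    # Linearized per-user residuals; their mean is zero by construction.
    resid = [(c - ctr * i) / mean_impr for c, i in zip(clicks, impressions)]
    var_resid = sum(e * e for e in resid) / (n - 1)
    return ctr, math.sqrt(var_resid / n)


def naive_se(ctr, total_impressions):
    """Binomial SE that wrongly treats every impression as independent."""
    return math.sqrt(ctr * (1 - ctr) / total_impressions)


# Simulated users: each has their own click propensity and impression count.
rng = random.Random(1)
clicks, impressions = [], []
for _ in range(500):
    p_user = rng.betavariate(2, 18)   # assumed spread of per-user CTR
    n_impr = rng.randint(1, 200)      # assumed impressions per user
    impressions.append(n_impr)
    clicks.append(sum(rng.random() < p_user for _ in range(n_impr)))

ctr, delta_se = ctr_with_delta_se(clicks, impressions)
wrong_se = naive_se(ctr, sum(impressions))
print(f"CTR: {ctr:.4f}")
print(f"delta-method SE (user level): {delta_se:.5f}")
print(f"naive SE (impression level):  {wrong_se:.5f}")
```

Because users differ in their baseline click propensity, the user-level standard error comes out noticeably larger than the naive one, which is the gap that produces spurious significance on millions of rows.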

Pitfall: Optimizing for the metric that moved instead of the metric that matters.

Candidates often say “CTR increased, so launch,” without asking whether the change harmed saves, long-term retention, shopping conversion quality, or negative feedback. A stronger answer separates the primary decision metric, secondary diagnostics, and guardrails, then explains how conflicting movements would be adjudicated.

Pitfall: Hand-waving causal inference when no control exists.

For no-control scenarios, “use regression” is not enough. Interviewers want to hear the identification assumption, why it might be plausible, how you would test it with pre-trends or placebo checks, and what uncertainty remains after the analysis.

Connections

This topic often pivots into metric design, causal inference, ranking evaluation, product analytics, and sequential testing. Be ready to discuss how offline recommender metrics like NDCG, calibration, or relevance labels relate imperfectly to online A/B outcomes, and how to diagnose metric movements by cohort, funnel stage, and content type.
