A/B Testing

What's being tested

Meta Data Scientist interviews for A/B testing probe whether you can turn an ambiguous product or algorithm change into a credible causal measurement plan. The interviewer is looking for judgment on randomization unit, metric choice, guardrails, power, interference, and how you would interpret messy results like “engagement up, quality down” or “revenue up, user experience down.” This matters at Meta because feeds, ads, messaging, calling, and integrity systems have network effects, repeated exposure, ranking algorithms, and heterogeneous user populations. A strong answer is not just “run a 50/50 A/B test”; it explains what treatment means, who is eligible, what can bias the estimate, and what decision the experiment will support.

Core knowledge

Average treatment effect is the default estimand: $ATE = E[Y_i(1) - Y_i(0)]$. You should state whether you care about user-level, session-level, advertiser-level, creator-level, or call-level impact, because changing the estimand changes randomization, variance, and interpretation.
Randomization unit should match exposure. For a feed recommender, user-level assignment is usually clean because each user receives one ranking policy. For WhatsApp calls, caller-level assignment can contaminate callee experience, so pair-level, conversation-level, or cluster-level assignment may be needed.
SUTVA assumes no interference and one version of treatment. Meta products often violate this: ads auctions create advertiser competition, bot mitigation changes organic interaction patterns, and social feeds create spillovers across friends. Name the risk and propose cluster randomization, holdouts, marketplace-level tests, or explicit spillover measurement.
Primary metric should align with the launch decision. For Instagram short-video ranking, a reasonable primary metric might be user-level watch_time_per_DAU, qualified views, or sessions with meaningful engagement. For ads, candidates include incremental advertiser value, CTR, conversion value, or revenue, but only alongside user-experience guardrails.
Guardrail metrics catch regressions outside the primary goal. Common Meta guardrails include DAU, session starts, hides, reports, unfollows, negative feedback, app crashes, latency, call drops, ad load, long-term retention, creator distribution, and integrity metrics such as spam exposure or bot prevalence.
Power and MDE should be discussed quantitatively, not hand-waved. For a two-sample test with equal allocation, approximate per-arm sample size is $n \approx \frac{2\sigma^2(z_{1-\alpha/2}+z_{1-\beta})^2}{\delta^2}$, where $\sigma^2$ is the outcome variance, $\alpha$ is the two-sided significance level, $1-\beta$ is the power, and $\delta$ is the minimum detectable effect. Heavy-tailed metrics like revenue or watch time often require winsorization, log transforms, or user-level aggregation.
Unit of analysis should usually match unit of randomization. If users are randomized, compute user-level outcomes and compare means across users. Treating millions of sessions as independent when only thousands of users were randomized underestimates standard errors and creates false positives.
Variance reduction methods like CUPED improve sensitivity when pre-period behavior predicts outcome: $Y_i^{adj}=Y_i-\theta(X_i-\bar X),\quad \theta=\frac{Cov(Y,X)}{Var(X)}$. This is especially useful for stable metrics like watch time, call reliability, or advertiser spend. Use only pre-treatment covariates, because a covariate affected by treatment biases the adjusted estimate.
Experiment duration must cover product cycles and novelty effects. A feed ranking change may need at least 1–2 full weekly cycles; ads changes may need advertiser budget pacing cycles; bot mitigation may show adaptation as adversaries respond. Do not stop just because the first day is significant.
Multiple testing matters when many segments, metrics, or variants are examined. Pre-register the primary metric and key guardrails; use false discovery rate, Bonferroni, or hierarchical testing for large metric families. Segment cuts like country, device, age band, and new versus existing users should be framed as diagnostic unless powered.
Conflicting signals require a decision framework. If revenue rises but negative feedback and hides increase, do not average them casually. Explain tradeoffs, quantify practical significance, check heterogeneity, inspect distributional effects, and recommend launch, no-launch, ramp, or iteration based on pre-agreed thresholds.
Ramp strategy is part of experimental reasoning. Start with a small exposure for safety, then expand to powered allocation after instrumentation checks. For networked or marketplace systems, keep a persistent holdout when long-term effects, equilibrium changes, or adversarial adaptation are expected.
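
The per-arm sample size and CUPED formulas above can be sketched in Python with only the standard library. This is a minimal illustration under stated assumptions, not a production implementation: the function names `per_arm_sample_size` and `cuped_adjust`, the default `alpha` and `power`, and the toy data are all hypothetical and chosen for the example.

```python
import math
from statistics import NormalDist, mean, pvariance


def per_arm_sample_size(sigma, delta, alpha=0.05, power=0.80):
    """Approximate per-arm n for a two-sided, equal-allocation two-sample test.

    sigma: outcome standard deviation at the unit of randomization.
    delta: minimum detectable effect, in the same units as the outcome.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * sigma**2 * (z_alpha + z_beta) ** 2 / delta**2)


def cuped_adjust(y, x):
    """Return CUPED-adjusted outcomes Y - theta * (X - mean(X)).

    y: in-experiment outcome per unit; x: pre-period covariate per unit.
    In practice theta is estimated on data pooled across both arms.
    """
    x_bar, y_bar = mean(x), mean(y)
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    var_x = sum((xi - x_bar) ** 2 for xi in x)
    theta = cov_xy / var_x
    return [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)]


# Hypothetical toy data: pre-period and in-experiment watch time per user.
pre_period = [1.0, 2.0, 3.0, 4.0, 5.0]
outcome = [2.1, 3.9, 6.2, 7.8, 10.0]

adjusted = cuped_adjust(outcome, pre_period)
n_needed = per_arm_sample_size(sigma=1.0, delta=0.1)

print(n_needed)
print(pvariance(outcome), pvariance(adjusted))
```

With a standard deviation of 1 and an MDE of 0.1, the formula asks for roughly 1,570 units per arm. The adjustment leaves the mean unchanged and shrinks the variance when the pre-period covariate is correlated with the outcome, which is what lowers the required sample size.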

Worked example

For the question “Measure impact of bot mitigation via experiment,” a strong candidate would start by clarifying what the mitigation changes: does it reduce bot account creation, demote suspected bot actions, block messages, or remove fake engagement from ranking signals? Then they would define the decision: “I want to estimate the causal effect on authentic user experience and platform integrity, while ensuring we do not accidentally suppress legitimate users.” The answer can be organized around four pillars: randomization and eligibility, primary and guardrail metrics, power and duration, and bias/interference risks.

For randomization, the candidate might propose user- or account-level assignment for accounts subject to mitigation, while explicitly noting contamination: bots can create new accounts, interact with control users, or adapt after observing enforcement. Metrics should include bot-action reduction, spam reports, authentic engagement, messages received from suspicious accounts, false-positive appeal rates, and retention of legitimate users. A key tradeoff is whether to exclude already-known bots: excluding them may make the experiment safer but hides the treatment effect on the population the system is meant to affect. The candidate should also flag that measured engagement may fall because fake engagement is removed, which is not necessarily bad; quality-adjusted engagement and user reports are better than raw likes or follows. They would close by saying that, with more time, they would add a longer-term holdout to measure adversarial adaptation and segment results by suspicion score, geography, account age, and surface.

A second angle

For the question “Evaluate Instagram's Short-Video Recommender System Success,” the same experimentation concepts apply, but the treatment is a ranking algorithm rather than an enforcement system. The cleanest unit is usually user-level randomization, because each viewer can consistently receive either the existing or new recommender in their feed. The primary metric might be user-level meaningful watch time or satisfied sessions, while guardrails should include hides, reports, “not interested” actions, creator concentration, session length extremes, and downstream retention. The main constraint is distributional: a recommender can improve average watch time by over-serving addictive or low-quality content, so a strong answer discusses percentiles, user cohorts, content diversity, and long-term retention rather than only mean engagement. Unlike bot mitigation, adversarial response is less central, but feedback loops and creator ecosystem effects become more important.
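
To make the distributional point concrete, the sketch below compares the mean, median, and 90th percentile of a user-level metric across arms. It is only an illustration: `distribution_summary`, `compare_arms`, and the toy watch-time values are hypothetical, and a real analysis would also attach confidence intervals to each quantile difference.

```python
from statistics import mean, quantiles


def distribution_summary(values):
    """Summarize a user-level metric by its mean, median, and 90th percentile."""
    # quantiles(..., n=10) returns the 9 decile cut points p10, p20, ..., p90.
    deciles = quantiles(values, n=10, method="inclusive")
    return {"mean": mean(values), "p50": deciles[4], "p90": deciles[8]}


def compare_arms(control, treatment):
    """Return treatment-minus-control differences for each summary statistic."""
    c = distribution_summary(control)
    t = distribution_summary(treatment)
    return {name: t[name] - c[name] for name in c}


# Hypothetical watch time (minutes per user). Treatment differs from control
# only in one heavy user, so the mean moves while the median does not.
control = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28]
treatment = [10, 12, 14, 16, 18, 20, 22, 24, 26, 88]

diff = compare_arms(control, treatment)
print(diff)
```

Here the mean rises while the median stays flat, which suggests the average lift comes from the tail rather than from a typical user. That pattern is the cue to look at cohorts and long-term retention before treating the lift as a win.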

Common pitfalls

Pitfall: Choosing a convenient metric instead of a decision metric.

A tempting answer is “use CTR for ads” or “use watch time for video” without explaining why it represents success. A better answer states the product objective, selects one primary metric tied to that objective, and adds guardrails for user trust, advertiser value, content quality, and long-term retention.

Pitfall: Ignoring interference and exposure.

Many candidates default to user-level randomization even when treatment affects both sides of an interaction. For WhatsApp call reliability, the caller and callee share the same call outcome; for ads, one advertiser’s treatment can change auction prices for others. You should explicitly ask who experiences the treatment and whether one unit’s treatment can affect another unit’s outcome.
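
One way to keep both sides of an interaction in the same arm is to randomize on a pair key instead of a single user ID. The sketch below hashes an order-independent pair key into a bucket. All names here (`assign_arm`, `pair_key`, the salt string) are hypothetical, and pair-level hashing alone does not remove spillovers when users belong to many pairs; it only guarantees that the two parties to one pair agree.

```python
import hashlib


def assign_arm(unit_key, salt="call_reliability_v1", treatment_share=0.5):
    """Deterministically map a randomization unit to an arm via a salted hash."""
    digest = hashlib.sha256(f"{salt}:{unit_key}".encode("utf-8")).hexdigest()
    # Use the first 32 bits of the digest as an approximately uniform [0, 1) bucket.
    bucket = int(digest[:8], 16) / 0x100000000
    return "treatment" if bucket < treatment_share else "control"


def pair_key(user_a, user_b):
    """Build an order-independent key so caller and callee share one assignment."""
    low, high = sorted((str(user_a), str(user_b)))
    return f"{low}|{high}"


# The same pair gets the same arm regardless of who places the call.
arm_ab = assign_arm(pair_key("alice", "bob"))
arm_ba = assign_arm(pair_key("bob", "alice"))
print(arm_ab, arm_ba)
```

Changing the salt per experiment keeps assignments independent across experiments, and analysis should then aggregate outcomes at the pair level to match the randomization unit.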

Pitfall: Over-indexing on statistical significance.

“p-value < 0.05, so launch” is rarely enough. Interviewers expect you to discuss effect size, confidence intervals, novelty effects, multiple metrics, segment heterogeneity, and whether the observed change is practically meaningful. A small statistically significant revenue lift with a large increase in reports or call drops may be a no-launch.
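
The difference between statistical and practical significance can be sketched as a confidence interval plus a pre-agreed decision rule. This is a simplified illustration under assumptions: the normal-approximation interval, the function names `diff_in_means_ci` and `launch_decision`, and every threshold value are hypothetical, and the guardrail is assumed to be oriented so that higher means worse (for example, report rate).

```python
import math
from statistics import NormalDist, mean, variance


def diff_in_means_ci(control, treatment, alpha=0.05):
    """Return (difference, lower, upper) using a normal-approximation interval."""
    diff = mean(treatment) - mean(control)
    se = math.sqrt(
        variance(control) / len(control) + variance(treatment) / len(treatment)
    )
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return diff, diff - z * se, diff + z * se


def launch_decision(primary_ci, min_practical_lift, guardrail_ci, max_guardrail_harm):
    """Apply pre-agreed thresholds to a primary metric and one guardrail."""
    _, primary_low, _ = primary_ci
    _, _, guardrail_high = guardrail_ci
    if guardrail_high > max_guardrail_harm:
        # A guardrail regression beyond tolerance cannot be ruled out.
        return "no-launch or iterate"
    if primary_low >= min_practical_lift:
        return "launch"
    if primary_low > 0:
        # Statistically positive, but possibly below the practical threshold.
        return "ramp or extend"
    return "no-launch or iterate"


# Hypothetical user-level outcomes for the interval helper.
ci = diff_in_means_ci([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])

# Hypothetical intervals: a small significant lift with a flat guardrail.
decision = launch_decision(
    primary_ci=(0.05, 0.01, 0.09),
    min_practical_lift=0.2,
    guardrail_ci=(0.0, -0.01, 0.01),
    max_guardrail_harm=0.05,
)
print(ci, decision)
```

The point of writing the rule down before the experiment is that a significant but tiny lift, or a lift paired with a guardrail regression, maps to a pre-agreed outcome instead of a post-hoc argument.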

Connections

Interviewers often pivot from A/B testing into causal inference, including difference-in-differences, synthetic controls, instrumental variables, and observational evaluation when randomized experiments are infeasible. They may also probe metric design, ranking evaluation, marketplace dynamics in ads auctions, or long-term holdout design for ecosystem effects.
