A/B Testing And Experiment Design

What's being tested

Meta Data Scientists are expected to design experiments that produce credible product and business decisions, not just compute a p-value. These prompts test whether you can define the estimand, choose the randomization unit, specify primary and guardrail metrics, reason about power, and diagnose ambiguous or null results. The interviewer is probing for practical judgment: how you handle interference in social/networked products, ads marketplace tradeoffs, noisy metrics, heterogeneous effects, and launch decisions under uncertainty. A strong answer sounds like an experiment owner who can prevent biased conclusions before data is collected and explain the result clearly after it lands.

Core knowledge

Start with the decision and estimand: define what action the experiment informs and the causal quantity, e.g. average treatment effect $ATE = E[Y(1)-Y(0)]$ for eligible users. For ads, clarify whether the estimand is user welfare, advertiser value, platform revenue, or marketplace efficiency.
Randomization unit must match interference risk. User-level randomization works when one user’s treatment does not affect another’s outcome. In social feeds, messaging, auctions, shops, and creator ecosystems, SUTVA may fail; consider cluster, geo, page, advertiser, or marketplace-level randomization.
Metric hierarchy matters. Pick one primary metric such as revenue_per_user, conversion_rate, watch_time, or purchase_rate; define guardrails like hide_rate, report_rate, latency, retention, advertiser ROAS, and user experience metrics. Avoid declaring success from a post-hoc metric that moved favorably.
Power and sample size should be tied to a minimum detectable effect (MDE). For a two-sample mean comparison with equal allocation, the required sample size per arm is approximately:
$n \approx \frac{2(z_{1-\alpha/2}+z_{1-\beta})^2\sigma^2}{\delta^2}$
where $\delta$ is the MDE in absolute units, $\sigma^2$ is the outcome variance, $\alpha$ is the two-sided significance level, and $1-\beta$ is the target power. For binary metrics, use $\sigma^2=p(1-p)$; for skewed revenue metrics, empirical variance or bootstrap estimates are more reliable.
Variance reduction is often expected at Meta scale. CUPED adjusts outcomes using pre-experiment covariates: $Y_{adj}=Y-\theta(X-\bar X)$, where $\theta=\frac{Cov(Y,X)}{Var(X)}$. With this $\theta$, the variance of $Y_{adj}$ is $Var(Y)(1-\rho_{XY}^2)$, where $\rho_{XY}$ is the correlation between $X$ and $Y$. It increases sensitivity when pre-period behavior strongly predicts post-period outcomes, which is common for DAU, spend, engagement, and purchase metrics.
Clustered or dependent data needs different inference. If users are randomized by cluster, intra-cluster correlation inflates variance by the design effect $DE = 1 + (m-1)\rho$, where $m$ is the average cluster size and $\rho$ is the intra-cluster correlation, so the effective sample size is roughly $n/DE$. Use cluster-robust standard errors or analyze at the cluster level; pretending millions of user rows are independent will overstate significance.
Interference requires exposure modeling. Under network effects, define exposures such as “treated user with at least 30% treated friends” or “control user exposed to treated sellers.” You may estimate direct, indirect, and total effects, but you must state assumptions about how treatment propagates through the graph.
Ads and ranking tests have marketplace externalities. A new shop-ads algorithm can change auction prices, advertiser budgets, organic content distribution, and user engagement. Randomizing users may measure user-side impact but miss advertiser budget reallocation; randomizing advertisers may measure advertiser value but contaminate user experience.
Null results are not automatically failures. A null can mean no effect, underpowered design, instrumentation issues, dilution from weak exposure, heterogeneous effects canceling out, or a metric too far downstream. Check confidence intervals: “we can rule out effects larger than +0.3%” is stronger than “p > 0.05.”
Multiple testing and peeking inflate false positives. If many segments or metrics are tested, use pre-registration, a metric hierarchy, holdouts, or corrections such as Bonferroni (which controls the family-wise error rate) or Benjamini-Hochberg (which controls the false discovery rate). Sequential monitoring is valid only if the stopping rule is accounted for, for example through alpha spending.
Heterogeneous treatment effects should be planned, not mined. Segment by pre-specified cohorts like new vs existing users, high vs low spenders, country, device, or advertiser size. For Meta-style products, treatment may help creators or advertisers while hurting casual users; the launch recommendation should reflect this tradeoff.
Analysis should include diagnostics before interpretation. Check sample ratio mismatch (SRM), pre-period balance, treatment exposure, metric logging sanity, outliers, novelty effects, ramp timing, and day-of-week effects. SRM is especially serious: if assignment is 50/50 but observed traffic is 48/52 at large sample sizes, the imbalance is far too large to be chance, and causal validity is questionable.
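The SRM diagnostic can be sketched as a chi-square goodness-of-fit test on the assignment counts. This is a minimal illustration, not a production check: the function name `srm_check`, the example counts, and the 0.001 alert threshold are assumptions chosen for the example.

```python
import math

def srm_check(n_control, n_treatment, expected_treatment_share=0.5):
    """Chi-square goodness-of-fit test (1 degree of freedom) for SRM.

    Returns (chi2_statistic, p_value). Names and defaults are illustrative.
    """
    total = n_control + n_treatment
    expected_t = total * expected_treatment_share
    expected_c = total - expected_t
    chi2 = ((n_control - expected_c) ** 2 / expected_c
            + (n_treatment - expected_t) ** 2 / expected_t)
    # With 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical counts: a 48/52 split on 100,000 users vs. a near-even split.
chi2_bad, p_bad = srm_check(48_000, 52_000)
chi2_ok, p_ok = srm_check(49_900, 50_100)

ALERT_THRESHOLD = 0.001  # assumed alerting convention, not a universal rule
print("48/52 flagged:", p_bad < ALERT_THRESHOLD)
print("49.9/50.1 flagged:", p_ok < ALERT_THRESHOLD)
```

A flagged SRM usually means the assignment or logging pipeline should be investigated before any metric is interpreted.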

Worked example

For “Design an A/B test for a new shop-ads algorithm,” a strong candidate would first clarify the product change: is the algorithm changing ranking, retrieval, bidding, or targeting, and who is eligible to see shop ads? They would define the decision: launch if the new model improves marketplace value without degrading user experience or advertiser outcomes. The answer can be organized around four pillars: experiment setup, metrics, statistical analysis, and launch interpretation. For setup, they might choose user-level randomization if the main exposure is ad ranking in a user feed, but explicitly flag that advertiser budget competition creates interference, so a geo- or advertiser-level test may be needed for marketplace-level effects. For metrics, they would name a primary metric such as incremental purchase_value_per_user or ads_revenue_per_user, plus guardrails like hide_rate, report_rate, session engagement, advertiser ROAS, and small-advertiser spend concentration. For analysis, they would discuss power based on expected traffic and variance, CUPED using pre-period purchase or ad engagement, and segment checks for new shoppers, heavy shoppers, and advertiser categories. A specific tradeoff to flag: user-level randomization gives high power and clean user experience measurement, but it may underestimate budget reallocation or auction price effects. They would close by saying that, if time allowed, they would add a longer holdout or geo-level validation to capture advertiser budget dynamics and delayed purchase behavior.
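The power discussion in this example can be made concrete with the standard two-sample approximation $n \approx 2(z_{1-\alpha/2}+z_{1-\beta})^2\sigma^2/\delta^2$ per arm, combined with the fact that CUPED with the optimal coefficient scales variance by $1-\rho^2$, where $\rho$ is the correlation between the pre-period covariate and the outcome. The sketch below is illustrative only: the standard deviation, MDE, and correlation are hypothetical inputs, not values from any real shop-ads test.

```python
import math
from statistics import NormalDist

def n_per_arm(sigma, mde, alpha=0.05, power=0.8):
    """Approximate per-arm sample size for a two-sided, equal-allocation test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / mde ** 2)

def cuped_sigma(sigma, rho):
    """Outcome standard deviation after CUPED with the optimal theta.

    Uses Var(Y_adj) = Var(Y) * (1 - rho^2).
    """
    return sigma * math.sqrt(1 - rho ** 2)

# Hypothetical inputs for a purchase_value_per_user style metric.
SIGMA = 20.0   # assumed standard deviation of the outcome
MDE = 0.5      # assumed minimum detectable effect, in absolute units
RHO = 0.6      # assumed correlation with the pre-period covariate

n_raw = n_per_arm(SIGMA, MDE)
n_cuped = n_per_arm(cuped_sigma(SIGMA, RHO), MDE)
print("per-arm n without CUPED:", n_raw)
print("per-arm n with CUPED:   ", n_cuped)
```

Under these assumed inputs, the CUPED-adjusted requirement is about 64% of the unadjusted one, which is the $1-\rho^2$ factor for $\rho = 0.6$.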

A second angle

For “Design and analyze A/B test with interference,” the same experimental toolkit applies, but the core issue shifts from metric selection to causal identification. Instead of assuming each user’s outcome depends only on their own assignment, you need to model exposure through friends, groups, sellers, creators, or shared auctions. A strong answer might propose cluster randomization on graph communities, ego-network designs, or saturation experiments where clusters receive different treatment probabilities. The key difference is that the estimand may be direct effect, spillover effect, or total network effect rather than a simple user-level ATE. The analysis must use cluster-level or exposure-level inference, because independent row-level standard errors would be misleading.
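One way to see why row-level standard errors mislead under cluster randomization is to compute the design effect $1+(m-1)\rho$ and to run the comparison on cluster means instead of user rows. The following is a minimal sketch under stated assumptions: the function names and toy data are hypothetical, and the unweighted mean of cluster means estimates a cluster-average effect, which can differ from a user-weighted effect when cluster sizes vary.

```python
import math
from statistics import mean, variance

def design_effect(avg_cluster_size, icc):
    """Variance inflation from clustering: 1 + (m - 1) * rho."""
    return 1 + (avg_cluster_size - 1) * icc

def cluster_level_diff(treated_clusters, control_clusters):
    """Difference in means and its standard error, analyzed on cluster means.

    Each argument is a list of clusters; each cluster is a list of user outcomes.
    """
    t_means = [mean(c) for c in treated_clusters]
    c_means = [mean(c) for c in control_clusters]
    diff = mean(t_means) - mean(c_means)
    # The cluster, not the user, is the independent unit in the variance.
    se = math.sqrt(variance(t_means) / len(t_means)
                   + variance(c_means) / len(c_means))
    return diff, se

# Hypothetical planning numbers: 1,000,000 users, clusters of 100, ICC 0.05.
de = design_effect(100, 0.05)
effective_n = 1_000_000 / de
print("design effect:", de, "effective sample size:", round(effective_n))

# Toy data with two clusters per arm, purely for illustration.
diff, se = cluster_level_diff([[1, 3], [2, 4]], [[0, 2], [1, 1]])
print("cluster-level diff:", diff, "SE:", se)
```

In practice a real analysis needs many clusters per arm for the cluster-level standard error to be reliable; two clusters are used here only to keep the arithmetic visible.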

Common pitfalls

Pitfall: Treating every experiment as a 50/50 user-level randomized controlled trial.

That answer is tempting because it is simple and often correct for isolated UI changes. It fails for social, ads, commerce, creator, and marketplace systems where one unit’s treatment can affect another unit’s outcome. A better answer says, “I would use user-level randomization if SUTVA is plausible; otherwise I would consider cluster, geo, advertiser, or saturation designs.”

Pitfall: Optimizing for one metric without a metric hierarchy.

Saying “launch if revenue increases significantly” is incomplete for Meta-style decisions. Ads revenue may rise while retention, hide_rate, advertiser ROAS, or content quality worsens. A stronger answer names one primary metric, a small set of guardrails, and the decision rule before looking at results.

Pitfall: Explaining a null result as “the feature does not work.”

A null result could come from insufficient power, low treatment exposure, high variance, heterogeneous effects, or delayed impact. The stronger response is to inspect confidence intervals, exposure rates, pre/post diagnostics, and planned segments, then state whether the experiment rules out a practically meaningful effect.
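The distinction between "no detectable effect" and "no meaningful effect" can be sketched by comparing the confidence interval to a practical threshold. This is an illustrative example using a normal approximation; the function names, summary statistics, and the 0.03 threshold are assumptions.

```python
import math
from statistics import NormalDist

def diff_ci(mean_t, mean_c, var_t, var_c, n_t, n_c, alpha=0.05):
    """Normal-approximation confidence interval for a difference in means."""
    diff = mean_t - mean_c
    se = math.sqrt(var_t / n_t + var_c / n_c)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return diff - z * se, diff + z * se

def interpret(ci, practical_threshold):
    """Classify a result relative to a pre-specified practical threshold."""
    lo, hi = ci
    if lo > 0 or hi < 0:
        return "significant"
    if -practical_threshold < lo and hi < practical_threshold:
        return "null: rules out a practically meaningful effect"
    return "inconclusive: interval still includes meaningful effects"

THRESHOLD = 0.03  # assumed smallest effect worth launching for

# Same hypothetical point estimate, very different sample sizes.
large = diff_ci(10.005, 10.0, 25.0, 25.0, 1_000_000, 1_000_000)
small = diff_ci(10.005, 10.0, 25.0, 25.0, 10_000, 10_000)
print(interpret(large, THRESHOLD))
print(interpret(small, THRESHOLD))
```

Both runs have p > 0.05, but only the larger experiment supports the claim that the feature has no practically meaningful effect.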

Connections

Interviewers may pivot from experiment design into causal inference, especially difference-in-differences, instrumental variables, or propensity methods when randomization is not possible. They may also connect this to metric design, ranking/recommender evaluation, marketplace analytics, or diagnosing anomalies in DAU, revenue, engagement, and conversion funnels.
