A/B Testing and Experiment Design

What's being tested

Experiment design for Meta Data Scientists is about turning ambiguous product or ML changes into a credible causal estimate: what changed, for whom, by how much, and whether the tradeoff is worth shipping. Interviewers are probing whether you can choose the right randomization unit, define primary metrics and guardrails, reason about interference, and diagnose conflicting results without over-claiming. Meta cares because changes to ads ranking, messaging, calling, and notification systems can create large aggregate effects while harming specific user cohorts, advertisers, creators, or network interactions. A strong answer sounds like a launch-readiness memo: causal design first, metrics second, risks and interpretation throughout.

Core knowledge

Hypothesis framing should start with the product mechanism, not the statistical test. For a new ads algorithm, the causal claim might be: “ranking improves ad relevance, increasing advertiser value without degrading user experience.” That maps to CTR, CVR, ROAS, revenue, hide_rate, and session-level engagement.
Randomization unit must match the exposure mechanism. User-level randomization works for individual UI features like pinned unread messages. Request-level randomization can show the same user both variants across requests, which contaminates the experience and blurs user-level outcomes. Call-level randomization can fail for WhatsApp because both caller and callee experience the treatment, so the design needs pair-, cluster-, or user-level assignment rules.
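One common way to keep a user in the same arm across requests is to hash the user ID together with an experiment-specific salt. The sketch below is illustrative only: `assign_arm`, the salt `pinned_unread_v1`, and the 50/50 split are hypothetical names and values, not a description of any production assignment system.

```python
import hashlib


def assign_arm(user_id: str, experiment_salt: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to an arm by hashing (salt, user_id)."""
    digest = hashlib.sha256(f"{experiment_salt}:{user_id}".encode("utf-8")).hexdigest()
    # Interpret the first 8 hex characters (32 bits) as a uniform draw in [0, 1).
    bucket = int(digest[:8], 16) / 2**32
    return "treatment" if bucket < treatment_share else "control"


# Hypothetical user IDs; every request from the same user lands in the same arm.
arms = [assign_arm(f"user_{i}", "pinned_unread_v1") for i in range(10_000)]
treatment_fraction = arms.count("treatment") / len(arms)
print(f"share assigned to treatment: {treatment_fraction:.3f}")
```

Changing the salt per experiment keeps assignments independent across experiments, which is the main reason to hash the pair rather than the user ID alone.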
Eligibility and exposure are distinct. Eligible users are those who could receive treatment; exposed users actually encounter it. Analyze primary effects on an intent-to-treat basis to preserve randomization, then use exposure-triggered or complier average causal effect analysis as secondary evidence when treatment exposure is sparse.
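The relationship between the intent-to-treat effect and the complier average causal effect can be sketched with the standard ratio (Wald) estimator: divide the ITT effect by the difference in exposure rates between arms. This assumes assignment affects outcomes only through exposure and that nobody is exposed because they were assigned to control instead of treatment. The function names and all numbers below are hypothetical.

```python
def itt_effect(mean_treatment: float, mean_control: float) -> float:
    """Intent-to-treat effect: compare everyone as randomized, exposed or not."""
    return mean_treatment - mean_control


def cace_estimate(itt: float, exposure_rate_treatment: float,
                  exposure_rate_control: float = 0.0) -> float:
    """Ratio estimator: scale the ITT effect by the gap in exposure rates."""
    uptake_gap = exposure_rate_treatment - exposure_rate_control
    if uptake_gap <= 0:
        raise ValueError("treatment arm must have a higher exposure rate than control")
    return itt / uptake_gap


# Hypothetical numbers: only 20% of assigned users ever encounter the feature.
itt = itt_effect(mean_treatment=10.4, mean_control=10.0)
cace = cace_estimate(itt, exposure_rate_treatment=0.20)
print(f"ITT effect: {itt:.2f}, CACE: {cace:.2f}")
```

With sparse exposure the ITT effect is diluted, so the CACE is larger in magnitude but also noisier; that is why it belongs in the secondary analysis.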
Primary metric selection should reflect the decision criterion. Ads experiments often need a marketplace scorecard: user metrics like ad_hide_rate, advertiser metrics like CPA or ROAS, platform metrics like revenue, and auction health metrics like impressions or bid_density. Pick one primary metric before launch to avoid cherry-picking.
Guardrail metrics protect against local optimization. A model may improve CTR by showing clickbait ads while hurting long-term retention, increasing negative_feedback, or shifting spend away from small advertisers. For calling reliability, guardrails may include call_attempts, call_duration, battery_usage, data_usage, and crash rate.
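Power and sample size calculations need five inputs: the baseline rate, the metric variance, the minimum detectable effect, the significance level, and the target power. For a two-sample comparison of means with equal allocation, a rough per-arm sample size is:
$n \approx \frac{2\sigma^2(z_{1-\alpha/2}+z_{1-\beta})^2}{\Delta^2}$
Here $\sigma^2$ is the per-unit variance of the metric, $\Delta$ is the minimum detectable absolute difference, $\alpha$ is the significance level, and $1-\beta$ is the power. For binary metrics, use $\sigma^2=p(1-p)$ with $p$ the baseline rate, and check whether rare events require longer runtime or aggregation.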
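The formula above can be evaluated directly with the standard library. This is a minimal sketch: `sample_size_per_arm` is a hypothetical helper, and the 10% baseline rate and one-percentage-point detectable difference are made-up inputs chosen only to show the arithmetic.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(sigma2: float, delta: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate units per arm for a two-sided, two-sample comparison of means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # z_{1 - alpha/2}
    z_beta = NormalDist().inv_cdf(power)           # z_{1 - beta}
    return math.ceil(2 * sigma2 * (z_alpha + z_beta) ** 2 / delta ** 2)


# Hypothetical binary metric: baseline rate p = 10%, detect an absolute change of 1 point.
p = 0.10
n_one_point = sample_size_per_arm(sigma2=p * (1 - p), delta=0.01)
n_half_point = sample_size_per_arm(sigma2=p * (1 - p), delta=0.005)
print(n_one_point, n_half_point)
```

Because $\Delta$ enters squared, halving the detectable effect roughly quadruples the required sample, which is usually the fastest way to sanity-check a proposed runtime in an interview.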
Clustered observations matter when users generate repeated events. Treating 10,000 calls from 1,000 users as independent overstates precision. Use user-level aggregation, cluster-robust standard errors, or mixed models; the effective sample size is closer to the number of independent users or clusters than the number of events.
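A quick way to put a number on "overstates precision" is the Kish design effect, $1 + (m - 1)\rho$, where $m$ is the number of events per cluster and $\rho$ is the intra-cluster correlation. The sketch below assumes equal cluster sizes, and the value $\rho = 0.3$ is a made-up illustration, not an estimate for any real calling data.

```python
def design_effect(events_per_cluster: float, icc: float) -> float:
    """Kish design effect for equal-sized clusters: 1 + (m - 1) * rho."""
    return 1 + (events_per_cluster - 1) * icc


def effective_sample_size(n_events: int, events_per_cluster: float, icc: float) -> float:
    """Number of independent observations the clustered events are worth."""
    return n_events / design_effect(events_per_cluster, icc)


# 10,000 calls from 1,000 users means 10 calls per user; rho = 0.3 is hypothetical.
n_eff = effective_sample_size(n_events=10_000, events_per_cluster=10, icc=0.3)
print(f"effective sample size: {n_eff:.0f}")
```

At $\rho = 0$ the calls really are independent and all 10,000 count; at $\rho = 1$ each user contributes one observation's worth of information, so the effective sample size falls to the 1,000 users.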
Interference violates the stable unit treatment value assumption (SUTVA) behind standard A/B analysis: one user’s treatment affects another’s outcome. Messaging, calling, groups, marketplaces, and ads auctions all have this risk. Use cluster randomization, ego-network designs, geo experiments, or switchback tests when spillovers are material.
Ads experiments have marketplace-specific complications: budget constraints, auction competition, pacing, learning systems, and advertiser strategic behavior. Randomizing users may change impression opportunities; randomizing advertisers may change auction pressure. A clean design often combines user-side randomization with advertiser-level diagnostics and budget-normalized metrics.
Sequential monitoring must be planned. Repeatedly checking p_value < 0.05 and stopping at the first significant result inflates false positives. Use fixed-horizon testing, alpha-spending, group sequential methods, or always-valid inference. If the interviewer asks about early stopping, say whether it is for safety, futility, or success and how error rates are controlled.
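The inflation from unplanned peeking is easy to demonstrate by simulation. The sketch below simulates experiments with no true effect, looks at the cumulative z-statistic after each of five equal-sized batches, and counts how often any look crosses the threshold. The Bonferroni-per-look threshold is included only as a simple, conservative stand-in; it is not an alpha-spending or group sequential boundary. All names and settings are hypothetical.

```python
import math
import random
from statistics import NormalDist


def false_positive_rate(n_looks: int, z_crit: float,
                        n_sims: int = 20_000, seed: int = 0) -> float:
    """Share of null experiments that cross |z| > z_crit at any interim look."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        cumulative = 0.0
        for look in range(1, n_looks + 1):
            # Under the null, each equal-sized batch adds an independent N(0, 1) increment.
            cumulative += rng.gauss(0.0, 1.0)
            z = cumulative / math.sqrt(look)
            if abs(z) > z_crit:
                hits += 1
                break  # the experimenter stops at the first "significant" look
    return hits / n_sims


alpha, looks = 0.05, 5
z_fixed = NormalDist().inv_cdf(1 - alpha / 2)
z_bonferroni = NormalDist().inv_cdf(1 - alpha / (2 * looks))

fixed_horizon = false_positive_rate(1, z_fixed)
naive_peeking = false_positive_rate(looks, z_fixed)
bonferroni_peeking = false_positive_rate(looks, z_bonferroni)
print(fixed_horizon, naive_peeking, bonferroni_peeking)
```

In this setup the naive peeking rate lands well above the nominal 5%, while the single-look and Bonferroni versions stay at or below it. Proper group sequential boundaries recover some of the power that Bonferroni gives up.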
Heterogeneous treatment effects are often as important as the average. Segment by country, device class, network quality, new versus existing users, advertiser size, campaign objective, or baseline engagement. Pre-specify critical segments; use shrinkage or false discovery control when scanning many cuts.
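For the "false discovery control when scanning many cuts" step, the Benjamini-Hochberg procedure is one standard option. The sketch below is a plain-Python version of it, and the ten p-values are invented to stand in for ten segment-level tests.

```python
def benjamini_hochberg(p_values: list[float], q: float = 0.05) -> list[bool]:
    """Return a reject flag per hypothesis, controlling the false discovery rate at q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose sorted p-value is at most k * q / m.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            max_rank = rank
    # Reject every hypothesis ranked at or below that cutoff.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


# Hypothetical p-values from ten segment cuts (country, device class, and so on).
segment_p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
flags = benjamini_hochberg(segment_p_values, q=0.05)
print(flags)
```

Five of these cuts would pass an uncorrected 0.05 threshold, but only the two smallest survive the correction, which is the kind of gap worth calling out when an interviewer asks about segment-level wins.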
Decision-making should separate statistical significance from practical significance. A +0.05% lift in revenue may be huge at Meta scale, while a statistically significant +0.5% in CTR may not ship if it worsens hide_rate, advertiser ROI, or long-term user engagement. Tie the final call to expected impact and risk.

Worked example

For “Design an A/B test for WhatsApp call reliability,” a strong candidate would first clarify what “reliability” means: call setup success, drop rate, audio quality, latency, reconnect success, or user-perceived quality. They would also ask whether the change is client-side, server-side routing, codec selection, or network fallback, because that determines exposure and interference. The answer can be organized around five pillars: randomization, metrics, statistical unit, rollout, and diagnostics.

A reasonable design might randomize at the user level, but define treatment exposure carefully for calls where one party is treatment and the other is control. The candidate should flag that a call is a two-sided interaction: if only the caller receives the new reliability stack, the callee may still affect call quality, so analysis needs caller/callee treatment-status buckets such as T-T, T-C, C-T, and C-C. Primary metrics could include call_setup_success_rate and unexpected_drop_rate, with guardrails like call_attempts, call_duration, data_usage, battery_usage, and crash rate.
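The caller/callee bucketing can be sketched as a small aggregation. The call-log layout (caller, callee, setup succeeded), the user IDs, and the outcomes below are all invented for illustration; a real analysis would also need user- or pair-level clustering when computing uncertainty.

```python
from collections import defaultdict


def setup_success_by_bucket(calls, arm_of):
    """Call setup success rate per caller-callee treatment bucket (T-T, T-C, C-T, C-C)."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for caller, callee, setup_ok in calls:
        bucket = f"{arm_of[caller]}-{arm_of[callee]}"
        totals[bucket] += 1
        successes[bucket] += int(setup_ok)
    return {bucket: successes[bucket] / totals[bucket] for bucket in totals}


# Hypothetical user-level assignment and call log.
arm_of = {"u1": "T", "u2": "T", "u3": "C", "u4": "C"}
calls = [
    ("u1", "u2", True),   # T-T
    ("u1", "u3", True),   # T-C
    ("u1", "u3", False),  # T-C
    ("u3", "u2", True),   # C-T
    ("u3", "u4", False),  # C-C
    ("u4", "u3", True),   # C-C
]
rates = setup_success_by_bucket(calls, arm_of)
print(rates)
```

Comparing T-T against C-C gives the cleanest contrast, while the mixed T-C and C-T buckets indicate whether the benefit depends on one side or both sides running the new stack.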

The key tradeoff is between clean causal isolation and operational feasibility. Pair-level or cluster-level randomization reduces contamination but typically needs a larger sample because there are fewer independent units; user-level randomization is easier but requires careful interpretation of mixed-treatment calls. A strong close would say: “I would ship only if reliability improves on the pre-specified primary metric, guardrails are neutral, and gains hold in low-connectivity cohorts; with more time, I’d examine heterogeneous effects by country, OS, app version, and network type.”

A second angle

For “Design an A/B test for a new shop-ads algorithm,” the same causal framework applies, but the constraints shift from two-sided communication to marketplace dynamics. The randomization unit might be users, advertisers, auctions, or geos, and each choice answers a slightly different question. User-level randomization estimates the impact on user experience and platform revenue, but advertiser budgets and auction competition can create interference across treatment and control inventory. The candidate should discuss metrics across multiple stakeholders: CTR, CVR, ROAS, CPA, revenue, ad_hide_rate, and advertiser spend distribution. Unlike calling reliability, the hardest issue is often not repeated measures alone but equilibrium effects from auctions, budgets, and ranking feedback loops.

Common pitfalls

Pitfall: Treating every event as independent.

A tempting but weak answer is, “We have millions of impressions or calls, so power is not an issue.” That ignores repeated users, clustered sessions, and network effects. A better answer aggregates or models at the user, advertiser, call-pair, geo, or cluster level depending on the exposure mechanism.

Pitfall: Optimizing one metric without a decision framework.

For ads, saying “ship if CTR goes up” is incomplete because higher click-through can come from lower-quality ads or worse advertiser ROI. For reliability, saying “ship if duration increases” may be wrong if longer calls reflect failed reconnect loops. Always define a primary metric, guardrails, and the interpretation of metric movement.

Pitfall: Communicating a statistical recipe instead of an experiment plan.

Interviewers are not looking for a generic “run a t-test at 95% confidence” response. They want to hear assumptions, randomization, exposure, power, interference, segmentation, and launch criteria. Lead with the causal design, then mention the test statistic as an implementation detail.

Connections

This topic often pivots into causal inference, especially difference-in-differences, matching, instrumental variables, or synthetic controls when randomized experiments are infeasible. It also connects to metric design, ranking evaluation, marketplace experimentation, and network effects in social products. For ML-heavy prompts, be ready to bridge offline model metrics like AUC, NDCG, or calibration with online business and user-experience metrics.
