A/B Testing And Experiment Analysis

What's being tested

Interviewers are probing whether you can reason from a product decision to a statistically valid experiment, not just define p-values or confidence intervals. For Meta Data Scientists, experimentation is central because ranking changes, notifications, ads, creator tools, integrity interventions, and onboarding flows all affect billions of user experiences with complex network effects. The skill being tested is your ability to choose metrics, design randomization, detect validity threats, interpret ambiguous results, and make a launch recommendation under business and product constraints. Strong answers show causal thinking, product judgment, and comfort with messy real-world experimentation systems.

Core knowledge

The basic estimand in most online experiments is the average treatment effect, usually estimated by the difference in sample means:
$\hat{\tau}=\bar{Y}_T-\bar{Y}_C$
with standard error
$SE(\hat{\tau})=\sqrt{\frac{s_T^2}{n_T}+\frac{s_C^2}{n_C}}.$
Be clear whether the effect is absolute, relative, per-user, per-session, or per-impression.
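As a minimal sketch of the two formulas above, the following computes the difference in means, its unpooled standard error, and a normal-approximation confidence interval. The function name and the toy data are hypothetical, and the normal approximation assumes far larger samples than this five-user illustration.

```python
import math
from statistics import NormalDist, mean, variance

def diff_in_means(treatment, control, alpha=0.05):
    """Return (tau_hat, se, ci_low, ci_high) for the difference in means."""
    tau_hat = mean(treatment) - mean(control)
    # Unpooled standard error: sqrt(s_T^2 / n_T + s_C^2 / n_C).
    se = math.sqrt(variance(treatment) / len(treatment)
                   + variance(control) / len(control))
    # Two-sided critical value from the standard normal distribution.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return tau_hat, se, tau_hat - z * se, tau_hat + z * se

# Hypothetical per-user outcomes (for example, sessions per user).
treatment = [3.0, 5.0, 4.0, 6.0, 7.0]
control = [2.0, 4.0, 3.0, 5.0, 6.0]
tau_hat, se, ci_low, ci_high = diff_in_means(treatment, control)
```

The estimate here is an absolute per-user effect; dividing by the control mean would give the relative effect.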
Start with the hypothesis and decision rule, not the test statistic. A strong framing is: “If this feature increases meaningful engagement without harming retention, integrity, latency, or revenue, we launch.” Meta cares about metric tradeoffs because optimizing clicks alone can degrade long-term ecosystem quality.
Choose the right randomization unit. User-level randomization is common for feed, notifications, and social products; session-level may work for stateless UI changes; cluster-level may be needed for network effects. Choosing too fine a unit leaves the estimate exposed to interference, because treated users affect control users through sharing, messaging, comments, or recommendations.
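One common way to implement a randomization unit is deterministic hashing of the unit's identifier, so that the same unit always lands in the same arm. The sketch below is an illustration under that assumption, not a description of any particular experimentation platform; `assign_arm`, the salt string, and the ID formats are hypothetical.

```python
import hashlib

def assign_arm(unit_id, experiment_salt, treatment_share=0.5):
    """Deterministically map a randomization unit to an arm."""
    key = f"{experiment_salt}:{unit_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Use the first 32 bits of the hash as a uniform bucket in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "treatment" if bucket < treatment_share else "control"

# User-level randomization: hash the user ID.
user_arm = assign_arm("user_123", "feed_ranking_v2")

# Cluster-level randomization: hash the cluster ID, so every user
# in the same (hypothetical) cluster shares one arm.
cluster_arm = assign_arm("cluster_42", "feed_ranking_v2")
```

Switching from user-level to cluster-level assignment only changes which identifier is hashed, but the analysis must then treat the cluster as the unit when computing variance.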
Define a primary metric, secondary diagnostics, and guardrails. Examples: primary could be 7-day retained users, meaningful social interactions, stories created, or ad revenue per user. Guardrails might include hide/report rate, unfollows, time spent beyond healthy thresholds, app crashes, p95 latency, notification opt-outs, or advertiser ROI.
Power analysis prevents overinterpreting noise and running tests too small to detect the effect you care about. For a two-sided test, the approximate sample size per group is
$n \approx \frac{2\sigma^2(z_{1-\alpha/2}+z_{1-\beta})^2}{\delta^2},$
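where $\delta$ is the minimum detectable effect, $\sigma^2$ is the outcome variance, $\alpha$ is the significance level, and $1-\beta$ is the desired power. Tiny effect sizes at Meta scale can be statistically significant but not practically meaningful.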
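The sample-size approximation above can be sketched directly. The function name and the example inputs (unit standard deviation, a 0.1 minimum detectable effect) are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist

def sample_size_per_group(sigma, delta, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-sample z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # z_{1 - alpha/2}
    z_beta = NormalDist().inv_cdf(power)           # z_{1 - beta}
    n = 2 * sigma ** 2 * (z_alpha + z_beta) ** 2 / delta ** 2
    return math.ceil(n)

# Hypothetical inputs: outcome standard deviation 1.0, MDE 0.1.
n_base = sample_size_per_group(sigma=1.0, delta=0.1)
# Halving the MDE roughly quadruples the required sample size.
n_half_mde = sample_size_per_group(sigma=1.0, delta=0.05)
```

Because $n$ scales with $1/\delta^2$, the choice of minimum detectable effect usually matters more than the choice between 80% and 90% power.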
Ratio metrics need care. Metrics like CTR, revenue per impression, or comments per viewer are not simple per-user averages when the denominator unit (impressions, views) differs from the randomization unit. Prefer user-level aggregation, the delta method, bootstrap, or linearization: for ratio $R=\frac{\sum X}{\sum Y}$, analyze the per-user quantity $X_i-R_0Y_i$, where $R_0$ is a reference ratio such as the control-group value, to estimate variance more reliably.
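A minimal sketch of the delta-method variance for a ratio metric, assuming per-user numerators and denominators (for example clicks and impressions) and using the group's own ratio as the linearization point. The function name and the data are hypothetical.

```python
import math
from statistics import mean, variance

def ratio_with_delta_se(x, y):
    """Return (R, se) for R = sum(x) / sum(y) with users as the unit."""
    n = len(x)
    r = sum(x) / sum(y)
    y_bar = mean(y)
    # Linearized per-user values: (X_i - R * Y_i) / mean(Y).
    linearized = [(xi - r * yi) / y_bar for xi, yi in zip(x, y)]
    # Delta-method standard error of the ratio.
    se = math.sqrt(variance(linearized) / n)
    return r, se

# Hypothetical per-user clicks and impressions.
clicks = [1, 0, 3, 2]
impressions = [10, 5, 20, 15]
ctr, ctr_se = ratio_with_delta_se(clicks, impressions)
```

Treating each impression as an independent observation would ignore within-user correlation and typically understate the standard error; the linearized values keep the user as the unit of analysis.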
Check sample ratio mismatch (SRM) before trusting results. If the expected allocation is 50/50, run a chi-square test on the observed counts rather than judging the split by eye. SRM can indicate logging bugs, eligibility leakage, bot filtering differences, ramp issues, or treatment affecting whether users are counted.
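The SRM check can be sketched as a one-degree-of-freedom chi-square goodness-of-fit test. The function name and the counts below are hypothetical; the closed form for the p-value uses the fact that a chi-square variable with one degree of freedom is a squared standard normal.

```python
import math

def srm_test(n_treatment, n_control, expected_treatment_share=0.5):
    """Return (chi2, p_value) for a two-arm sample ratio mismatch check."""
    total = n_treatment + n_control
    expected_t = total * expected_treatment_share
    expected_c = total - expected_t
    chi2 = ((n_treatment - expected_t) ** 2 / expected_t
            + (n_control - expected_c) ** 2 / expected_c)
    # Survival function of chi-square with 1 dof: erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical counts: a 50.5% / 49.5% split on 100,000 users.
chi2, p_value = srm_test(50_500, 49_500)
```

At large sample sizes even a one-percentage-point imbalance produces a very small p-value, which is why teams often use a strict threshold for SRM alerts and then investigate the cause rather than simply reweighting.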
Sequential peeking inflates false positives. If teams inspect results daily and stop when $p<0.05$, the Type I error rate rises above the nominal 5%. Use pre-registered fixed horizons, alpha spending, group sequential methods, always-valid p-values, or Bayesian decision frameworks if continuous monitoring is operationally required.
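A small A/A simulation illustrates the inflation. This is a sketch under simplifying assumptions: both arms draw from the same standard normal distribution, the variance is treated as known, and the number of looks, batch size, and seed are arbitrary choices.

```python
import math
import random
from statistics import NormalDist

def simulate_peeking(n_sims=400, n_looks=10, batch=50, alpha=0.05, seed=0):
    """Compare false positive rates: final look only vs. stop at any look."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    fixed_rejections = 0
    peeking_rejections = 0
    for _sim in range(n_sims):
        sum_t = sum_c = 0.0
        n = 0
        z = 0.0
        rejected_at_any_look = False
        for _look in range(n_looks):
            # Both arms share the same distribution, so any rejection is false.
            sum_t += sum(rng.gauss(0.0, 1.0) for _ in range(batch))
            sum_c += sum(rng.gauss(0.0, 1.0) for _ in range(batch))
            n += batch
            z = (sum_t / n - sum_c / n) / math.sqrt(2.0 / n)
            if abs(z) > z_crit:
                rejected_at_any_look = True
        fixed_rejections += abs(z) > z_crit        # decision at the horizon only
        peeking_rejections += rejected_at_any_look  # decision at first crossing
    return fixed_rejections / n_sims, peeking_rejections / n_sims

fixed_rate, peeking_rate = simulate_peeking()
```

The fixed-horizon rate should sit near the nominal level, while the stop-at-any-look rate is higher by construction, since every run that rejects at the final look also counts as a rejection under peeking.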
Multiple testing matters when slicing by country, device, age, creator type, or metric family. Bonferroni is conservative; Benjamini-Hochberg controls false discovery rate. For Meta-scale analyses, emphasize a primary metric plus pre-specified heterogeneity checks rather than “shopping” for significant segments.
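To make the contrast concrete, here is a minimal sketch of the Benjamini-Hochberg step-up procedure next to a Bonferroni cutoff. The function name and the segment p-values are hypothetical.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans marking hypotheses rejected at FDR level q."""
    m = len(p_values)
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= k * q / m.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            max_rank = rank
    # Reject every hypothesis ranked at or below that k.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected

# Hypothetical p-values from five pre-specified segments.
segment_p_values = [0.039, 0.5, 0.001, 0.025, 0.012]
bh_rejected = benjamini_hochberg(segment_p_values, q=0.05)
bonferroni_rejected = [p <= 0.05 / len(segment_p_values) for p in segment_p_values]
```

On these inputs Bonferroni rejects only the smallest p-value while Benjamini-Hochberg rejects four of the five, which reflects the different error criteria: family-wise error rate versus false discovery rate.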
Variance reduction can materially improve sensitivity. CUPED uses a pre-experiment covariate $X$ to adjust outcomes:
$Y_i' = Y_i - \theta(X_i-\bar{X}), \quad \theta=\frac{\mathrm{Cov}(Y,X)}{\mathrm{Var}(X)}.$
This works well when past behavior predicts future behavior, such as historical sessions, impressions, or revenue.
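The CUPED adjustment above can be sketched as follows. The function name and data are hypothetical; in practice $\theta$ is usually estimated once on pooled treatment and control data and then applied to both arms, which this single-sample illustration does not show.

```python
from statistics import mean, variance

def cuped_adjust(y, x):
    """Return (adjusted_y, theta) using pre-experiment covariate x."""
    n = len(y)
    x_bar, y_bar = mean(x), mean(y)
    # Sample covariance of Y and X.
    cov_yx = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    theta = cov_yx / variance(x)
    # Y_i' = Y_i - theta * (X_i - mean(X)); the mean of Y is unchanged.
    adjusted = [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)]
    return adjusted, theta

# Hypothetical data: x is pre-period sessions, y is in-experiment sessions.
pre_period = [1.0, 2.0, 3.0, 4.0, 5.0]
outcome = [2.1, 3.9, 6.2, 7.8, 10.0]
adjusted_outcome, theta = cuped_adjust(outcome, pre_period)
```

The adjusted outcome keeps the same mean but has lower variance when the covariate is predictive; the variance reduction is approximately the squared correlation between $X$ and $Y$. The covariate must be measured before treatment starts so that it cannot be affected by the experiment.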
Interpret null, mixed, and delayed effects carefully. A null may mean no effect, low power, metric dilution, poor exposure, or wrong population. Novelty effects can cause short-term spikes; learning effects may emerge later. Retention and ecosystem-quality metrics often need longer windows than click metrics.
Real experiments require launch judgment. A statistically significant +0.1% increase in comments may not launch if reports, spam, or latency worsen. Conversely, a neutral primary metric with strong guardrails and strategic value may justify ramping to a holdout or longer-term test.

Worked example

For “Design an A/B test for a new Facebook News Feed ranking feature,” a strong candidate would first clarify the product goal: is the model intended to increase meaningful engagement, reduce low-quality content, improve creator distribution, or increase time spent? They would also ask about exposure: does every feed viewer receive the new ranking, or only users with enough inventory for the model to matter? The answer should be organized around five pillars: hypothesis, unit of randomization, metrics, validity checks, and launch decision. The candidate might propose user-level randomization, with a primary metric such as meaningful interactions per daily active user or 7-day retention, plus guardrails like hides, reports, unfollows, misinformation prevalence, latency, and ad revenue. They should explicitly call out interference: feed ranking changes can affect creators and friends, so a standard user-level A/B test cannot measure network spillovers and may give a biased estimate of the full-launch effect. One design tradeoff is speed versus validity: a 1% ramp gives an early safety signal but may be underpowered for retention or rare integrity harms, while a larger ramp increases risk if the ranking model is bad. The candidate should mention SRM checks, exposure logging, pre-period balance, and segment cuts by country, platform, and new versus mature users. They would close by saying that if results are positive on the primary metric, neutral or positive on guardrails, and robust across key segments, they would ramp gradually while maintaining a long-term holdout. If they had more time, they would add long-term ecosystem analysis, creator-side effects, and possibly cluster-level testing to measure spillovers.

A second angle

For “Evaluate whether a new notification feature should be launched,” the same experimentation logic applies, but the constraints shift toward fatigue, opt-outs, and repeated exposure. The primary metric might be incremental sessions or meaningful actions generated per user, but guardrails become especially important: notification disable rate, app uninstalls, spam reports, negative feedback, and long-term retention. Randomization should likely be user-level because notification policies persist over time and affect user habits. The candidate should also discuss frequency capping and heterogeneous treatment effects, since a notification may help dormant users but annoy highly active users. Even more than in a feed ranking test, short-term lift can be misleading if it borrows attention from future sessions or other channels.

Common pitfalls

A common analytical mistake is treating every metric movement as causal and equally important. For example, saying “CTR increased, so launch” is weak if the feature also increases hides, reports, or notification opt-outs. A better answer identifies a primary success metric, interprets secondary metrics diagnostically, and weighs practical significance against user harm.

A common communication mistake is jumping straight into formulas before defining the product decision. Interviewers want to hear how the experiment informs a launch, ramp, rollback, or iteration decision. Start with the business goal and hypothesis, then introduce statistical machinery as the way to make that decision reliable.

A common depth mistake is ignoring interference and logging validity. At Meta, users are connected, content producers respond to distribution, and ranking or sharing changes can spill over between treatment and control. Strong candidates proactively mention SRM, exposure logging, treatment contamination, network effects, and whether cluster randomization or long-term holdouts are needed.

Connections

Interviewers may pivot from experimentation into causal inference, especially difference-in-differences, instrumental variables, propensity score methods, or synthetic controls when randomized tests are infeasible. They may also push on metric design, ranking systems, heterogeneous treatment effects, marketplace dynamics, or product analytics tradeoffs between engagement, retention, revenue, and integrity.