A/B Testing Design And Launch Decisions

What's being tested

Interviewers are probing whether you can design an experiment that produces a trustworthy launch decision under real product constraints, not whether you can recite “control versus treatment.” At Meta, many changes affect billions of user sessions, ranking systems, ads delivery, creator ecosystems, and network interactions, so a Data Scientist must reason about causality, metrics, tradeoffs, and operational risk. The core skill is translating an ambiguous product change into a valid randomized test, choosing success and guardrail metrics, diagnosing experiment health, and making a launch recommendation despite noisy or conflicting evidence.

Core knowledge

Start by defining the decision, not the test. Clarify whether the goal is to make a launch/no-launch call, estimate impact, compare variants, tune a parameter, or detect harm. A Feed ranking change, for example, may optimize engagement but must also protect user satisfaction, integrity, ads revenue, and latency.
Choose the randomization unit to match the causal estimand. User-level randomization is common for product UX changes; session-level randomization can increase power but risks contamination, because the same user can see both experiences; cluster-level randomization may be needed for social or network effects. For marketplace or social graph features, interference violates the stable unit treatment value assumption (SUTVA): one user’s treatment can affect another user’s outcome.
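
One common way to get stable assignment at any of these units is to hash the unit ID together with an experiment-specific salt. The sketch below is illustrative only: `assign_arm`, the experiment name `feed_ranking_v2`, and the ID format are hypothetical, and real platforms may work differently.

```python
import hashlib

def assign_arm(unit_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a randomization unit to an arm (hypothetical helper)."""
    # Salting with the experiment name keeps assignments independent across experiments.
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode("utf-8")).hexdigest()
    # Interpret the first 8 hex digits as a number in [0, 1).
    bucket = int(digest[:8], 16) / 16**8
    return "treatment" if bucket < treatment_share else "control"

# The unit can be a user ID, a session ID, or a cluster ID.
arms = [assign_arm(f"user_{i}", "feed_ranking_v2") for i in range(10_000)]
share = arms.count("treatment") / len(arms)  # should land close to 0.5
```

Passing a cluster ID instead of a user ID puts every member of a cluster in the same arm, which is the mechanical difference between user-level and cluster-level randomization.
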
Define a primary metric before the experiment starts. Good primary metrics are sensitive, interpretable, hard to game, and aligned with long-term product value. Candidate metrics, some of which fit better as guardrails than as primary metrics, include DAU/WAU retention, sessions per user, meaningful interactions, creator posting, revenue per user, negative feedback rate, integrity reports, hide/unfollow actions, crash rate, and p95 latency.
Separate success metrics from guardrails. A change may increase time spent but also increase misinformation reports or notification opt-outs. A strong launch decision uses an objective function like “launch if primary metric improves by at least X and no guardrail regresses by more than Y,” rather than cherry-picking statistically significant wins.
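
A pre-specified rule of this kind can be written down as code before the test starts. The following is a minimal sketch under assumptions that are not from the text: every metric is expressed as a confidence interval on relative change, guardrails are oriented so that negative means worse, and the names `launch_decision`, `d7_retention`, and `crash_free_rate` are hypothetical.

```python
from typing import Dict, Tuple

Interval = Tuple[float, float]  # (lower, upper) confidence bounds on a relative change

def launch_decision(primary_ci: Interval, min_lift: float,
                    guardrail_cis: Dict[str, Interval],
                    max_regression: Dict[str, float]) -> str:
    """Apply 'primary improves by at least X, no guardrail regresses by more than Y'."""
    # A guardrail is flagged when its interval cannot rule out a regression
    # larger than the tolerance Y for that metric.
    breached = sorted(name for name, (low, _high) in guardrail_cis.items()
                      if low < -max_regression[name])
    if breached:
        return "no launch: guardrail risk in " + ", ".join(breached)
    low, high = primary_ci
    if low >= min_lift:
        return "launch"
    if high < min_lift:
        return "no launch: primary lift below threshold"
    return "inconclusive: extend or iterate"

decision = launch_decision(
    primary_ci=(0.010, 0.030), min_lift=0.005,
    guardrail_cis={"d7_retention": (-0.002, 0.004), "crash_free_rate": (-0.001, 0.001)},
    max_regression={"d7_retention": 0.005, "crash_free_rate": 0.002},
)
```

Requiring the whole primary interval to clear X is a deliberately conservative choice; a team could instead require only that the point estimate exceeds X and the interval excludes zero. What matters is that the rule is fixed before results are read.
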
Know the basic power equation. For a two-sample mean comparison with equal allocation, approximate sample size per arm is:
$n \approx \frac{2\sigma^2(z_{1-\alpha/2}+z_{1-\beta})^2}{\delta^2}$
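where $\sigma^2$ is the per-unit outcome variance, $\alpha$ is the two-sided significance level, $1-\beta$ is the target power, and $\delta$ is the minimum detectable effect in the metric’s own units. Lower-variance metrics, variance reduction such as CUPED, and more exposed users (often gained through longer duration) improve power; rare events such as harmful-content reports, and small shifts in binary metrics such as 7-day retention, often require much larger samples.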
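
The formula above translates directly into a few lines of code. This is a minimal sketch using the normal approximation; the function name `sample_size_per_arm` and the example inputs (standard deviation 10, minimum detectable effect 0.5) are illustrative assumptions.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_arm(sigma: float, mde: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate n per arm for a two-sided, two-sample comparison of means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # z_{1 - alpha/2}
    z_beta = NormalDist().inv_cdf(power)           # z_{1 - beta}
    return ceil(2 * sigma**2 * (z_alpha + z_beta) ** 2 / mde**2)

n_base = sample_size_per_arm(sigma=10.0, mde=0.5)    # roughly 6,300 per arm
n_half = sample_size_per_arm(sigma=10.0, mde=0.25)   # about four times n_base
```

Because $\delta$ enters squared, halving the minimum detectable effect roughly quadruples the required sample, which is why small expected lifts drive long test durations.
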
Use variance reduction where appropriate. CUPED adjusts outcomes using pre-experiment covariates:
$Y' = Y - \theta(X-\bar X), \quad \theta = \frac{\operatorname{Cov}(Y,X)}{\operatorname{Var}(X)}$
This can reduce variance substantially for stable user-level metrics like sessions, impressions, or revenue, but it should use covariates measured before treatment to avoid post-treatment bias.
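
The adjustment can be sketched in pure Python as follows. This assumes `y` and `x` are aligned per-user lists pooled across both arms, with `x` measured before the experiment; the function name `cuped_adjust` and the toy numbers are hypothetical.

```python
from statistics import fmean, pvariance

def cuped_adjust(y, x):
    """Return CUPED-adjusted outcomes and the estimated theta.

    y: in-experiment metric per user; x: the same user's pre-experiment covariate.
    Assumes x is not constant, so Var(X) > 0.
    """
    x_bar, y_bar = fmean(x), fmean(y)
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    var_x = sum((xi - x_bar) ** 2 for xi in x)
    theta = cov_xy / var_x  # the shared 1/n normalizers cancel in the ratio
    adjusted = [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)]
    return adjusted, theta

pre = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]      # toy pre-period values
post = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]   # toy in-experiment values
adjusted, theta = cuped_adjust(post, pre)
variance_ratio = pvariance(adjusted) / pvariance(post)  # below 1 when x predicts y
```

The adjustment leaves the overall mean unchanged because the centered covariate sums to zero. The treatment effect is then estimated as the difference in adjusted means between arms.
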
Always check experiment health before reading results. Diagnose sample ratio mismatch (SRM) using a chi-square test, verify assignment logging, check exposure rates, compare pre-treatment covariates, inspect bot/spam filtering, and confirm metric pipelines. SRM is often a stop sign: a significant imbalance can indicate broken randomization, eligibility bugs, or logging loss.
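
A sample ratio mismatch check for a two-arm test is a chi-square goodness-of-fit test with one degree of freedom, which needs only the standard library. This is a minimal sketch; the function name `srm_p_value` and the counts are made up for illustration, and the alerting threshold is a team choice (often stricter than 0.05).

```python
from math import erfc, sqrt

def srm_p_value(n_control: int, n_treatment: int,
                expected_treatment_share: float = 0.5) -> float:
    """Chi-square goodness-of-fit p-value for the observed arm sizes (two arms)."""
    total = n_control + n_treatment
    expected_t = total * expected_treatment_share
    expected_c = total - expected_t
    chi2 = ((n_control - expected_c) ** 2 / expected_c
            + (n_treatment - expected_t) ** 2 / expected_t)
    # With 1 degree of freedom, P(chi-square > x) equals erfc(sqrt(x / 2)).
    return erfc(sqrt(chi2 / 2))

p_suspicious = srm_p_value(500_000, 504_000)  # very small: investigate before reading metrics
p_healthy = srm_p_value(500_000, 500_400)     # large: consistent with a 50/50 split
```

At this scale a gap of well under 1% between arms can already be statistically implausible, so the check is worth running on assigned and on exposed users separately.
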
Peeking creates false positives unless controlled. If teams monitor daily results and stop as soon as a result is significant, the true false positive rate can be well above the nominal $\alpha=0.05$. Use fixed-horizon analysis, alpha-spending methods, group sequential tests, always-valid p-values, or Bayesian decision rules. Practical launch reviews often combine statistical evidence with ramp-stage guardrail monitoring.
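
The inflation from peeking is easy to see in a simulated A/A test, where any "significant" result is a false positive by construction. The sketch below assumes unit-variance normal outcomes, a z-test with known variance, and five evenly spaced looks; the function name and all parameters are illustrative.

```python
import random
from math import sqrt
from statistics import NormalDist

def aa_false_positive_rates(n_sims=1000, looks=5, n_per_look=50, alpha=0.05, seed=0):
    """Simulate A/A tests; return (rate significant at ANY look, rate at FINAL look only)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    any_look_hits = 0
    final_look_hits = 0
    for _sim in range(n_sims):
        sum_c = sum_t = 0.0
        n = 0
        hit_any = False
        hit_final = False
        for _look in range(looks):
            for _obs in range(n_per_look):
                sum_c += rng.gauss(0.0, 1.0)  # both arms share the same distribution
                sum_t += rng.gauss(0.0, 1.0)
            n += n_per_look
            # z-statistic for the difference in means with known variance 1 per arm
            z = (sum_t / n - sum_c / n) / sqrt(2.0 / n)
            hit_final = abs(z) > z_crit   # after the loop, this holds the last look
            hit_any = hit_any or hit_final
        any_look_hits += hit_any
        final_look_hits += hit_final
    return any_look_hits / n_sims, final_look_hits / n_sims

peeking_rate, fixed_rate = aa_false_positive_rates()
```

The fixed-horizon rate should sit near the nominal 5%, while the stop-at-first-significance rate is typically far higher and grows with the number of looks.
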
Duration matters beyond sample size. Run long enough to capture weekly seasonality, novelty effects, learning effects, delayed retention, and ecosystem responses. A Meta product test often needs at least one full weekly cycle; changes affecting retention, notifications, creators, or ads auctions may require longer holdouts or follow-up analysis.
Multiple testing changes interpretation. If you test many variants, segments, or metrics, expect false positives. Use Bonferroni or Holm for strict family-wise error control, Benjamini-Hochberg for false discovery rate, or pre-register a primary metric and treat segments as diagnostic. Avoid launching because “one segment was significant” without a prior hypothesis.
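
The two corrections named above behave differently on the same p-values. The sketch below implements the Holm step-down and Benjamini-Hochberg step-up procedures in plain Python; the five p-values are invented for illustration.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down: controls the family-wise error rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):          # rank 0 is the smallest p-value
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break                             # stop at the first failure
    return reject

def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up: controls the false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0                                # largest rank k with p_(k) <= (k / m) * alpha
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            cutoff = rank
    reject = [False] * m
    for i in order[:cutoff]:
        reject[i] = True
    return reject

p_values = [0.30, 0.001, 0.021, 0.012, 0.035]   # e.g. five segment-level tests
holm = holm_reject(p_values)                    # [False, True, False, True, False]
bh = benjamini_hochberg_reject(p_values)        # [False, True, True, True, True]
```

Holm rejects two hypotheses here and Benjamini-Hochberg rejects four, which reflects the tradeoff: family-wise control is stricter, while false discovery rate control tolerates some false positives in exchange for power.
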
Launch decisions should quantify practical significance. A statistically significant +0.02% lift may be irrelevant if engineering cost, latency, or integrity risk is high; a non-significant but directionally positive result may still justify another iteration if the confidence interval excludes large harm. Report confidence intervals, not just p-values.
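
A confidence interval for the difference in means can be computed with a large-sample normal approximation. The sketch below uses unpooled variances; the function name `diff_in_means_ci` is hypothetical and the tiny lists are for illustration only, since real tests would have far more units per arm.

```python
from math import sqrt
from statistics import NormalDist, fmean, variance

def diff_in_means_ci(control, treatment, confidence=0.95):
    """Return (difference, lower, upper) for mean(treatment) - mean(control)."""
    diff = fmean(treatment) - fmean(control)
    # Unpooled standard error of the difference in means.
    se = sqrt(variance(treatment) / len(treatment) + variance(control) / len(control))
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return diff, diff - z * se, diff + z * se

control = [10, 12, 11, 13, 9, 12]
treatment = [11, 13, 12, 14, 10, 13]
diff, low, high = diff_in_means_ci(control, treatment)
```

Here the point estimate is positive but the interval spans zero, so the result is not significant. The lower bound still shows how much harm the data can rule out, which is the quantity that matters when deciding whether another iteration is worthwhile.
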
Ramping is part of the experiment design. Start with small exposure, e.g. 1% or less, to catch severe regressions in crashes, latency, abuse, or revenue; then ramp to 5%, 10%, 50%, and full launch as confidence grows. At each stage, monitor guardrails and avoid repeatedly redefining success criteria.

Worked example

Should we launch a new Facebook Feed ranking model?

A strong candidate would first clarify the product goal: is this ranking model intended to increase meaningful engagement, reduce low-quality content, improve retention, or increase ad value? They would also ask who is eligible, whether the model changes only ranking or also content inventory, and whether there are network effects between treated and untreated users. The answer should be organized around five pillars: experimental unit and population, metrics, power/duration, validity checks, and launch decision criteria.

For randomization, user-level assignment is likely appropriate, but the candidate should flag interference because treated users may comment, share, or message untreated friends, partially contaminating outcomes. For metrics, they might choose a primary metric such as meaningful interactions per user or a long-term retention proxy, with guardrails on negative feedback, hides, unfollows, integrity violations, ads revenue, and latency. They should specify that the test runs through at least one full weekly cycle, and possibly longer if ranking effects alter user habits over time. One explicit tradeoff is between optimizing short-term engagement and protecting long-term satisfaction: more time spent is not automatically good if it comes with increased hides, reports, or lower next-week return.

Before recommending launch, they would check SRM, exposure logging, pre-period balance, heterogeneous effects, and confidence intervals around both success and guardrail metrics. A strong close would be: “If I had more time, I’d add a longer-term holdout or post-launch monitoring plan to detect delayed retention and ecosystem effects.”

A second angle

Design an A/B test for Instagram Stories notifications

The same principles apply, but notifications introduce different constraints: treatment can directly affect user attention, opt-outs, and fatigue. The randomization unit should usually be user-level, and the candidate should separate immediate notification metrics, such as open rate, from downstream and long-term metrics, such as sessions, retention, notification disables, and uninstall rate. The biggest design risk is optimizing for short-term opens while harming trust or increasing churn. Unlike a ranking model test, a notification test may also have to account for send-frequency caps, time-of-day effects, and interference if notifications drive replies to friends. The launch decision should require improvement in meaningful downstream engagement without regressions in opt-out, complaint, or retention guardrails.

Common pitfalls

Analytical mistake: treating statistical significance as the launch decision. A tempting answer is “if p-value < 0.05 on engagement, launch.” That ignores practical significance, multiple metrics, novelty effects, and guardrails. A better answer discusses effect size, confidence intervals, pre-specified criteria, and whether the observed lift is worth the operational and product risk.

Communication mistake: jumping into formulas before clarifying the product goal. Interviewers expect structure, but they also want product judgment. Starting with sample size equations before asking what the feature is trying to accomplish makes the answer feel generic. Lead with the decision, users affected, success definition, and risks; then use statistical details to support the plan.

Depth mistake: ignoring experiment validity checks. Many candidates describe randomization and metrics but skip SRM, logging quality, exposure consistency, pre-treatment balance, and ramp monitoring. At Meta scale, instrumentation bugs and eligibility mismatches are common enough that health checks are not optional. Say explicitly that you would not interpret treatment effects until the experiment passes these checks.

Connections

Interviewers may pivot from this topic into causal inference, especially interference, difference-in-differences, or observational analysis when randomization is infeasible. They may also go deeper on metric design, power analysis, heterogeneous treatment effects, sequential testing, or marketplace/network experiments. For product sense rounds, expect follow-ups on how to balance engagement, retention, revenue, integrity, and user trust.