A/B Test Design And Analysis
Asked of: Data Scientist

What's being tested
Ability to design unbiased, well-powered experiments and analyze results to support causal claims in product settings. Interviewers look for a correct unit of randomization, clear metric definitions, power calculations, variance reduction, and safeguards against common sources of bias (peeking, interference, instrumentation errors).
Core knowledge
- Hypothesis framing: explicit treatment, control, primary metric, and a success criterion tied to a minimum detectable effect (MDE).
- Randomization unit: user, session, or device; the choice determines whether SUTVA (no interference between units) holds and how much intraclass correlation inflates variance.
- Sample size / power: use z- or t-test approximations, account for baseline rate and variance, and adjust alpha when comparing multiple arms.
- Type I/II tradeoffs, p-values vs confidence intervals, and pre-registration to avoid p-hacking.
- Multiple comparisons: Bonferroni, BH-FDR, and hierarchical testing strategies.
- Sequential testing: alpha-spending functions, O’Brien–Fleming or Pocock boundaries, or the sequential-testing support built into an experimentation platform.
- Variance reduction: CUPED (covariate adjustment), blocking/stratification, and metric transformations (log, winsorization).
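As an illustration of the variance-reduction bullet, here is a minimal CUPED sketch using only the Python standard library. It assumes one pre-experiment covariate per user (for example, the same metric measured before exposure) and that the adjustment coefficient theta is estimated on data pooled across both arms. The function name cuped_adjust and the synthetic data are hypothetical, chosen only for this example.

```python
import random
from statistics import fmean, variance


def cuped_adjust(y, x):
    """Return CUPED-adjusted outcomes: y_i - theta * (x_i - mean(x)).

    y: in-experiment metric per user (pooled across arms).
    x: pre-experiment covariate per user, same order as y.
    theta = cov(x, y) / var(x) minimizes the variance of the adjusted metric.
    """
    mx, my = fmean(x), fmean(y)
    cov_xy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (len(x) - 1)
    theta = cov_xy / variance(x)
    return [yi - theta * (xi - mx) for xi, yi in zip(x, y)]


# Synthetic illustration (made-up numbers): the outcome is strongly
# correlated with the pre-experiment covariate.
rng = random.Random(0)
pre = [rng.gauss(10, 3) for _ in range(2000)]
post = [0.8 * p + rng.gauss(0, 1) for p in pre]

adjusted = cuped_adjust(post, pre)
print("raw variance:     ", round(variance(post), 3))
print("adjusted variance:", round(variance(adjusted), 3))
```

The adjustment leaves the mean unchanged and shrinks the variance in proportion to the squared correlation between covariate and outcome, so the treatment-effect estimate keeps the same expectation but gets a tighter confidence interval.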
Worked example
(Framing a product A/B design) Start by defining the causal question and a single primary metric (e.g., seven-day retention). Choose the unit of randomization (user ID if persistent effects are expected) and justify it as the choice that avoids cross-contamination between arms. Estimate the baseline rate, pick a minimum detectable effect, and calculate the required sample size and the expected duration given traffic. Pre-specify the analysis: intention-to-treat (ITT) as primary, secondary subgroup analyses, a variance-reduction plan (CUPED), and an alpha-spending plan for any interim looks. Finally, list the failure modes to monitor: instrumentation loss, imbalance between arms, novelty effects, and unstable ratio metrics.
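The sample-size and duration step can be sketched with the standard normal approximation for comparing two proportions. This is a rough planning calculation, not a definitive method: the function names, the 20% baseline, the one-point absolute MDE, and the 5,000 eligible users per day are all assumptions made up for illustration.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline, mde_abs, alpha=0.05, power=0.8):
    """Approximate users needed per arm for a two-sided two-proportion z-test.

    baseline: control conversion rate (e.g. seven-day retention).
    mde_abs:  minimum detectable effect as an absolute difference in rates.
    """
    p1 = baseline
    p2 = baseline + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    var_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * var_sum / mde_abs ** 2)


def duration_days(n_per_arm, daily_users, arms=2):
    """Days needed to enroll every arm, assuming steady eligible traffic."""
    return ceil(n_per_arm * arms / daily_users)


# Hypothetical inputs: 20% baseline retention, +1 point absolute MDE.
n = sample_size_per_arm(baseline=0.20, mde_abs=0.01)
print("users per arm:", n)
print("days at 5,000 eligible users/day:", duration_days(n, daily_users=5000))
```

Because the required sample scales with the inverse square of the MDE, halving the MDE roughly quadruples the traffic needed, which is why the MDE should be chosen with realistic traffic in mind.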
A common pitfall
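A tempting but wrong approach is chasing statistically significant but practically meaningless lifts: designing for tiny MDEs without realistic traffic, then over-interpreting noisy ratio metrics. Equally common is a mismatch between the randomization unit and the analysis unit. Randomizing by user but treating each session as an independent observation ignores within-user correlation, which underestimates variance and inflates false positives; randomizing by session while measuring user-level outcomes exposes the same user to both arms and contaminates the comparison. Always align the randomization unit with metric aggregation and pre-specify the analysis to avoid these traps.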
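To make the variance problem concrete, the sketch below compares two standard errors for a per-session conversion rate within a single arm: a naive one that treats every session as independent, and a delta-method one that treats users as the independent units. The data are synthetic and the function names are hypothetical; the user-level Beta(2, 2) conversion propensity is an assumption chosen to create within-user correlation.

```python
import random
from math import sqrt
from statistics import fmean, variance


def naive_se(sessions_by_user):
    """SE of the session-level mean, wrongly treating sessions as independent."""
    flat = [v for sessions in sessions_by_user for v in sessions]
    return sqrt(variance(flat) / len(flat))


def delta_method_se(sessions_by_user):
    """SE of the ratio sum(y) / sum(n), with users as the independent units."""
    y = [sum(s) for s in sessions_by_user]   # conversions per user
    n = [len(s) for s in sessions_by_user]   # sessions per user
    k = len(sessions_by_user)
    my, mn = fmean(y), fmean(n)
    r = my / mn
    cov = sum((yi - my) * (ni - mn) for yi, ni in zip(y, n)) / (k - 1)
    var_r = (variance(y) - 2 * r * cov + r * r * variance(n)) / (k * mn * mn)
    return sqrt(var_r)


# Synthetic arm: each user has their own conversion propensity, so sessions
# from the same user are correlated.
rng = random.Random(1)
users = []
for _ in range(500):
    user_rate = rng.betavariate(2, 2)
    n_sessions = rng.randint(1, 20)
    users.append([1.0 if rng.random() < user_rate else 0.0
                  for _ in range(n_sessions)])

print("naive SE:       ", round(naive_se(users), 4))
print("delta-method SE:", round(delta_method_se(users), 4))
```

When sessions within a user are positively correlated, the naive standard error comes out smaller than the user-level one, so a test built on it rejects the null more often than the nominal alpha allows.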
Further reading
- Kohavi, Deng, Frasca, et al., "Online Controlled Experiments at Large Scale" (practical overview).
- Deng, Xu, Kohavi, and Walker, "Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data" (introduces CUPED, a variance-reduction technique).