Diagnose uplift drop in email A/B tests
Company: Coinbase
Role: Data Scientist
Category: Analytics & Experimentation
Difficulty: hard
Interview Round: Onsite
An e-commerce company is testing personalized product emails to improve 7-day purchase conversion. Design the experiment and then debug conflicting rerun results.
Part A — Design and sizing
1) Define a precise primary metric and 2–3 guardrails. Assume an intent-to-treat analysis with user-level randomization. State exposure/eligibility rules and how to handle multiple emails per user.
2) Sample size: Baseline 7-day purchase conversion is 3.5%. Detect a 10% relative lift (two-sided α=0.05, power=0.80), 1:1 allocation. With 500,000 eligible users/day and 85% deliverability, how many calendar days must the test run (include a full 7-day attribution window)? Show the formula and numeric result.
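The sizing in (2) can be sketched with the standard two-proportion formula. This assumes the normal approximation with unpooled variances and treats delivered users (85% of eligible) as the effective enrollment rate, an assumption a candidate should state explicitly since strict ITT would count all randomized users:

```python
import math
from scipy.stats import norm

def n_per_arm_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Per-arm sample size for a two-sided two-proportion z-test
    (normal approximation, unpooled variances)."""
    z_a = norm.ppf(1 - alpha / 2)          # ~1.96
    z_b = norm.ppf(power)                  # ~0.84
    var_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_a + z_b) ** 2 * var_sum / (p2 - p1) ** 2

p1, p2 = 0.035, 0.035 * 1.10               # 10% relative lift -> 3.85%
n = math.ceil(n_per_arm_two_proportions(p1, p2))
print(n)                                   # ~45,359 per arm (~90,718 total)

# 500k eligible/day x 85% deliverability ~ 425k reached users/day,
# so enrollment finishes within one day; then wait the full 7-day
# attribution window before reading the metric.
enroll_days = math.ceil(2 * n / (500_000 * 0.85))
print(enroll_days + 7)                     # 8 calendar days
```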
3) Now suppose the business wants to power for 7-day revenue per randomized user (mean $0.90, SD $12.00). Detect a +$0.10 absolute lift with the same α and power. What per-arm sample size and run length does this imply?
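For the revenue metric in (3), the same structure applies with the difference-in-means formula; a minimal sketch under the same α, power, and enrollment-rate assumptions as above:

```python
import math
from scipy.stats import norm

def n_per_arm_means(sd, delta, alpha=0.05, power=0.80):
    """Per-arm n for a two-sided difference-in-means test, equal variances."""
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    return 2 * (z_a + z_b) ** 2 * sd ** 2 / delta ** 2

# Mean $0.90, SD $12.00, detect +$0.10 absolute
n = math.ceil(n_per_arm_means(sd=12.00, delta=0.10))
print(n)                  # ~226,048 per arm (~452,096 total)

enroll_days = math.ceil(2 * n / (500_000 * 0.85))
print(enroll_days + 7)    # 2 enrollment days + 7-day attribution = 9 calendar days
```

Note the ~5x larger sample than the conversion sizing: the high revenue SD relative to the target lift (coefficient of variation over 13) dominates.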
Part B — Conflicting results and diagnostics
The initial RCT ran 2025-06-01 to 2025-06-14 with per-arm n=1,200,000. Control conversion=3.50%, Treatment=4.20% (+20.0% relative, +0.70 pp). A rerun on 2025-08-15 to 2025-08-28 with per-arm n=900,000 observed Control=3.50%, Treatment=3.57% (+2.0% relative, +0.07 pp).
4) For each test, compute the two-proportion z-test p-value and 95% CI for the absolute lift; then compute a fixed-effects meta-analytic pooled lift across the two tests. Should you launch? Why?
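The computation in (4) can be sketched as follows (unpooled SE for the CI, inverse-variance weights for the fixed-effects pooling). Worth flagging in an answer: the two estimates are far apart relative to their SEs, so the fixed-effects pooled lift should be read with heterogeneity in mind:

```python
import math
from scipy.stats import norm

def two_prop_summary(pc, pt, n):
    """Two-proportion z-test with equal per-arm n; returns lift, SE, p, 95% CI."""
    diff = pt - pc
    se = math.sqrt(pc * (1 - pc) / n + pt * (1 - pt) / n)  # unpooled SE
    z = diff / se
    p = 2 * norm.sf(abs(z))
    return diff, se, p, (diff - 1.96 * se, diff + 1.96 * se)

d1, se1, p1, ci1 = two_prop_summary(0.0350, 0.0420, 1_200_000)  # June run
d2, se2, p2, ci2 = two_prop_summary(0.0350, 0.0357,   900_000)  # August rerun

# Fixed-effects (inverse-variance) pooled absolute lift
w1, w2 = 1 / se1 ** 2, 1 / se2 ** 2
pooled = (w1 * d1 + w2 * d2) / (w1 + w2)

print(p1, ci1)    # p ~ 0 (z ~ 28), CI ~ [+0.65 pp, +0.75 pp]
print(p2, ci2)    # p ~ 0.011,      CI ~ [+0.02 pp, +0.12 pp]
print(pooled)     # ~ +0.42 pp absolute
```

Both runs are individually significant at 5%, but the effect sizes conflict by an order of magnitude, which is exactly what questions (5) and (6) ask you to diagnose before trusting the pooled number.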
5) List at least 6 plausible causes for the discrepancy (e.g., seasonality, targeting drift, novelty/creative fatigue, regression to the mean/winner’s curse, instrumentation/attribution changes, concurrency with promos, contamination, different triggered eligibility, population mix-shift). For each, specify 1–2 concrete checks (SQL or plots) you would run and the exact data you’d need.
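One of the checks listed in (5), population mix-shift, can be sketched as a composition comparison between the two run periods. The schema (a `run` label and a `segment` column such as region or device) and the toy data are hypothetical:

```python
# Hypothetical schema: one row per randomized user, with the run it
# belonged to and a segment label. Column names are assumptions.
import pandas as pd
from scipy.stats import chi2_contingency

def mix_shift_check(df):
    """Chi-square test of segment composition between the two runs."""
    counts = pd.crosstab(df["run"], df["segment"])
    chi2, p, dof, _ = chi2_contingency(counts)
    shares = counts.div(counts.sum(axis=1), axis=0)  # per-run segment shares
    return shares, p

# Toy data illustrating a shift toward mobile in the rerun
df = pd.DataFrame({
    "run":     ["june"] * 1000 + ["august"] * 1000,
    "segment": ["desktop"] * 600 + ["mobile"] * 400
             + ["desktop"] * 450 + ["mobile"] * 550,
})
shares, p = mix_shift_check(df)
print(shares)
print(p)   # small p => populations differ; re-weight or segment the analysis
```

The same crosstab-and-compare pattern covers several of the other checks (eligibility-rule drift, delivery rates by cohort, attribution-window coverage).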
6) Propose a re-analysis plan: pre-registration, CUPED or pre-period covariate adjustment, heterogeneity-of-treatment-effects by region/device/recency, sequential monitoring corrections, and a holdout strategy for ramp. Describe decisions you would make if the pooled lift is between +0% and +5%.
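The CUPED adjustment named in (6) is compact enough to sketch. This is a minimal illustration on simulated data, assuming pre-experiment revenue as the covariate; the coefficients in the simulation are arbitrary:

```python
import numpy as np

def cuped(y, x):
    """CUPED: y_adj = y - theta * (x - mean(x)), theta = cov(y, x) / var(x),
    where x is a pre-period covariate (e.g. pre-experiment revenue)."""
    theta = np.cov(y, x, ddof=1)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean())

# Simulated check: adjusted-metric variance drops when y and x correlate
rng = np.random.default_rng(0)
x = rng.normal(0.90, 12.0, 100_000)            # pre-period revenue
y = 0.5 * x + rng.normal(0.45, 6.0, 100_000)   # in-experiment revenue
y_adj = cuped(y, x)
print(y.var() / y_adj.var())   # variance reduction factor, ~2x here
```

Since the adjustment is mean-preserving within each arm, the treatment-effect estimate is unchanged in expectation while its confidence interval narrows, which is what makes CUPED useful for re-analyzing a borderline pooled lift.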
Quick Answer: This question evaluates a data scientist's competence in experimental design, metric and guardrail definition, power and sample-size calculation, statistical inference (two-proportion testing and fixed-effects meta-analysis), and debugging inconsistent A/B test reruns through instrumentation, population-shift, and heterogeneity checks. Interviewers ask it to assess whether a candidate can operationalize randomized email experiments, set run lengths and attribution windows, and diagnose conflicting results; it sits in the Analytics & Experimentation domain and tests practical application grounded in conceptual statistical understanding.