Compute power and interpret guardrails
Company: DoorDash
Role: Data Scientist
Category: Statistics & Math
Difficulty: hard
Interview Round: Onsite
An A/B test of a new search ranking shipped for one week across 100 DMAs (50 control, 50 treatment). Summaries:
• Orders: control=1,000,000; treatment=1,050,000 (accrued under DMA-level randomization; assume no SRM unless a check flags it).
• Mean delivery time (minutes): control=32.4 (SD=9.1), treatment=31.9 (SD=9.5).
• Cancellation rate: control=3.2%, treatment=3.5%.
• Baseline conversion: 15% (per session), target MDE=+0.3 pp.
• ICC across stores within a DMA: 0.15.
Tasks:
1) Compute the difference in mean delivery time and a 95% CI using a cluster-robust approach at the DMA level. State the exact estimator and SE formula you use and report the test statistic and p-value.
2) Check for SRM: run a chi-squared test on assignment counts using per-DMA exposure. What threshold flags SRM at α=0.05? How would you diagnose if flagged?
3) Guardrail interpretation: despite faster delivery, cancellations rose by 0.3 pp. Conduct a two-proportion z-test (and a cluster-adjusted variant). Quantify the practical significance (risk difference and relative risk) and whether this violates a pre-specified guardrail of “no increase >0.2 pp (95% CI).”
4) Power/MDE: With 50 DMAs per arm and the stated ICC, compute the design effect and the required per-DMA sample for detecting a +0.3 pp conversion lift at 80% power, α=0.05. Show formulas and numeric results.
5) Multiple metrics: You tracked 5 secondary metrics. Propose a Benjamini–Hochberg FDR=10% correction and illustrate with hypothetical p-values. When would you instead prefer Holm–Bonferroni?
6) Sensitivity: A mid-week outage hit 5 treatment DMAs. Explain a pre-registered diff-in-diff that uses last week as pre-period and weather/outage covariates, without introducing post-treatment bias. Include the regression specification with DMA and day fixed effects.
7) CUPED: Define a high-R² covariate (e.g., prior-week DMA mean delivery time) and write the CUPED-adjusted estimator for the treatment effect.
Quick Answer: This question evaluates experimental design and applied statistics for cluster-randomized A/B tests: cluster-robust inference, mean and proportion comparisons, power/MDE calculations with ICC and design effects, multiple-testing control (BH vs. Holm–Bonferroni), and sensitivity adjustments such as difference-in-differences and CUPED.
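A minimal sketch of Task 1's DMA-level approach: aggregate to per-DMA means and run a Welch t-test across clusters, which matches a cluster-robust SE when clusters are equally weighted. The per-DMA values below are simulated for illustration, since the prompt gives only pooled summaries.

```python
# Task 1 sketch: cluster-robust inference via DMA-level aggregation.
# NOTE: per-DMA means are SIMULATED for illustration; in practice you
# would compute the real mean delivery time for each of the 100 DMAs.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control_dma = rng.normal(32.4, 1.5, size=50)  # hypothetical DMA means
treat_dma = rng.normal(31.9, 1.5, size=50)

diff = treat_dma.mean() - control_dma.mean()
# SE of a difference in cluster means (two independent samples of 50):
se = np.sqrt(treat_dma.var(ddof=1) / 50 + control_dma.var(ddof=1) / 50)
t_stat, p_val = stats.ttest_ind(treat_dma, control_dma, equal_var=False)
ci = (diff - 1.98 * se, diff + 1.98 * se)  # t_{0.975, df ~ 98} ~ 1.98
print(f"diff={diff:.3f} min, SE={se:.3f}, t={t_stat:.2f}, p={p_val:.3f}")
```

With only 100 clusters, the t reference distribution (not the normal) and the cluster-level SE are what make this inference honest; the pooled order-level SDs (9.1, 9.5) would badly overstate precision.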
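Task 2 can be sketched as a chi-squared goodness-of-fit test against the designed 50/50 split, illustrated here on the order totals, with an important caveat in the comments:

```python
# Task 2 sketch: SRM check via chi-squared goodness-of-fit against the
# designed 50/50 split. Caveat: total orders are an OUTCOME under this
# treatment, so a proper SRM check should use pre-exposure traffic
# (e.g., sessions) or the DMA assignment counts themselves.
from scipy import stats

observed = [1_000_000, 1_050_000]      # control, treatment orders
expected = [sum(observed) / 2] * 2     # 50/50 design split

chi2, p = stats.chisquare(f_obs=observed, f_exp=expected)
print(f"chi2={chi2:.1f}, p={p:.3g}")   # p < 0.05 would flag SRM
# With 1 df, the flag threshold is chi2 > 3.84 at alpha = 0.05.
```

If flagged, diagnose by segmenting the ratio over time (did it drift after the outage?), by DMA, and by platform/logging pipeline before trusting any outcome comparison.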
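For Task 3, a minimal two-proportion z-test on the pooled counts, plus the risk difference, relative risk, and the naive 95% CI checked against the 0.2 pp guardrail; the cluster-adjusted variant would inflate the SE by the square root of the design effect:

```python
# Task 3 sketch: two-proportion z-test on cancellation rates, with risk
# difference, relative risk, and a naive 95% CI vs. the guardrail of
# "no increase > 0.2 pp". A cluster-adjusted variant multiplies the SE
# by sqrt(DEFF), which can widen the CI substantially.
import math

n_c, n_t = 1_000_000, 1_050_000
p_c, p_t = 0.032, 0.035
p_pool = (p_c * n_c + p_t * n_t) / (n_c + n_t)

se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
z = (p_t - p_c) / se_pool                          # pooled-SE test stat
se_ci = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
rd = p_t - p_c                                     # +0.3 pp
rr = p_t / p_c                                     # ~1.09 relative risk
ci_lo = rd - 1.96 * se_ci                          # naive CI lower bound
print(f"z={z:.1f}, RD={rd:.4f}, RR={rr:.3f}, CI lower={ci_lo:.5f}")
# Naively ci_lo exceeds 0.002, so the guardrail is violated; the
# verdict can change once the SE is inflated for within-DMA correlation.
```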
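Task 4's design-effect arithmetic can be sketched numerically; the punchline is a feasibility check, because with a fixed number of clusters the effective sample size is bounded no matter how large each DMA grows:

```python
# Task 4 sketch: iid sample size for a +0.3 pp lift, then the cluster
# design effect DEFF = 1 + (m - 1) * ICC. Key point: with k clusters
# per arm, effective n = k * m / DEFF -> k / ICC as m grows, so check
# feasibility before solving for a per-DMA sample size m.
from scipy import stats

alpha, power = 0.05, 0.80
p1, p2 = 0.15, 0.153                      # baseline and +0.3 pp target
z_a = stats.norm.ppf(1 - alpha / 2)       # ~1.96
z_b = stats.norm.ppf(power)               # ~0.84

var_sum = p1 * (1 - p1) + p2 * (1 - p2)
n_unadj = (z_a + z_b) ** 2 * var_sum / (p2 - p1) ** 2   # per arm, iid
k, icc = 50, 0.15
eff_cap = k / icc                         # effective-n ceiling as m -> inf
print(f"n per arm (iid) ~ {n_unadj:,.0f}; effective-n cap = {eff_cap:.0f}")
# The cap (~333) is orders of magnitude below the required ~224k, so no
# per-DMA sample achieves 80% power for +0.3 pp with 50 DMAs per arm;
# you would need far more clusters, a larger MDE, or variance reduction.
```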
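Task 5's Benjamini–Hochberg step-up procedure at FDR q = 0.10, illustrated with five hypothetical p-values (invented for this sketch):

```python
# Task 5 sketch: Benjamini-Hochberg at FDR q = 0.10 on five hypothetical
# secondary-metric p-values (values invented for illustration).
import numpy as np

p = np.array([0.003, 0.012, 0.04, 0.06, 0.30])   # must be sorted ascending
q, m = 0.10, len(p)
thresholds = q * np.arange(1, m + 1) / m          # i * q / m
passing = np.nonzero(p <= thresholds)[0]
k = passing.max() + 1 if passing.size else 0      # largest i: p_(i) <= iq/m
reject = np.zeros(m, dtype=bool)
reject[:k] = True                                 # reject the k smallest
print(reject)                                     # here the 4 smallest pass
```

Prefer Holm–Bonferroni when you need familywise error control, i.e., when any single false positive is costly (guardrail-adjacent metrics, launch-blocking decisions); BH trades that strictness for power across many exploratory secondary metrics.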
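A sketch of Task 6's specification, y_dt = alpha_d + gamma_t + beta * (Treat_d x Post_t) + e_dt, estimated by OLS with dummy variables on simulated data. The outage would enter as a pre-specified indicator (DMA-by-day exposure, determined by the outage itself, not by outcomes), never by dropping DMAs based on post-treatment behavior:

```python
# Task 6 sketch: diff-in-diff with DMA and day fixed effects on
# SIMULATED data (100 DMAs x 14 days: pre-week + experiment week).
# Pre-registration fixes the covariates and the outage indicator up
# front, avoiding post-treatment conditioning.
import numpy as np

rng = np.random.default_rng(2)
n_dma, n_day = 100, 14
treat = np.repeat([0, 1], 50)             # first 50 control, last 50 treated
dma = np.repeat(np.arange(n_dma), n_day)  # panel indices
day = np.tile(np.arange(n_day), n_dma)
post = (day >= 7).astype(float)           # second week = experiment week
d_it = treat[dma] * post                  # DiD interaction term

beta_true = -0.5                          # simulated effect (minutes)
y = (rng.normal(32.4, 1.0, n_dma)[dma]    # DMA fixed effects
     + 0.1 * day                          # common day effects
     + beta_true * d_it
     + rng.normal(0.0, 0.3, n_dma * n_day))

# Design matrix: DMA dummies and day dummies (one dropped each to avoid
# collinearity with the intercept), the interaction, and an intercept.
X = np.column_stack([
    np.eye(n_dma)[dma][:, 1:],
    np.eye(n_day)[day][:, 1:],
    d_it,
    np.ones(len(y)),
])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0][-2]  # coef on d_it
print(f"DiD estimate ~ {beta_hat:.3f}")
```

In practice, SEs for beta_hat should again be clustered at the DMA level; weather and the outage indicator would be appended as extra columns.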
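Task 7's CUPED estimator uses theta = cov(Y, X) / var(X) with X = prior-week DMA mean delivery time, then compares arms on the adjusted outcome Y - theta * (X - mean(X)). Data below are simulated for illustration:

```python
# Task 7 sketch: CUPED with a high-R^2 pre-period covariate. Per-DMA
# values are SIMULATED; X is the prior-week DMA mean delivery time,
# which is unaffected by treatment, so adjusting by it is unbiased.
import numpy as np

rng = np.random.default_rng(1)
k = 50
x_c = rng.normal(32.5, 2.0, k)             # prior-week means, control
x_t = rng.normal(32.5, 2.0, k)             # prior-week means, treatment
y_c = 0.8 * x_c + rng.normal(6.4, 1.0, k)  # this-week outcome, corr. with X
y_t = 0.8 * x_t + rng.normal(5.9, 1.0, k)  # treatment ~0.5 min faster

y = np.concatenate([y_c, y_t])
x = np.concatenate([x_c, x_t])
theta = np.cov(y, x)[0, 1] / np.var(x, ddof=1)   # pooled OLS slope
y_cuped = y - theta * (x - x.mean())             # variance-reduced outcome
effect = y_cuped[k:].mean() - y_cuped[:k].mean()
print(f"theta={theta:.3f}, CUPED effect={effect:.3f}")
```

The variance of y_cuped shrinks by a factor of roughly (1 - R^2) relative to y, which is what tightens the CI on the treatment effect.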