Context
You ran a 1-week A/B test of a new search ranking with clustered randomization at the DMA level: 100 DMAs total (50 control, 50 treatment). Outcomes are aggregated from per-order/per-session data. Unless stated, assume no Sample Ratio Mismatch (SRM).
Given summaries:
- Orders: control = 1,000,000; treatment = 1,050,000
- Mean delivery time (minutes): control = 32.4 (SD = 9.1); treatment = 31.9 (SD = 9.5)
- Cancellation rate: control = 3.2%; treatment = 3.5%
- Baseline conversion: 15% (per session), target MDE for conversion = +0.3 percentage points (pp)
- Intra-cluster correlation (ICC) across stores within a DMA for conversion = 0.15
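The baseline conversion, MDE, and ICC listed above combine through the standard design effect, DEFF = 1 + (m - 1) * ICC. The following is a minimal sketch, not a definitive solution. It assumes equal-size clusters, that the ICC applies to sessions within a DMA, and a two-sided two-proportion z-test. The function names and the symbols `m` (average sessions per DMA) and `k` (DMAs per arm) are introduced here for illustration only.

```python
from statistics import NormalDist


def design_effect(m: float, icc: float) -> float:
    """Design effect for equal-size clusters of m units each."""
    return 1.0 + (m - 1.0) * icc


def srs_n_per_arm(p0: float, delta: float, alpha: float = 0.05,
                  power: float = 0.80) -> float:
    """Per-arm sample size for a two-proportion z-test, ignoring clustering."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    p1 = p0 + delta
    return z ** 2 * (p0 * (1 - p0) + p1 * (1 - p1)) / delta ** 2


def sessions_per_cluster(n_srs: float, k: int, icc: float):
    """Solve k * m / design_effect(m, icc) = n_srs for m.

    Returns None when the target is unreachable: the effective sample
    size per arm can never exceed k / icc, however large m becomes.
    """
    if k <= n_srs * icc:
        return None
    return n_srs * (1 - icc) / (k - n_srs * icc)


# Values taken from the summaries above (assumption: ICC is at the session level).
n_srs = srs_n_per_arm(p0=0.15, delta=0.003)
m_required = sessions_per_cluster(n_srs, k=50, icc=0.15)
effective_cap = 50 / 0.15  # upper bound on effective sessions per arm
print(round(n_srs), m_required, round(effective_cap))
```

Under these assumptions the unclustered requirement is roughly 224,000 sessions per arm, while 50 DMAs at ICC = 0.15 cap the effective sample at about 333 per arm, so the sketch returns `None`. A different reading of the ICC (for example, stores as the clustering unit) would change that conclusion.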
Assumptions to complete missing context:
- Cluster-robust inference is at the DMA level. Where DMA-level variance of cluster means is not provided, we approximate it using within-arm SDs and average per-DMA sample sizes, noting this can be optimistic if there is between-DMA heterogeneity.
- For cancellation and delivery time, we treat orders as the unit of analysis; for power/MDE, sessions are the relevant unit for conversion.
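The approximation described in the first assumption can be sketched as follows. This is an illustration under stated assumptions, not a full cluster-robust estimator: it treats each DMA mean as having variance SD² / m, where m is the average number of orders per DMA, ignores any between-DMA heterogeneity, and uses a normal reference distribution. The function name `approx_diff_ci` is hypothetical.

```python
import math
from statistics import NormalDist


def approx_diff_ci(mean_c, sd_c, n_c, k_c, mean_t, sd_t, n_t, k_t, alpha=0.05):
    """Approximate treatment - control difference with a normal-theory CI.

    Each arm mean is treated as the average of k cluster means, where a
    cluster mean has variance sd**2 / m and m = n / k orders per cluster.
    Between-cluster variance is assumed to be zero (optimistic).
    """
    var_c = (sd_c ** 2 / (n_c / k_c)) / k_c
    var_t = (sd_t ** 2 / (n_t / k_t)) / k_t
    se = math.sqrt(var_c + var_t)
    diff = mean_t - mean_c
    z = diff / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    crit = NormalDist().inv_cdf(1 - alpha / 2)
    return diff, se, z, p_value, (diff - crit * se, diff + crit * se)


# Delivery-time summaries from the problem statement.
diff, se, z, p_value, ci = approx_diff_ci(
    mean_c=32.4, sd_c=9.1, n_c=1_000_000, k_c=50,
    mean_t=31.9, sd_t=9.5, n_t=1_050_000, k_t=50,
)
print(diff, se, z, p_value, ci)
```

Because the between-DMA component is set to zero, the standard error here collapses to the order-level value (about 0.013 minutes). With actual per-DMA means available, the sample variance of those 50 means per arm would replace `sd**2 / m` and would typically give a much wider interval.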
Tasks
- Difference in mean delivery time: compute the treatment–control difference and a 95% CI using a cluster-robust approach at the DMA level. State the estimator and SE formula you use, and report the test statistic and p-value.
- SRM check: run a chi-squared test on assignment counts using per-DMA exposure. What threshold flags SRM at α = 0.05? If flagged, how would you diagnose?
- Guardrail interpretation: despite faster delivery, cancellations rose by 0.3 pp. Conduct a two-proportion z-test and a cluster-adjusted variant. Quantify practical significance (risk difference and relative risk) and assess the guardrail “no increase > 0.2 pp (95% CI).”
- Power/MDE: With 50 DMAs per arm and ICC = 0.15 (for conversion), compute the design effect and the required per-DMA sample to detect a +0.3 pp lift at 80% power, α = 0.05. Show formulas and numeric results.
- Multiple metrics: You tracked 5 secondary metrics. Propose a Benjamini–Hochberg FDR = 10% correction and illustrate with hypothetical p-values. When would you instead prefer Holm–Bonferroni?
- Sensitivity: A mid-week outage hit 5 treatment DMAs. Explain a pre-registered difference-in-differences using last week as pre-period and weather/outage covariates, avoiding post-treatment bias. Provide the regression with DMA and day fixed effects.
- CUPED: Define a high-R² covariate (e.g., prior-week DMA mean delivery time) and write the CUPED-adjusted estimator for the treatment effect.
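For the multiple-metrics task, the two procedures it names can be sketched in a few lines. The p-values below are hypothetical, chosen only to show how the two corrections can disagree. Benjamini–Hochberg controls the false discovery rate (here at 10%), while Holm–Bonferroni controls the family-wise error rate (here assumed at α = 0.05).

```python
def benjamini_hochberg(pvals, q=0.10):
    """Step-up BH: reject the k smallest p-values, where k is the largest
    rank with p_(k) <= k * q / m."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * q / m:
            k_max = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            rejected[i] = True
    return rejected


def holm_bonferroni(pvals, alpha=0.05):
    """Step-down Holm: compare the j-th smallest p-value (j = 0, 1, ...)
    with alpha / (m - j) and stop at the first failure."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    rejected = [False] * m
    for j, i in enumerate(order):
        if pvals[i] <= alpha / (m - j):
            rejected[i] = True
        else:
            break
    return rejected


# Hypothetical p-values for the 5 secondary metrics (not from the experiment).
hypothetical_p = [0.004, 0.030, 0.045, 0.120, 0.600]
bh = benjamini_hochberg(hypothetical_p, q=0.10)
holm = holm_bonferroni(hypothetical_p, alpha=0.05)
print(bh)
print(holm)
```

With these illustrative values, BH at 10% rejects the three smallest p-values and Holm at 5% rejects only the smallest. That is the usual trade-off: Holm is stricter and suits cases where any single false positive is costly (for example, guardrail or launch-blocking metrics), while BH tolerates a controlled share of false discoveries among exploratory secondary metrics.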