Interpreting 30 Parallel A/B Tests With Mixed Results
Context
- You ran 30 independent within-company A/B tests of a new message-sending feature (50/50 split).
- Per-test significance level: α = 0.05 (assume two-sided unless pre-specified otherwise).
- Outcomes: 2 significant uplifts, 1 significant decline, and 27 non-significant effects.
Tasks
- What conclusions can you draw from these results before and after accounting for multiple testing?
- How should multiple testing be addressed (family-wise error rate vs. false discovery rate control), and which correction methods (e.g., Bonferroni, Holm, Benjamini–Hochberg) would you apply before declaring the feature effective?
- What are the implications for rollout decisions?
Hint: Compare the expected number of false positives under the global null (30 × 0.05 = 1.5) with the 3 significant results observed, discuss family-wise error rate vs. FDR control, and explain the implications for rollout decisions.
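As a sketch of the arithmetic behind the hint, the snippet below computes the expected false-positive count and the chance of seeing at least 3 significant results under the global null, then applies the three named corrections. The three p-values are hypothetical placeholders, not observed data:

```python
from math import comb

m = 30          # number of parallel tests
alpha = 0.05    # per-test significance level

# Expected false positives if the feature had no effect anywhere (global null)
expected_fp = m * alpha  # 30 * 0.05 = 1.5

# Chance of 3 or more "significant" tests purely by luck: Binomial(30, 0.05)
p_three_or_more = sum(
    comb(m, k) * alpha**k * (1 - alpha) ** (m - k) for k in range(3, m + 1)
)  # roughly 0.19, so 3/30 is unsurprising under the global null

# Hypothetical p-values for the 3 significant tests (placeholders, not real data);
# the other 27 tests have p > 0.05 and cannot pass any threshold below.
pvals = sorted([0.01, 0.03, 0.04])

# Bonferroni (FWER control): reject only if p <= alpha / m
bonferroni_rejects = [p <= alpha / m for p in pvals]

# Holm (FWER control, step-down): compare the i-th smallest p to alpha / (m - i),
# stopping at the first failure
holm_rejects = []
still_rejecting = True
for i, p in enumerate(pvals):
    still_rejecting = still_rejecting and p <= alpha / (m - i)
    holm_rejects.append(still_rejecting)

# Benjamini-Hochberg (FDR control): find the largest rank k with
# p_(k) <= (k / m) * alpha, then reject hypotheses 1..k
k_max = max(
    (k for k, p in enumerate(pvals, start=1) if p <= (k / m) * alpha),
    default=0,
)
bh_rejects = [rank <= k_max for rank in range(1, len(pvals) + 1)]
```

With these placeholder p-values, none of the three corrections rejects anything: 1.5 false positives are expected by chance and the probability of at least 3 nominal "hits" under the global null is about 19%, so the observed 2 uplifts + 1 decline are compatible with no true effect.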