Multiple Testing and Sequential Monitoring
Asked of: Data Scientist

-
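What it is
Running many hypothesis tests (variants, metrics, or audience slices) inflates the chance of at least one false positive; that is the multiple testing problem. For example, 20 independent tests at a 0.05 significance level give a 1 - 0.95^20 ≈ 64% chance of at least one false positive when no true effect exists. Sequential monitoring means checking results repeatedly and stopping early, which invalidates classical fixed-sample p-values unless the analysis is adjusted for the repeated looks.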
-
Why interviewers ask about it
Large platforms run thousands of overlapping experiments and track dozens of correlated metrics and segments. Knowing how to control error rates under many comparisons and continuous peeking is essential to shipping safely and quickly at scale.
-
Core ideas to know
- Families of tests: define what’s “in scope” (variants, metrics, slices) before analysis to control error rates meaningfully.
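- Error rates: FWER (probability of any false positive in the family) vs FDR (expected share of false positives among rejections). Bonferroni controls FWER under any dependence but is conservative; Benjamini–Hochberg controls FDR with more power and is valid under independence or positive dependence, a common assumption for positively correlated metrics; Benjamini–Yekutieli covers arbitrary dependence at some cost in power.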
- Peeking problem: repeated looks inflate Type I error unless you use group-sequential boundaries or anytime-valid methods.
- Anytime-valid approaches: the mixture sequential probability ratio test (mSPRT), always-valid p-values, and confidence sequences; these remain valid under continuous monitoring and arbitrary stopping.
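- Alpha spending: O’Brien–Fleming boundaries (very strict at early looks, close to the nominal threshold at the final look) or Pocock boundaries (the same threshold at every look) for pre-planned interim looks in long-running tests.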
- Cross-experiment control: online FDR when many experiments run over time on shared users.
- Practical workflow: pre-register primary metric and stopping rule; correct secondary metrics and slices; monitor guardrails with appropriate adjustments.
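
To make the Bonferroni vs Benjamini–Hochberg trade-off concrete, here is a minimal sketch in plain Python. The function names and the six p-values are made up for illustration; in practice a library routine would typically be used instead.

```python
def bonferroni(pvals, alpha=0.05):
    """Reject H_i when p_i <= alpha / m. Controls FWER under any dependence."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]


def benjamini_hochberg(pvals, alpha=0.05):
    """Step-up BH procedure: find the largest rank k with p_(k) <= (k / m) * alpha
    and reject the k smallest p-values. Controls FDR under independence or
    positive dependence."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices by ascending p-value
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * alpha:
            cutoff = rank  # keep the largest rank that passes
    reject = [False] * m
    for i in order[:cutoff]:
        reject[i] = True
    return reject


# Hypothetical p-values for six metrics in one pre-defined family.
pvals = [0.001, 0.008, 0.012, 0.041, 0.30, 0.74]
bonf = bonferroni(pvals)          # rejects the first 2 hypotheses
bh = benjamini_hochberg(pvals)    # rejects the first 3 hypotheses
```

On this example Bonferroni rejects two hypotheses and BH rejects three, which illustrates the extra power BH buys by controlling FDR rather than FWER.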
-
A common pitfall
Candidates say “I’ll just check daily and stop when p<0.05,” or “I’ll slice by country until something is significant.” Both inflate false positives. Another mistake is applying Benjamini–Hochberg to daily peeks as if they were independent tests; BH controls error across hypotheses, not across repeated looks at the same hypothesis over time. Finally, failing to predefine the family (e.g., primary metric plus key slices) leads to under-correction and to launches that regress when rolled out.
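
The cost of “stop when p<0.05” can be checked with a small simulation. The sketch below assumes five equally spaced looks at an A/A test (no true effect) and a z-statistic built from equal-sized batches; the function name, trial count, and seed are illustrative choices. It compares the naive 1.96 threshold at every look with the Pocock constant for five looks at a two-sided 0.05 level (about 2.413).

```python
import random


def false_positive_rate(critical_z, n_looks=5, n_trials=20000, seed=0):
    """Share of simulated A/A tests where |z| exceeds critical_z at any look."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        cumulative = 0.0
        for look in range(1, n_looks + 1):
            # Each look adds one equal-sized batch; under the null its
            # standardized contribution is N(0, 1).
            cumulative += rng.gauss(0.0, 1.0)
            z = cumulative / look ** 0.5  # z-statistic on all data so far
            if abs(z) > critical_z:
                hits += 1  # "significant" at this look: stop and declare a win
                break
    return hits / n_trials


naive = false_positive_rate(1.96)    # peek with the fixed-sample threshold
pocock = false_positive_rate(2.413)  # Pocock boundary for 5 looks
print(f"naive peeking: {naive:.3f}, Pocock boundary: {pocock:.3f}")
```

Under these assumptions the naive rule should come out near 14% false positives, while the Pocock threshold keeps the overall rate close to 5%. The same setup extends to daily looks by raising n_looks, with a correspondingly larger boundary constant.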
-
Further reading
- Johari, Koomen, Pekelis, Walsh (2022). Always Valid Inference: Continuous Monitoring of A/B Tests. Operations Research. Why: Rigorous, practitioner-focused treatment of anytime-valid p-values for sequential monitoring. https://pubsonline.informs.org/doi/10.1287/opre.2021.2135
- Larsen et al. (2023). Statistical Challenges in Online Controlled Experiments: A Review of A/B Testing Methodology. Why: Survey covering mSPRT, sequential designs, and multiple testing at scale. https://arxiv.org/abs/2212.11366
- Kohavi, Tang, Xu (2020). Trustworthy Online Controlled Experiments. Why: Industry textbook on building experimentation culture; chapters on peeking, guardrails, and multiple comparisons. https://www.cambridge.org/core/books/trustworthy-online-controlled-experiments/D97B26382EB0EB2DC2019A7A7B518F59