Scenario
You are the data scientist for a consumer health product running ~100 concurrent A/B tests under tight timelines and limited traffic. Leadership wants to understand how you design experiments, control error rates across many tests, adapt decisions as data accumulates, and choose robust product metrics.
Task
- Portfolio-level error control with low traffic: You must run roughly 100 experiments but cannot give each one enough traffic to reach the conventional 0.05 per-test significance level. How would you set and manage the overall Type I error rate so that the features you launch are as impactful as possible? Discuss the trade-offs and specific procedures (see the multiple-testing sketch after this section).
- Adaptive decision-making: After seeing preliminary results, how would you incorporate an adaptive approach, such as a multi-armed bandit, to update decision thresholds over time while preserving valid inference? Outline the mechanics and safeguards (see the bandit and peeking sketches below).
- Metric design: For a new feature, define a primary success metric you would track, explain why it matters, and describe how you would guard against metric swamping or gaming. Include guardrail metrics and practical launch criteria (see the launch-check sketch below).
Consider multiple-testing corrections (Bonferroni, Holm, FDR), sequential testing, power vs. speed trade-offs, and principled metric selection (north-star and guardrails).
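As one concrete reference point for the error-control task, the sketch below applies Bonferroni, Holm, and Benjamini-Hochberg corrections to a batch of 100 p-values using only numpy. The p-value mixture (roughly 20 real effects among 80 nulls) is made up purely for illustration; an answer would discuss when FWER control (Bonferroni, Holm) versus FDR control (BH) matches the cost of a false launch.

```python
import numpy as np

def bonferroni(pvals, alpha=0.05):
    """Reject H0_i when p_i <= alpha / m; controls FWER but is very conservative."""
    m = len(pvals)
    return pvals <= alpha / m

def holm(pvals, alpha=0.05):
    """Holm step-down: same FWER guarantee as Bonferroni, uniformly more powerful."""
    m = len(pvals)
    order = np.argsort(pvals)
    reject = np.zeros(m, dtype=bool)
    for rank, idx in enumerate(order):
        if pvals[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # step-down: stop at the first p-value that fails
    return reject

def benjamini_hochberg(pvals, q=0.10):
    """BH step-up: controls the false discovery rate at level q, i.e. the
    expected fraction of launched 'wins' that are actually null effects."""
    m = len(pvals)
    order = np.argsort(pvals)
    thresholds = q * np.arange(1, m + 1) / m
    below = pvals[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()  # largest rank passing its threshold
        reject[order[: k + 1]] = True   # step-up: reject everything at or below it
    return reject

rng = np.random.default_rng(0)
# 100 tests: ~20 genuine effects (small p-values) mixed with 80 nulls (uniform).
pvals = np.concatenate([rng.beta(0.5, 12, 20), rng.uniform(size=80)])
for name, rej in [("Bonferroni", bonferroni(pvals)),
                  ("Holm", holm(pvals)),
                  ("BH (FDR 10%)", benjamini_hochberg(pvals))]:
    print(f"{name:14s} launches {rej.sum():2d} of 100")
```

Running it makes the trade-off tangible: BH typically passes noticeably more candidates than Bonferroni or Holm at the cost of tolerating a known share of false launches, which is often the right bargain for a low-traffic portfolio.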
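For the adaptive task, here is a minimal Thompson-sampling sketch for Bernoulli conversion metrics, assuming Beta(1, 1) priors and made-up true rates. The 10% uniform-random floor is one of the safeguards the question asks about: it prevents any arm from being starved and preserves an unbiased slice of traffic for inference; real systems would also log assignment propensities for reweighted estimates.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical true conversion rates, unknown to the algorithm.
true_rates = np.array([0.040, 0.046, 0.052])
n_arms = len(true_rates)

# Beta(1, 1) priors on each arm's conversion rate.
successes = np.ones(n_arms)
failures = np.ones(n_arms)

EXPLORE_FLOOR = 0.10  # safeguard: never route less than 10% of traffic at random

for user in range(20_000):
    if rng.uniform() < EXPLORE_FLOOR:
        # Uniform-random holdout keeps an unbiased slice for valid inference.
        arm = int(rng.integers(n_arms))
    else:
        # Thompson sampling: draw one sample per posterior, play the argmax.
        arm = int(np.argmax(rng.beta(successes, failures)))
    converted = rng.uniform() < true_rates[arm]
    successes[arm] += converted
    failures[arm] += 1 - converted

posterior_mean = successes / (successes + failures)
print("posterior means:", np.round(posterior_mean, 4))
print("traffic share:  ", np.round((successes + failures - 2) / 20_000, 3))
```

The traffic shares show the core behavior: allocation drifts toward the best-performing arm as evidence accumulates, which trades some statistical power on the losers for less regret during the test.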
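On the sequential-testing point, the short A/A simulation below shows why peeking at accumulating data with a fixed 0.05 threshold inflates the false positive rate, which is the failure mode that alpha-spending boundaries and always-valid (mSPRT-style) p-values are designed to fix. The number of looks and batch sizes are arbitrary choices for the demonstration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

N_SIMS, LOOKS, N_PER_LOOK = 2_000, 10, 200

false_positives = 0
for _ in range(N_SIMS):
    # A/A test: both arms draw from the same distribution, so H0 is true.
    a = rng.normal(size=(LOOKS, N_PER_LOOK))
    b = rng.normal(size=(LOOKS, N_PER_LOOK))
    for look in range(1, LOOKS + 1):
        # Peek after each batch and test all accumulated data at 0.05.
        _, p = stats.ttest_ind(a[:look].ravel(), b[:look].ravel())
        if p < 0.05:
            false_positives += 1
            break  # an eager experimenter ships at the first apparent win

print(f"Type I error with {LOOKS} peeks: {false_positives / N_SIMS:.3f} "
      "(nominal 0.05)")
```

With ten looks the realized Type I error lands well above the nominal 0.05, which is the quantitative argument for pre-registered stopping rules rather than ad-hoc peeking.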
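For the metric-design task, a launch decision often reduces to a conjunction: a practically meaningful, statistically resolved win on the primary metric, and non-inferiority on every guardrail. The sketch below encodes that rule; all metric names, lifts, and thresholds are hypothetical placeholders, not recommendations.

```python
from dataclasses import dataclass

@dataclass
class MetricResult:
    name: str
    relative_lift: float   # treatment vs. control, e.g. 0.02 = +2%
    ci_lower: float        # lower bound of the CI on the relative lift
    ci_upper: float        # upper bound

def should_launch(primary: MetricResult, guardrails: list[MetricResult],
                  min_lift: float = 0.01,
                  guardrail_floor: float = -0.005) -> bool:
    """Launch only if the primary metric's CI excludes zero with a lift
    above the pre-registered minimum, AND no guardrail's CI admits a
    regression worse than the non-inferiority floor."""
    primary_ok = primary.ci_lower > 0 and primary.relative_lift >= min_lift
    guardrails_ok = all(g.ci_lower >= guardrail_floor for g in guardrails)
    return primary_ok and guardrails_ok

# Hypothetical example: the primary metric wins, but the latency guardrail
# cannot rule out a regression beyond the floor, so the launch is blocked.
primary = MetricResult("7-day retained users", 0.018, 0.006, 0.030)
guardrails = [
    MetricResult("crash-free sessions", 0.001, -0.002, 0.004),
    MetricResult("p95 app latency (inverted)", -0.004, -0.009, 0.001),
]
print("launch?", should_launch(primary, guardrails))  # -> False
```

Pre-registering both thresholds before looking at the data is what makes this a gaming-resistant criterion rather than a post-hoc rationalization.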