Powering Two Online Experiments: Sample Size, Duration, and Design Defenses
You are designing experiments to improve a friend-accept rate metric in a high-traffic consumer app. Assume independent users, a Bernoulli outcome (accept vs not), and that traffic estimates refer to unique eligible users per day.
- Baseline accept rate: 9.0% (p0 = 0.09)
- Eligible traffic: 10M unique eligible users/day
Scenario A — Two-arm A/B Test
- Goal: Detect an absolute lift of +0.5 percentage points (Δ = +0.005) in accept rate.
- Test: Two-sided, α = 0.05; power = 0.80; equal allocation (50/50).
Tasks:
- Compute the per-arm sample size and the calendar days required.
- Recompute both if the true standard deviation is 20% higher than assumed.
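The Scenario A tasks can be sketched with the standard normal-approximation sample-size formula for a two-proportion z-test; the function name and rounding here are illustrative, not part of the exercise:

```python
from statistics import NormalDist

def per_arm_n(p0, delta, alpha=0.05, power=0.80):
    """Per-arm n for a two-sided two-proportion z-test (unpooled variance):
    n = (z_{alpha/2} + z_beta)^2 * [p0(1-p0) + p1(1-p1)] / delta^2."""
    nd = NormalDist()
    z_a = nd.inv_cdf(1 - alpha / 2)   # 1.96 for alpha = 0.05, two-sided
    z_b = nd.inv_cdf(power)           # 0.84 for 80% power
    p1 = p0 + delta
    var = p0 * (1 - p0) + p1 * (1 - p1)
    return ((z_a + z_b) ** 2) * var / delta ** 2

n = per_arm_n(0.09, 0.005)        # ≈ 52.7k users per arm
days = 2 * n / 10_000_000         # both arms drawn from 10M eligible/day
n_inflated = 1.2 ** 2 * n         # SD up 20% → variance up 44% → n up 44%
```

Because n scales with the variance, a 20% higher standard deviation multiplies the required sample size by 1.2² = 1.44; at 10M users/day, even the inflated total is collected in well under one day.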
Scenario B — Three Arms with Shared Control
- Arms: Control (C), Treatment 1 (T1), Treatment 2 (T2)
- Allocation: 20% to C, 40% to T1, 40% to T2
- Goal: Detect +0.3 pp (Δ = +0.003) for either T1 vs C or T2 vs C with familywise α = 0.05 (two-sided).
Tasks:
- Choose and justify Holm vs Bonferroni vs Dunnett for multiple comparisons.
- Recompute per-arm sample sizes and total duration under your choice.
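A rough sketch of the Scenario B computation under a simple Bonferroni correction (per-comparison α = 0.05/2), accounting for the unequal 20/40/40 allocation; Dunnett's test would give slightly smaller sizes, and the function name is illustrative:

```python
from statistics import NormalDist

def control_arm_n(p0, delta, ratio, alpha=0.025, power=0.80):
    """Control-arm n for one two-sided comparison against a treatment arm
    receiving `ratio` times the control's traffic (here 40% / 20% = 2).
    `alpha` is the per-comparison level: 0.05 / 2 under Bonferroni.
    Var(p_t - p_c) = p0(1-p0)/n_c + p1(1-p1)/(ratio * n_c)."""
    nd = NormalDist()
    z_a = nd.inv_cdf(1 - alpha / 2)   # ≈ 2.24 at per-test alpha = 0.025
    z_b = nd.inv_cdf(power)
    p1 = p0 + delta
    var = p0 * (1 - p0) + p1 * (1 - p1) / ratio
    return ((z_a + z_b) ** 2) * var / delta ** 2

n_c = control_arm_n(0.09, 0.003, ratio=2.0)  # control users needed
n_t = 2.0 * n_c                              # each treatment arm
total = n_c + 2 * n_t                        # = 5 * n_c across all arms
days = total / 10_000_000                    # at 10M eligible users/day
```

Note the Δ = +0.003 target drives n up roughly as 1/Δ², so Scenario B needs far more users per comparison than Scenario A despite the shared control.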
For Both Scenarios
- Discuss the consequences of daily peeking with a naive p < 0.05 rule; propose a sequential design (e.g., O’Brien–Fleming) and how it changes stopping boundaries.
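The peeking problem can be demonstrated with a small Monte Carlo under the null: the running z-statistic is checked at each of K equally spaced looks. The √(K/k) boundary below is the classic O’Brien–Fleming shape, not exact Lan–DeMets spending values, and the look count of 14 is an illustrative choice:

```python
import math
import random

def crossing_rate(boundary, looks=14, sims=20_000, seed=0):
    """Fraction of null runs whose running z-stat exceeds boundary(k) at any
    of `looks` equally spaced analyses. Under H0 the running z at look k is
    S_k / sqrt(k), where S_k is a sum of k iid N(0, 1) daily increments."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(sims):
        s = 0.0
        for k in range(1, looks + 1):
            s += rng.gauss(0.0, 1.0)
            if abs(s) / math.sqrt(k) > boundary(k):
                hits += 1
                break
    return hits / sims

K = 14
naive = crossing_rate(lambda k: 1.96, K)                  # naive daily p < 0.05
obf = crossing_rate(lambda k: 1.96 * math.sqrt(K / k), K) # OBF-shaped boundary
```

The naive rule's false-positive rate climbs well above the nominal 5% as looks accumulate, while the O’Brien–Fleming-style boundary spends almost no α early (very wide early thresholds) and keeps the overall rate near 5%.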
- Explain when you’d switch to user-level CUPED or cluster-robust SEs (e.g., feed-level clustering) and how that affects sample size.