A/B Testing Stats: Confidence Intervals, Power, Multiple Testing, and Clustering
Context: You are planning an A/B experiment on a Bernoulli outcome (conversion). The baseline conversion rate is p0 = 0.051 measured on N0 = 100,000 users. You want to detect a +5% relative lift with two-sided α = 0.05 and 80% power.
Given:
- Baseline p0 = 0.051, N0 = 100,000
- Target lift: p1 = 1.05 × p0
- Significance: α = 0.05 (two-sided); Power: 80%
Tasks:
- Compute a 95% Wald confidence interval and a Wilson (or Agresti–Coull) interval for p0. Explain why Wilson/AC may be preferable.
- Approximate the per-variant sample size required for 80% power to detect the target lift using a normal approximation for two proportions. State all formulas and assumptions.
- If you simultaneously test three metrics (conversion, AOV, retention), apply a Bonferroni or Holm correction and provide the adjusted α for each. Discuss trade-offs vs FDR control (Benjamini–Hochberg).
- Your data exhibits user-level clustering (repeat visitors). Explain why independence is violated, how to correct standard errors (e.g., cluster-robust SEs or user-level aggregation), and how that affects power.
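
For the confidence-interval task, a minimal sketch of both intervals using only the Python standard library. The function names wald_interval and wilson_interval are illustrative, not part of the problem statement. The Wald interval is p0 ± z × sqrt(p0(1 − p0)/N0); the Wilson interval inverts the score test, which shifts the center toward 1/2 and keeps the bounds inside [0, 1].

```python
from math import sqrt
from statistics import NormalDist


def wald_interval(p_hat, n, conf=0.95):
    """Wald interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    half = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half


def wilson_interval(p_hat, n, conf=0.95):
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Values from the "Given" list: p0 = 0.051, N0 = 100,000.
wald = wald_interval(0.051, 100_000)
wilson = wilson_interval(0.051, 100_000)
print("Wald:   (%.5f, %.5f)" % wald)
print("Wilson: (%.5f, %.5f)" % wilson)
```

At this sample size the two intervals nearly coincide (both roughly 0.0496 to 0.0524). Wilson is usually preferred because Wald coverage degrades when n is small or p is close to 0 or 1, which is common for conversion rates on small segments.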
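
For the sample-size task, one common normal-approximation formula for a two-sided test of two proportions with equal allocation is

n per variant = [ z(1 − α/2) × sqrt(2 p̄ (1 − p̄)) + z(power) × sqrt(p0(1 − p0) + p1(1 − p1)) ]² / (p1 − p0)², with p̄ = (p0 + p1)/2.

The sketch below assumes independent users, equal group sizes, and that p0 is known exactly (the uncertainty in the baseline estimate is ignored). The function name n_per_variant is illustrative.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_per_variant(p0, p1, alpha=0.05, power=0.80):
    """Per-variant n for a two-sided two-proportion z-test (equal allocation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value under H0
    z_beta = NormalDist().inv_cdf(power)           # quantile for target power
    p_bar = (p0 + p1) / 2
    null_sd = sqrt(2 * p_bar * (1 - p_bar))         # pooled SD under H0
    alt_sd = sqrt(p0 * (1 - p0) + p1 * (1 - p1))    # unpooled SD under H1
    n = ((z_alpha * null_sd + z_beta * alt_sd) / (p1 - p0)) ** 2
    return ceil(n)


p0 = 0.051
p1 = 1.05 * p0  # 0.05355, an absolute difference of 0.00255
n = n_per_variant(p0, p1)
print(n)
```

With the values given, this comes out on the order of 120,000 users per variant. Because the required n scales with 1/(p1 − p0)², halving the detectable lift roughly quadruples the sample size.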
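
For the multiple-testing task, a short sketch of the thresholds with m = 3 metrics. Bonferroni tests every metric at α/m = 0.05/3 ≈ 0.0167. Holm sorts the p-values and compares them to α/3, α/2, α/1 in turn, stopping at the first failure. Benjamini–Hochberg compares the i-th smallest p-value to i × q/m and rejects everything up to the largest i that passes. The p-values below are made up purely for illustration, and the function names are hypothetical.

```python
def holm_thresholds(m, alpha=0.05):
    """Step-down thresholds for the smallest, 2nd smallest, ... p-value."""
    return [alpha / (m - rank) for rank in range(m)]


def holm_reject(pvals, alpha=0.05):
    """Holm step-down: controls the family-wise error rate."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # stop at the first p-value that fails its threshold
    return reject


def bh_reject(pvals, q=0.05):
    """Benjamini-Hochberg step-up: controls the false discovery rate."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * q / m:
            k_max = rank  # remember the largest rank that passes
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject


# Made-up p-values for (conversion, AOV, retention).
pvals = [0.012, 0.030, 0.040]
print(holm_thresholds(3))   # about [0.0167, 0.025, 0.05]
print(holm_reject(pvals))
print(bh_reject(pvals))
```

On these illustrative p-values Holm rejects only the first metric, while Benjamini–Hochberg rejects all three. That is the trade-off in miniature: Bonferroni and Holm bound the chance of any false positive, whereas BH tolerates a controlled fraction of false discoveries in exchange for more power.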
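
For the clustering task, a rough sketch of two ideas. Sessions from the same user are correlated, so treating them as independent understates the standard error. With roughly equal cluster sizes, the variance inflation is often approximated by the design effect 1 + (m − 1) × ICC, where m is the average number of observations per user and ICC is the intra-user correlation. The values m = 3 and ICC = 0.2, the placeholder sample size, and the toy session data are all assumptions chosen for illustration; none of them come from the problem statement.

```python
from math import ceil


def design_effect(avg_obs_per_user, icc):
    """Approximate variance inflation from clustering (equal cluster sizes)."""
    return 1 + (avg_obs_per_user - 1) * icc


def inflate_sample_size(n_independent, avg_obs_per_user, icc):
    """Observations needed to match the power of n independent observations."""
    return ceil(n_independent * design_effect(avg_obs_per_user, icc))


def aggregate_to_users(sessions):
    """Collapse (user_id, converted) rows to one 0/1 outcome per user.

    A user counts as converted if any of their sessions converted, so the
    analysis unit matches the randomization unit.
    """
    per_user = {}
    for user_id, converted in sessions:
        per_user[user_id] = max(per_user.get(user_id, 0), int(converted))
    return per_user


# Hypothetical inputs: 3 sessions per user on average, ICC of 0.2.
print(design_effect(3, 0.2))                  # about 1.4
print(inflate_sample_size(100_000, 3, 0.2))   # about 140,000 sessions

# Toy session log, made up for illustration.
sessions = [("u1", 0), ("u1", 1), ("u2", 0), ("u2", 0), ("u3", 1)]
users = aggregate_to_users(sessions)
print(users)
```

Either correction (cluster-robust standard errors at the session level, or aggregation to one row per user) widens the intervals relative to a naive session-level analysis, so the same number of sessions yields less power than an independence-based calculation suggests.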