Ship/No-Ship Decision: Type I/II Errors, Cost-Sensitive Testing, Sample Size, and Multiple Testing
Context
You are deciding whether to ship a new dispatch algorithm based on an A/B test. The business outcome of interest is an improvement in gross bookings. The decision is binary: ship or do not ship.
- Beneficial threshold (effect of interest): at least +1% absolute improvement in the primary metric.
- Prior probability that the true effect is beneficial (≥ +1%): p = 0.30.
- Costs:
  - False Positive (Type I): ship when the algorithm is not beneficial (effect < +1%); expected cost C_FP = $500,000.
  - False Negative (Type II): do not ship when the algorithm is beneficial (effect ≥ +1%); expected cost C_FN = $50,000.
Assume an A/B test on a proportion metric (e.g., conversion), with equal allocation per arm.
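To make the cost structure concrete, here is a minimal sketch of the prior-weighted expected loss of a ship/no-ship rule. It assumes alpha is read as P(ship | not beneficial) and beta as P(do not ship | beneficial), which simplifies the composite hypotheses above. The function names are hypothetical and this is not a reference solution.

```python
# Sketch only: expected loss of a ship/no-ship rule given its error rates.
# Assumption: alpha = P(ship | not beneficial), beta = P(no ship | beneficial).

P_BENEFICIAL = 0.30   # prior probability that the true effect is >= +1%
C_FP = 500_000        # cost of shipping a non-beneficial algorithm (Type I)
C_FN = 50_000         # cost of not shipping a beneficial algorithm (Type II)


def expected_loss(alpha: float, beta: float, p: float = P_BENEFICIAL,
                  c_fp: float = C_FP, c_fn: float = C_FN) -> float:
    """Return (1 - p) * alpha * C_FP + p * beta * C_FN."""
    return (1 - p) * alpha * c_fp + p * beta * c_fn


def alpha_to_beta_weight(p: float = P_BENEFICIAL, c_fp: float = C_FP,
                         c_fn: float = C_FN) -> float:
    """Return how much one unit of alpha costs relative to one unit of beta."""
    return ((1 - p) * c_fp) / (p * c_fn)
```

With the stated inputs, the weight on alpha is (0.70 × 500,000) / (0.30 × 50,000), roughly 23 times the weight on beta. That ratio is what argues for a stricter alpha than the conventional default.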
Tasks
(a) Define Type I (false positive, alpha) and Type II (false negative, beta) errors in this ship/no-ship context. Which error is worse in expectation and why?
(b) Derive the decision rule that minimizes expected loss in terms of alpha and beta. Show explicitly how p, C_FP, and C_FN enter the expression.
(c) Choose concrete values for alpha and power (1 − beta) that are consistent with your cost-sensitive rule and justify them.
(d) Using your chosen alpha and power, compute the minimum sample size per arm for a two-sided test of the difference in proportions (a difference-in-means test on a binary metric) with a baseline conversion rate of 10% and a minimum detectable effect (MDE) of +1 percentage point absolute (i.e., 10% vs 11%). State any standard approximations you use.
(e) If you must simultaneously test 10 metrics, explain how your alpha/power and sample size change under:
- Family-Wise Error Rate (FWER) control via Bonferroni.
- False Discovery Rate (FDR) control via Benjamini–Hochberg (BH).
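For tasks (d) and (e), the following is a minimal sketch rather than a reference solution. It assumes the common unpooled normal approximation for a two-sided two-proportion test, n per arm ≈ (z_{1−alpha/2} + z_{1−beta})² · (p1(1−p1) + p2(1−p2)) / (p2 − p1)². It also assumes Bonferroni is applied by dividing alpha equally across the metrics. All function names are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(p1: float, p2: float, alpha: float, power: float) -> int:
    """Approximate n per arm for a two-sided two-proportion z-test.

    Assumption: unpooled-variance normal approximation with equal allocation.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile for 1 - beta
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance_sum / (p2 - p1) ** 2
    return math.ceil(n)


def bonferroni_alpha(alpha: float, m: int) -> float:
    """Per-test alpha that keeps the family-wise error rate at most alpha."""
    return alpha / m


def benjamini_hochberg(p_values: list[float], q: float) -> list[bool]:
    """Return a reject flag per hypothesis under the BH step-up procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_rank = 0
    # Find the largest rank k whose sorted p-value satisfies p_(k) <= k * q / m.
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            largest_rank = rank
    rejected = [False] * m
    # Reject every hypothesis ranked at or below that k.
    for idx in order[:largest_rank]:
        rejected[idx] = True
    return rejected
```

As a usage illustration with placeholder values (not the values task (c) asks you to justify), `sample_size_per_arm(0.10, 0.11, 0.05, 0.80)` gives the single-metric requirement, and `sample_size_per_arm(0.10, 0.11, bonferroni_alpha(0.05, 10), 0.80)` gives the larger requirement under Bonferroni with 10 metrics. BH is applied to the observed p-values after the test rather than fixing a single per-test alpha in advance, so its effect on the planned sample size depends on how many metrics are expected to show true effects.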