Interpret A/B results with p-values and uncertainty
Company: Upstart
Role: Data Scientist
Category: Statistics & Math
Difficulty: medium
Interview Round: HR Screen
You ran an A/B experiment for 14 days (2025-08-15 to 2025-08-28) with 1:1 allocation: N_control = 500,000 users and N_treatment = 500,000 users. Summarized results:
- Sessions/user: control 3.20, treatment 3.28; relative lift +2.5%; SE(lift) 1.2%; p=0.032; desired direction up; not a guardrail.
- 7-day retention rate: control 28.0%, treatment 28.6%; absolute diff +0.6 pp; SE 0.35 pp; p=0.078; desired up.
- Video CTR: control 4.0%, treatment 4.6%; relative lift +15.0%; SE(lift) 4.5%; p=0.004; desired up.
- Hide rate: control 1.80%, treatment 2.05%; relative lift +13.9% (worse); SE(lift) 5.0%; p=0.011; guardrail metric.
- Time per session: control 5.80 min, treatment 5.95 min; relative lift +2.6%; SE(lift) 1.5%; p=0.092; desired up.
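Each row above gives a point estimate and a standard error, so a two-sided 95% interval can be sketched as estimate ± 1.96 × SE. The snippet below is a minimal illustration that assumes each estimate is approximately normal; the function names are hypothetical, and the units follow the rows above (percent relative lift, except retention, which is in percentage points).

```python
# Effect estimates and standard errors copied from the summary rows.
# Units: percent relative lift, except 7-day retention (percentage points).
metrics = {
    "Sessions/user": (2.5, 1.2),
    "7-day retention": (0.6, 0.35),
    "Video CTR": (15.0, 4.5),
    "Hide rate": (13.9, 5.0),
    "Time per session": (2.6, 1.5),
}

Z_95 = 1.96  # two-sided 95% critical value of the standard normal


def confidence_interval(estimate, se, z=Z_95):
    """Normal-approximation interval: estimate +/- z * se."""
    return estimate - z * se, estimate + z * se


def excludes_zero(lo, hi):
    """True when the interval lies entirely on one side of zero."""
    return lo > 0 or hi < 0


for name, (est, se) in metrics.items():
    lo, hi = confidence_interval(est, se)
    print(f"{name}: [{lo:.2f}, {hi:.2f}] excludes 0: {excludes_zero(lo, hi)}")
```

Under these assumptions the intervals for Sessions/user, Video CTR, and Hide rate exclude zero, while those for 7-day retention and Time per session do not.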
Answer the following:
1) For each metric, construct a two-sided 95% confidence interval using the provided effect size and SE, and interpret whether it excludes no effect.
2) Apply the Benjamini–Hochberg procedure at FDR 5% across the five p-values. Which metrics remain significant? Show your steps.
3) Discuss statistical vs. practical significance for Video CTR and Sessions/user; include a back-of-the-envelope estimate of incremental engaged sessions per day if rolled to 100% of US new users (state any reasonable assumption you need).
4) Hide rate is a guardrail and increased significantly. Quantify the expected absolute change (in pp) and discuss Type I/II risks, Type S/M errors, and whether this should block rollout despite other gains.
5) Power check: Assuming baseline 7-day retention = 28% and target MDE = +0.5 pp absolute at α=0.05 (two-sided) and 80% power, estimate the required per-variant sample size using a normal approximation. Is the current experiment sufficiently powered for that MDE?
6) Provide a concise go/no-go recommendation with rationale and any follow-up analyses you would run (e.g., heterogeneity by new vs. existing users, device, or pin_format).
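For question 2, the Benjamini–Hochberg step-up procedure sorts the m p-values in ascending order, compares the k-th smallest against k × q / m, and rejects every hypothesis up to the largest k that passes. The sketch below applies it to the five reported p-values at q = 0.05; the function name is hypothetical, and it treats all five metrics (including the guardrail) as one family, as the question asks.

```python
# Reported p-values from the experiment summary.
p_values = {
    "Sessions/user": 0.032,
    "7-day retention": 0.078,
    "Video CTR": 0.004,
    "Hide rate": 0.011,
    "Time per session": 0.092,
}


def benjamini_hochberg(p_values, q=0.05):
    """Return the names rejected by the BH step-up procedure at FDR level q."""
    m = len(p_values)
    ranked = sorted(p_values.items(), key=lambda kv: kv[1])
    largest_k = 0
    for k, (_, p) in enumerate(ranked, start=1):
        # Keep the largest rank whose p-value is at or below its threshold.
        if p <= k * q / m:
            largest_k = k
    return [name for name, _ in ranked[:largest_k]]


significant = benjamini_hochberg(p_values)
print(significant)
```

With thresholds 0.01, 0.02, 0.03, 0.04, 0.05 for ranks 1 to 5, only Video CTR (0.004) and Hide rate (0.011) pass. Sessions/user (0.032) narrowly misses its rank-3 threshold of 0.03, so it is no longer significant after correction.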
Quick Answer: This question evaluates proficiency in statistical inference for A/B testing, covering confidence intervals, p-values, multiple-testing correction (Benjamini–Hochberg), effect-size interpretation, power/sample-size calculation, and guardrail risk assessment.
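For the power check in question 5, one common normal approximation for comparing two proportions is n per variant ≈ (z_{1-α/2} + z_{power})² × [p1(1 − p1) + p2(1 − p2)] / δ². The sketch below evaluates it for a 28% baseline and a +0.5 pp target; the function name is hypothetical, and the formula assumes independent users with binomial variance, which may not match how the reported SE was computed.

```python
from math import ceil, sqrt
from statistics import NormalDist


def required_n_per_variant(p_base, mde_abs, alpha=0.05, power=0.80):
    """Per-variant sample size for a two-sided two-proportion z-test."""
    p_treat = p_base + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance_sum = p_base * (1 - p_base) + p_treat * (1 - p_treat)
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / mde_abs ** 2)


n_required = required_n_per_variant(p_base=0.28, mde_abs=0.005)

# Binomial SE of the difference (in pp) implied by 500,000 users per arm.
n_actual = 500_000
implied_se_pp = 100 * sqrt((0.28 * 0.72 + 0.285 * 0.715) / n_actual)

print(n_required, n_required <= n_actual, round(implied_se_pp, 3))
```

This works out to roughly 127,000 users per variant, so on the binomial approximation 500,000 per arm is comfortably powered for a +0.5 pp MDE. The reported SE of 0.35 pp is, however, several times larger than the roughly 0.09 pp this approximation implies, so it is worth checking how the retention SE was derived (for example, a smaller eligible population or clustered variance) before relying on that conclusion.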