Formulate hypotheses and compute A/B test significance
Company: Uber
Role: Data Scientist
Category: Statistics & Math
Difficulty: hard
Interview Round: Technical Screen
Using the following A/B test snapshot for the pickup ETA card experiment, answer all parts.
Data (7-day snapshot):
- Primary metric (trip completion rate per request):
• Control A: nA = 50,000 requests, cA = 6,000 completions
• Treatment B: nB = 50,000 requests, cB = 6,420 completions
- Guardrail 1 (rider cancel rate per request; same request counts nA and nB as the primary metric):
• Control A: cancelsA = 4,500
• Treatment B: cancelsB = 4,950
- Guardrail 2 (wait time minutes, per request):
• A: meanA = 4.8, sdA = 3.2, nA = 50,000
• B: meanB = 4.7, sdB = 3.4, nB = 50,000
- Interim looks: 5, at equally spaced information times, with no pre-registered alpha-spending plan.
Tasks:
1) State precise H0 and H1 for the primary metric; specify one- vs two-sided and justify.
2) Choose the appropriate test for the primary metric (difference in proportions) and compute: test statistic, p-value, and a 95% CI for the lift. Show formulas and numeric results.
3) For Guardrail 2 (mean wait time), select the correct test (e.g., Welch’s t-test) and compute the 95% CI of the mean difference. State any distributional assumptions and why Welch vs pooled.
4) Perform a multiple-testing correction across the three outcomes (Primary, Guardrail 1, Guardrail 2) using Holm–Bonferroni at familywise α = 0.05. Identify which effects remain significant.
5) Explain, in plain language, what the p-value you computed in (2) does and does not mean.
6) Given the 5 unplanned interim looks, re-evaluate significance using a simple Pocock or O’Brien–Fleming alpha-spending approach. Outline the approach and give an approximate adjusted conclusion; exact boundaries are not required, but justify your decision.
7) If pre-period completion rate per rider has correlation r = 0.40 with the in-experiment outcome, estimate the approximate variance reduction from CUPED and discuss how that would change required sample size or interpretation.
8) Conclude: ship, iterate, or stop? Defend your decision considering the guardrails.
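The arithmetic behind tasks 2 to 4 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a reference solution: the function names are hypothetical, the test uses a pooled standard error while the interval uses an unpooled (Wald) standard error, "lift" is read as the absolute difference pB - pA, and the Welch statistic is compared against the normal distribution because each arm has 50,000 observations.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal, used for p-values and critical values


def two_prop_ztest(c_a, n_a, c_b, n_b, level=0.95):
    """Two-sided z-test for p_b - p_a, plus a Wald CI for the absolute difference."""
    p_a, p_b = c_a / n_a, c_b / n_b
    diff = p_b - p_a
    pooled = (c_a + c_b) / (n_a + n_b)
    # Pooled SE for the test statistic (valid under H0: p_a == p_b)
    se_pooled = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = diff / se_pooled
    p_value = 2 * (1 - Z.cdf(abs(z)))
    # Unpooled SE for the confidence interval
    se_unpooled = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    crit = Z.inv_cdf(0.5 + level / 2)
    return z, p_value, (diff - crit * se_unpooled, diff + crit * se_unpooled)


def welch_from_summary(mean_a, sd_a, n_a, mean_b, sd_b, n_b, level=0.95):
    """Welch-style comparison of means from summary statistics (normal approximation)."""
    diff = mean_b - mean_a
    # Unequal-variance SE: each arm contributes its own variance
    se = sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    stat = diff / se
    p_value = 2 * (1 - Z.cdf(abs(stat)))
    crit = Z.inv_cdf(0.5 + level / 2)
    return stat, p_value, (diff - crit * se, diff + crit * se)


def holm(p_values, alpha=0.05):
    """Holm step-down: compare the i-th smallest p-value with alpha / (m - i)."""
    ordered = sorted(p_values.items(), key=lambda kv: kv[1])
    m = len(ordered)
    rejected = {name: False for name in p_values}
    for i, (name, p) in enumerate(ordered):
        if p > alpha / (m - i):
            break  # once one test fails, all larger p-values also fail
        rejected[name] = True
    return rejected


# Numbers taken from the 7-day snapshot in the question
primary = two_prop_ztest(6000, 50_000, 6420, 50_000)
cancel = two_prop_ztest(4500, 50_000, 4950, 50_000)
wait_time = welch_from_summary(4.8, 3.2, 50_000, 4.7, 3.4, 50_000)

decisions = holm({
    "primary": primary[1],
    "cancel_rate": cancel[1],
    "wait_time": wait_time[1],
})

print(f"primary:   z={primary[0]:.2f}, p={primary[1]:.2e}, CI={primary[2]}")
print(f"cancel:    z={cancel[0]:.2f}, p={cancel[1]:.2e}, CI={cancel[2]}")
print(f"wait time: z={wait_time[0]:.2f}, p={wait_time[1]:.2e}, CI={wait_time[2]}")
print(decisions)
```

Under these assumptions the primary metric gives z of roughly 4.03 with a 95% CI for pB - pA of about (0.0043, 0.0125), and all three outcomes stay significant after Holm. Note the directions: completion rises and mean wait time falls, but the cancel rate moves the unfavorable way (0.090 to 0.099), which matters for the ship decision in task 8.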
Quick Answer: This question evaluates a data scientist's competency in experimental design and statistical inference for A/B testing, covering hypothesis formulation, difference-in-proportions testing and confidence intervals, guardrail analysis, multiple-testing correction, interim alpha-spending approaches, and variance-reduction techniques such as CUPED.
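For the interim-look and CUPED parts, a rough numeric check is often enough. The sketch below is an illustration under stated assumptions, with hypothetical function names. It does not compute Pocock or O’Brien–Fleming boundaries; instead it substitutes a simpler and more conservative Bonferroni split of α across the 5 looks, which gives a per-look threshold stricter than a Pocock-style constant boundary. For CUPED it uses the standard result that the adjusted variance is (1 - r²) times the original, and it assumes the stated r = 0.40 applies at the unit of analysis.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal


def bonferroni_look_threshold(alpha=0.05, looks=5):
    """Two-sided critical |z| when alpha is split evenly across looks.

    This is a conservative stand-in, not a Pocock or O'Brien-Fleming boundary.
    """
    per_look_alpha = alpha / looks
    return Z.inv_cdf(1 - per_look_alpha / 2)


def cuped_variance_factor(r):
    """Fraction of the original variance that remains after CUPED adjustment."""
    return 1 - r ** 2


# Pooled z for the primary metric, recomputed from the snapshot counts
n_a = n_b = 50_000
p_a, p_b = 6000 / n_a, 6420 / n_b
pooled = (6000 + 6420) / (n_a + n_b)
z_primary = (p_b - p_a) / sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))

threshold = bonferroni_look_threshold(alpha=0.05, looks=5)
still_significant = abs(z_primary) > threshold

factor = cuped_variance_factor(0.40)   # remaining variance fraction
se_ratio = sqrt(factor)                # standard error shrinks by this ratio
sample_size_ratio = factor             # required n scales with variance

print(f"z_primary={z_primary:.2f}, per-look threshold={threshold:.3f}, "
      f"significant={still_significant}")
print(f"CUPED: variance x{factor:.2f}, SE x{se_ratio:.3f}, "
      f"required n x{sample_size_ratio:.2f}")
```

With r = 0.40 the variance drops by about 16%, so the same power needs roughly 84% of the original sample, or equivalently confidence intervals narrow by about 8%. Because the observed z of about 4.03 clears even the conservative per-look threshold of about 2.58, the primary result is unlikely to be an artifact of the five looks alone.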