Design a robust email A/B test
Company: Uber
Role: Data Scientist
Category: Analytics & Experimentation
Difficulty: hard
Interview Round: Technical Screen
You own a weekly email campaign to 10M users. Baseline CTR is 3.0% and the unsubscribe rate is 0.08% per send. Marketing proposes a new subject line expected to increase CTR by 6% relative (MDE ≈ +0.18pp absolute, i.e., from 3.00% to 3.18%). Design the experiment end-to-end:
1) Randomization: What is the randomization unit and any stratification blocks you would use (e.g., locale, device, engagement tier)? How do you prevent contamination from resends and cross-campaign overlap in the same week?
2) Power: Compute per-arm sample size for α=0.05 (two-sided) and 80% power for detecting +0.18pp absolute lift on CTR from a 3.0% baseline, assuming independent Bernoulli outcomes at the user-send level. State your assumptions and show the formula you would use.
3) Metrics: Choose a single primary success metric and at least two guardrail metrics (e.g., unsubscribe rate, spam complaints). Define each precisely (numerator/denominator, window), and justify the choice over alternatives like open rate.
4) Sequential monitoring: Leadership wants daily peeks and the ability to stop early for harm. Propose a valid plan (e.g., alpha-spending or group-sequential boundaries) that controls the Type I error. Specify the monitoring schedule and stopping/continuation rules.
5) Mid-experiment checks: What diagnostics would you run after 48 hours to detect randomization failure, instrumentation delays, or traffic mix shifts (e.g., weekend effects)? How would you correct issues without biasing estimates?
6) Results handling: If the interim shows negative CTR lift but higher opens, enumerate at least three plausible causes and the next decision (continue, stop-for-harm, or redesign). Explain how you would handle intention-to-treat vs per-protocol and what you would report to stakeholders.
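As a rough illustration of the arithmetic behind part 2, the sketch below applies the standard unpooled two-proportion formula, n per arm = (z_{1-α/2} + z_{power})² · [p1(1-p1) + p2(1-p2)] / δ², to the numbers in the prompt. The function name `per_arm_sample_size` is hypothetical, and the calculation assumes a simple two-arm test with equal allocation, independent Bernoulli clicks, and one send per user; other variants of the formula (for example, pooled variance under the null) give slightly different answers.

```python
from math import ceil
from statistics import NormalDist


def per_arm_sample_size(p_control, abs_lift, alpha=0.05, power=0.80):
    """Per-arm n for a two-sided two-proportion z-test (unpooled variance).

    Assumes equal allocation and independent Bernoulli outcomes.
    """
    p_treat = p_control + abs_lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile for target power
    variance_sum = p_control * (1 - p_control) + p_treat * (1 - p_treat)
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / abs_lift ** 2)


# Numbers from the prompt: 3.0% baseline CTR, +0.18pp absolute MDE.
n_per_arm = per_arm_sample_size(p_control=0.03, abs_lift=0.0018)
print(n_per_arm)
```

Under these assumptions the requirement comes out on the order of 145,000 users per arm, a small fraction of the 10M-user list. That leaves room to hold most users out of the test or to power guardrail metrics with much lower base rates, such as unsubscribes.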
Quick Answer: This question evaluates a data scientist's competency in experimental design, including randomization and stratification, statistical power and sample-size estimation, primary and guardrail metric definition, sequential monitoring, and data-quality diagnostics for large-scale email A/B tests.
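For the sequential-monitoring part, one common option is a Lan-DeMets alpha-spending function of the O'Brien-Fleming type, α*(t) = 2·(1 − Φ(z_{1-α/2} / √t)), where t is the fraction of planned information observed so far. The sketch below only tabulates how much of the overall α is spent by each look. It assumes seven equally spaced daily looks, which is an illustrative choice rather than part of the prompt, and the function names are hypothetical. Turning the spent α into exact z-boundaries requires numerical integration over the joint distribution of the interim statistics, which is not shown here.

```python
from statistics import NormalDist


def obf_alpha_spent(info_fraction, alpha=0.05):
    """Cumulative two-sided alpha spent at a given information fraction
    under the Lan-DeMets O'Brien-Fleming-type spending function."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return 2 * (1 - NormalDist().cdf(z / info_fraction ** 0.5))


def spending_schedule(n_looks=7, alpha=0.05):
    """Return (look, info_fraction, cumulative_alpha, incremental_alpha)
    for equally spaced looks (an assumption; real traffic may be uneven)."""
    rows = []
    previous = 0.0
    for look in range(1, n_looks + 1):
        t = look / n_looks
        cumulative = obf_alpha_spent(t, alpha)
        rows.append((look, t, cumulative, cumulative - previous))
        previous = cumulative
    return rows


schedule = spending_schedule()
for look, t, cumulative, increment in schedule:
    print(f"look {look}: t={t:.2f} cumulative={cumulative:.6f} increment={increment:.6f}")
```

This spending function releases almost no α at the earliest looks and reaches the full α only at the final look, so early stopping for efficacy needs overwhelming evidence. A separate, typically one-sided, harm boundary on the guardrail metrics could be layered on top using the same idea.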