Plan and analyze a ranking A/B test
Company: Netflix
Role: Data Scientist
Category: Analytics & Experimentation
Difficulty: hard
Interview Round: Onsite
A search team proposes a new ranking feature. Design, execute, and analyze the experiment:

1. Unit of randomization: decide between user-level, session-level, or query-level randomization, and justify the choice given cross-session carryover and possible network interference.
2. Metrics: define the primary success metric (e.g., query-level success rate or paid conversion within 24h) and guardrails (latency, crash rate, ads revenue, bounce rate).
3. Power and sample size: the baseline click-through rate is 10%, and you must detect a relative +2% uplift (i.e., 10% to 10.2%) at two-sided alpha = 0.05 with power = 0.8. Show the formula and compute the required per-variant sample size for a standard two-proportion z-test; then discuss how clustering or CUPED would change it.
4. Execution: outline SRM checks, triggered vs. intent-to-treat analyses, bucketing consistency across services, a burn-in period for novelty effects, and sequential monitoring that does not inflate the Type I error rate.
5. Heterogeneity: propose pre-registered segments (e.g., head vs. tail queries, country, device) and explain how you would test for treatment-by-segment interaction while controlling the false discovery rate.
6. Interference and long-term effects: if ranking changes affect supply/demand dynamics, propose cluster randomization or switchback testing and explain how to interpret the results.
7. Rollout: define stop/go criteria, a ramp plan, and a way to update the ML training data that avoids entangling model training with experiment exposure.
Quick Answer: This question evaluates experimental-design and causal-inference competencies for online A/B testing: metric definition, choice of randomization unit under cross-session carryover and interference, power and sample-size calculation, sequential monitoring, heterogeneity analysis, and safe rollout with ML-retraining considerations. Illustrative sketches for the quantitative parts (sample size, CUPED, SRM, alpha spending, FDR control) follow below.
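For item (3), the standard normal-approximation formula for a two-sided two-proportion z-test is

n per variant = ( z_{1-alpha/2} * sqrt(2*p_bar*(1-p_bar)) + z_{1-beta} * sqrt(p1*(1-p1) + p2*(1-p2)) )^2 / (p2 - p1)^2,

with p1 = 0.10, p2 = 0.102, and p_bar their average. A minimal Python sketch (the function name is illustrative):

```python
import math

from scipy.stats import norm


def two_proportion_sample_size(p1: float, p2: float,
                               alpha: float = 0.05,
                               power: float = 0.80) -> int:
    """Per-variant n for a two-sided two-proportion z-test."""
    z_alpha = norm.ppf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_beta = norm.ppf(power)           # 0.8416 for power = 0.80
    p_bar = (p1 + p2) / 2              # pooled proportion under H0
    num = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
           + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(num / (p2 - p1) ** 2)


# Baseline CTR 10%, relative +2% uplift -> 10.2% (absolute delta = 0.002).
print(two_proportion_sample_size(0.10, 0.102))  # ~356,000 per variant
```

The absolute delta of 0.002 enters the denominator squared, so the answer lands near 356,000 users per variant. Clustering (randomizing users but analyzing queries) inflates this by the design effect 1 + (m - 1) * ICC, where m is queries per user, while CUPED deflates it, as sketched next.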
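On the CUPED part of item (3): with a pre-experiment covariate X correlated with the metric (e.g., each user's pre-period CTR, an assumed choice here), CUPED replaces Y with Y - theta * (X - mean(X)), where theta = cov(Y, X) / var(X). Variance, and hence required sample size, shrinks by a factor of roughly (1 - rho^2), rho = corr(Y, X). A sketch on simulated data:

```python
import numpy as np


def cuped_adjust(y: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Remove the part of metric y explained by pre-experiment
    covariate x; theta is the OLS slope of y on x, so the treatment
    effect estimate stays unbiased while its variance shrinks."""
    theta = np.cov(y, x, ddof=1)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean())


rng = np.random.default_rng(0)
x = rng.normal(size=100_000)            # pre-period covariate
y = 0.6 * x + rng.normal(size=100_000)  # in-experiment metric
# Adjusted variance is ~(1 - rho^2) of the raw variance (~1.0 vs ~1.36).
print(np.var(y), np.var(cuped_adjust(y, x)))
```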
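For the SRM check in item (4), a chi-square goodness-of-fit test compares observed assignment counts against the configured split before any metric is read; the counts below are hypothetical:

```python
from scipy.stats import chisquare

# Hypothetical observed counts under an intended 50/50 split.
n_control, n_treatment = 500_812, 498_400
total = n_control + n_treatment
stat, p = chisquare([n_control, n_treatment], f_exp=[total / 2, total / 2])
# A very small p-value (a common threshold is p < 0.001) signals a
# sample-ratio mismatch, i.e., broken bucketing or logging; results
# should not be trusted until the cause is found and fixed.
print(f"chi2={stat:.2f}, p={p:.4g}")
```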
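Also for item (4), one standard way to monitor sequentially without inflating Type I error is an alpha-spending function. The sketch below computes the Lan-DeMets O'Brien-Fleming-type schedule, which spends almost no alpha at early looks; turning the schedule into exact per-look boundaries additionally needs the joint distribution of the interim statistics, which group-sequential software handles:

```python
from scipy.stats import norm

ALPHA = 0.05
Z = norm.ppf(1 - ALPHA / 2)  # 1.96


def obf_cumulative_spend(t: float) -> float:
    """Cumulative Type I error allowed by information fraction t under
    the O'Brien-Fleming-type spending function; equals ALPHA at t = 1."""
    return 2.0 * (1.0 - norm.cdf(Z / t ** 0.5))


for t in (0.25, 0.50, 0.75, 1.00):
    print(f"t={t:.2f}  cumulative alpha spent = {obf_cumulative_spend(t):.5f}")
```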
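For item (5), one pre-registered analysis tests each segment's lift with a z-test and controls the false discovery rate across the segment family with Benjamini-Hochberg; a fuller interaction test would instead compare each segment's effect against the pooled effect. The lift estimates and standard errors below are hypothetical:

```python
import numpy as np
from scipy.stats import norm
from statsmodels.stats.multitest import multipletests

segments = ["head_queries", "tail_queries", "US", "intl", "mobile", "desktop"]
lift = np.array([0.0030, 0.0004, 0.0021, 0.0018, 0.0025, 0.0010])  # abs. CTR lift
se = np.array([0.0009, 0.0008, 0.0010, 0.0011, 0.0009, 0.0010])

pvals = 2 * norm.sf(np.abs(lift / se))  # two-sided z-test per segment
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
for seg, p, r in zip(segments, p_adj, reject):
    print(f"{seg:>12}: adjusted p = {p:.3f}  significant = {r}")
```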