Run a clean A/B test for autocomplete
Company: Etsy
Role: Data Scientist
Category: Analytics & Experimentation
Difficulty: hard
Interview Round: Technical Screen
Plan an online controlled experiment to measure the impact of ML-ranked autocomplete on user search satisfaction. Define the unit of randomization (and why), bucketing, exposure rules for typeahead across devices/sessions to prevent contamination, and a traffic ramp plan. Choose primary and guardrail metrics, specify exact formulas (e.g., session-level query success rate, time-to-first-click, p99 latency, error rate), and include CUPED or variance-reduction details. Compute the required sample size to detect a 1.0 percentage-point absolute lift from a 60.0% baseline at alpha=0.05 and power=0.80, and justify sequential monitoring without inflating Type I error. Describe how you will handle novelty and carryover effects, bots, missing logs, seasonality, and heterogeneous treatment effects (locale/device), plus a falsification check and backtest plan.
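The sample-size portion of the prompt has a standard closed form: for a two-sided two-proportion z-test with unpooled variances, the per-arm sample size is n = (z_{1-α/2} + z_{power})² · (p₁(1-p₁) + p₂(1-p₂)) / (p₂-p₁)². A minimal sketch using only the Python standard library (function name is illustrative):

```python
import math
from statistics import NormalDist

def sample_size_two_prop(p1, p2, alpha=0.05, power=0.80):
    """Per-arm sample size for a two-sided two-proportion z-test
    (unpooled-variance approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha=0.05
    z_power = NormalDist().inv_cdf(power)          # 0.84 for power=0.80
    var_sum = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_power) ** 2 * var_sum / (p2 - p1) ** 2
    return math.ceil(n)

# Detect a 1.0 pp absolute lift from a 60.0% baseline:
n_per_arm = sample_size_two_prop(0.60, 0.61)  # ~37,510 per arm (~75k total)
```

For the stated parameters this gives roughly 37,500 users per arm. Note the formula assumes independent units; if randomization is by user but the metric is session-level, the effective sample size must be deflated by a design-effect factor for within-user correlation.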
Quick Answer: This question evaluates experimental-design and causal-inference competency for online A/B testing of ML-ranked autocomplete: exact metric formulation, variance reduction (e.g., CUPED), sample-size computation (a two-proportion z-test at alpha=0.05 and power=0.80 requires roughly 37,500 users per arm to detect a 1.0 pp lift from a 60.0% baseline), and operational validity threats such as contamination, bots, seasonality, and sequential monitoring without Type I inflation.