Design an online controlled experiment for ML-ranked autocomplete and search satisfaction
You are planning an A/B test to measure the impact of an ML-ranked autocomplete (typeahead) system on user search satisfaction. Provide a complete experimental plan that covers the following:
Design and Assignment
- Define the unit of randomization and explain why it is appropriate.
- Describe bucketing (hashing, number of buckets, persistence) and identity resolution; a minimal bucketing sketch follows this list.
- Specify exposure rules across devices/sessions to prevent contamination (e.g., cross-device consistency, cache key isolation, model snapshotting) and eligibility criteria.
- Propose a traffic ramp plan (phasing, gating criteria, rollback conditions); an illustrative ramp configuration also follows this list.
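To make the bucketing and cache-isolation requirements concrete, here is a minimal sketch assuming a salted SHA-256 hash; the salt, the 1,000-bucket layout, and the function names are illustrative assumptions, not a prescribed design:

```python
import hashlib

# Hypothetical layout: 1,000 buckets, 50/50 split, salt fixed per experiment
# so a unit's assignment persists for the whole test.
NUM_BUCKETS = 1000
EXPERIMENT_SALT = "typeahead_ranker_v1"

def bucket(unit_id: str) -> int:
    """Map a stable unit id (e.g., a resolved user id) to a bucket."""
    digest = hashlib.sha256(f"{EXPERIMENT_SALT}:{unit_id}".encode()).hexdigest()
    return int(digest, 16) % NUM_BUCKETS

def variant(unit_id: str) -> str:
    """Buckets 0-499 -> treatment, 500-999 -> control."""
    return "treatment" if bucket(unit_id) < NUM_BUCKETS // 2 else "control"

def cache_key(query_prefix: str, unit_id: str, model_snapshot: str) -> str:
    """Key suggestion caches by variant and model snapshot so cached
    treatment suggestions never leak into control sessions, and a
    mid-experiment model update cannot silently change the treatment."""
    return f"{variant(unit_id)}:{model_snapshot}:{query_prefix}"
```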
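And one way a phased ramp could be written down as data; the percentages, durations, and gate thresholds here are placeholder assumptions a real plan would justify:

```python
# Illustrative ramp plan as data; every number below is an assumption.
RAMP_PLAN = [
    {"phase": "canary", "traffic_pct": 1,  "min_days": 2,
     "gate": "p99 latency delta < 10 ms and error-rate delta < 0.1 pp"},
    {"phase": "ramp",   "traffic_pct": 10, "min_days": 7,
     "gate": "no guardrail regression significant at alpha = 0.05"},
    {"phase": "full",   "traffic_pct": 50, "min_days": 14,
     "gate": "primary metric read at the planned sample size"},
]
ROLLBACK = "auto-revert if any phase breaches its latency or error-rate gate"
```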
Metrics (with exact formulas)
- Choose one primary metric and at least three guardrail metrics relevant to typeahead and search.
- Provide exact formulas for (candidate definitions are sketched after this list):
  - Session-level query success rate (primary)
  - Time to first click (TTFC)
  - p99 typeahead latency
  - Typeahead error rate
  - Any other guardrails you select (e.g., zero-results rate, abandonment, suggestion CTR)
- Include how you will handle quantiles and clustering.
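For concreteness, one plausible formalization of the requested metrics; the 30 s dwell threshold and the event names are assumptions, and a real plan should pin them to actual log fields:

```latex
% Candidate definitions; dwell threshold and event names are assumptions.
\text{QuerySuccessRate} =
  \frac{\#\{\text{sessions with} \geq 1 \text{ query ending in a click with dwell} \geq 30\,\text{s}\}}
       {\#\{\text{sessions}\}}

\text{TTFC}_q = t_{\text{first result click}}(q) - t_{\text{first keystroke}}(q),
  \quad \text{summarized as a median or p90 over clicked queries}

\text{p99 latency} = \min\{x : \hat{F}(x) \geq 0.99\},
  \quad \hat{F} = \text{empirical CDF of per-request typeahead latency}

\text{ErrorRate} =
  \frac{\#\{\text{typeahead requests returning 5xx or timing out}\}}
       {\#\{\text{typeahead requests}\}}

\text{ZeroResultsRate} =
  \frac{\#\{\text{queries with an empty result set}\}}{\#\{\text{queries}\}}
```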
Variance Reduction and Analysis
- Describe CUPED or other variance-reduction strategies (variables, window, estimator) and how they apply at the chosen analysis unit; a CUPED sketch follows this list.
- Specify the statistical test, standard errors, and clustering (a user-clustered bootstrap sketch also follows this list).
- Justify your sequential monitoring approach and how it avoids inflating Type I error.
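A minimal CUPED sketch at the user level, assuming a pre-experiment covariate such as each user's success rate over the 2-4 weeks before exposure is available:

```python
import numpy as np

def cuped_adjust(y: np.ndarray, x: np.ndarray) -> np.ndarray:
    """CUPED adjustment: y is the in-experiment metric per analysis unit,
    x the pre-experiment covariate for the same unit.
    theta = Cov(x, y) / Var(x) is the variance-minimizing coefficient."""
    cov = np.cov(x, y)                 # 2x2 sample covariance matrix
    theta = cov[0, 1] / cov[0, 0]
    return y - theta * (x - x.mean())

# The difference in arm means of cuped_adjust(y, x) estimates the same
# treatment effect as the raw difference, with variance shrunk by roughly
# a factor of (1 - corr(x, y)^2).
```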
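And one user-clustered bootstrap sketch that handles both clustering and nonlinear statistics such as the p99, for which a plain z-test standard error is invalid; the data layout here, a mapping from user id to that user's observations, is an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

def cluster_bootstrap_ci(obs_by_user, stat, n_boot=2000, alpha=0.05):
    """Percentile CI for any statistic by resampling whole users,
    respecting a user-level randomization unit. `obs_by_user` maps
    user_id -> array of that user's observations; `stat` maps a pooled
    sample to a scalar (e.g., lambda s: np.quantile(s, 0.99))."""
    users = list(obs_by_user)
    estimates = []
    for _ in range(n_boot):
        sample = rng.choice(users, size=len(users), replace=True)
        pooled = np.concatenate([obs_by_user[u] for u in sample])
        estimates.append(stat(pooled))
    lo, hi = np.quantile(estimates, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```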
Sample Size
- Compute the required sample size to detect a 1.0 percentage-point absolute lift from a 60.0% baseline at alpha = 0.05 and power = 0.80. Show the formula and the numeric result (a worked computation follows). Discuss any design-effect and CUPED adjustments.
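As a reference point, the standard two-proportion z-test formula, n per arm = (z_{1-alpha/2} + z_{power})^2 * [p1(1-p1) + p2(1-p2)] / (p2 - p1)^2, gives roughly 37,500 users per arm for these inputs, before any design-effect inflation or CUPED deflation. A quick check:

```python
from math import ceil
from scipy.stats import norm

def n_per_arm(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Two-sided two-proportion z-test sample size, unpooled variance."""
    z_a = norm.ppf(1 - alpha / 2)   # 1.960
    z_b = norm.ppf(power)           # 0.842
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_a + z_b) ** 2 * var / (p2 - p1) ** 2)

print(n_per_arm(0.60, 0.61))
# ~37,510 per arm (unpooled; the pooled-variance version gives ~37,516).
# Multiply by the design effect for clustering; divide by (1 - corr^2)
# shrinkage if CUPED is applied.
```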
Operational Risks and Validity
- Explain how you will handle novelty and carryover effects, bots and spam traffic, missing logs, seasonality, and heterogeneous treatment effects (locale, device).
- Propose at least one falsification (placebo/negative control) check and an offline backtest plan; an A/A placebo sketch follows.
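A minimal sketch of one such negative-control check: re-split the same pre-launch population into placebo arms many times and verify the rejection rate is close to alpha. The helper name and the in-memory random split are assumptions; in production the split should reuse the actual hashing path so it exercises the real assignment code:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)

def aa_false_positive_rate(metric_by_user: np.ndarray,
                           n_sims: int = 1000, alpha: float = 0.05) -> float:
    """Placebo (A/A) check: with no treatment applied, roughly alpha of
    random splits should 'reject'. A materially higher rate signals
    bucketing, logging, or clustering problems."""
    rejections = 0
    for _ in range(n_sims):
        mask = rng.random(len(metric_by_user)) < 0.5
        _, p = ttest_ind(metric_by_user[mask], metric_by_user[~mask])
        rejections += p < alpha
    return rejections / n_sims
```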