Design an online controlled experiment for ML-ranked autocomplete and search satisfaction
You are planning an A/B test to measure the impact of an ML-ranked autocomplete (typeahead) system on user search satisfaction. Provide a complete experimental plan that covers the following:
Design and Assignment
- Define the unit of randomization and explain why it is appropriate.
- Describe bucketing (hashing, number of buckets, persistence) and identity resolution; a minimal bucketing sketch follows this list.
- Specify exposure rules across devices/sessions to prevent contamination (e.g., cross-device consistency, cache key isolation, model snapshotting) and eligibility criteria.
- Propose a traffic ramp plan (phasing, gating criteria, rollback conditions); an illustrative ramp configuration also follows this list.
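To make the bucketing and cache-isolation requirements concrete, here is a minimal sketch assuming a salted SHA-256 hash; the salt, the 1,000-bucket layout, and the function names are illustrative assumptions, not a prescribed design:

```python
import hashlib

# Hypothetical layout: 1,000 buckets, 50/50 split, salt fixed per experiment
# so a unit's assignment persists for the whole test.
NUM_BUCKETS = 1000
EXPERIMENT_SALT = "typeahead_ranker_v1"

def bucket(unit_id: str) -> int:
    """Map a stable unit id (e.g., a resolved user id) to a bucket."""
    digest = hashlib.sha256(f"{EXPERIMENT_SALT}:{unit_id}".encode()).hexdigest()
    return int(digest, 16) % NUM_BUCKETS

def variant(unit_id: str) -> str:
    """Buckets 0-499 -> treatment, 500-999 -> control."""
    return "treatment" if bucket(unit_id) < NUM_BUCKETS // 2 else "control"

def cache_key(query_prefix: str, unit_id: str, model_snapshot: str) -> str:
    """Key suggestion caches by variant and model snapshot so cached
    treatment suggestions never leak into control sessions, and a
    mid-experiment model update cannot silently change the treatment."""
    return f"{variant(unit_id)}:{model_snapshot}:{query_prefix}"
```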
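And one way a phased ramp could be written down as data; the percentages, durations, and gate thresholds here are placeholder assumptions a real plan would justify:

```python
# Illustrative ramp plan as data; every number below is an assumption.
RAMP_PLAN = [
    {"phase": "canary", "traffic_pct": 1,  "min_days": 2,
     "gate": "p99 latency delta < 10 ms and error-rate delta < 0.1 pp"},
    {"phase": "ramp",   "traffic_pct": 10, "min_days": 7,
     "gate": "no guardrail regression significant at alpha = 0.05"},
    {"phase": "full",   "traffic_pct": 50, "min_days": 14,
     "gate": "primary metric read at the planned sample size"},
]
ROLLBACK = "auto-revert if any phase breaches its latency or error-rate gate"
```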
Metrics (with exact formulas)
- Choose one primary metric and at least three guardrail metrics relevant to typeahead and search.
- Provide exact formulas for (candidate definitions are sketched after this list):
  - Session-level query success rate (primary)
  - Time to first click (TTFC)
  - p99 typeahead latency
  - Typeahead error rate
  - Any other guardrails you select (e.g., zero-results rate, abandonment, suggestion CTR)
- Include how you will handle quantiles and clustering.
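For concreteness, one plausible formalization of the requested metrics; the 30 s dwell threshold and the event names are assumptions, and a real plan should pin them to actual log fields:

```latex
% Candidate definitions; dwell threshold and event names are assumptions.
\text{QuerySuccessRate} =
  \frac{\#\{\text{sessions with} \geq 1 \text{ query ending in a click with dwell} \geq 30\,\text{s}\}}
       {\#\{\text{sessions}\}}

\text{TTFC}_q = t_{\text{first result click}}(q) - t_{\text{first keystroke}}(q),
  \quad \text{summarized as a median or p90 over clicked queries}

\text{p99 latency} = \min\{x : \hat{F}(x) \geq 0.99\},
  \quad \hat{F} = \text{empirical CDF of per-request typeahead latency}

\text{ErrorRate} =
  \frac{\#\{\text{typeahead requests returning 5xx or timing out}\}}
       {\#\{\text{typeahead requests}\}}

\text{ZeroResultsRate} =
  \frac{\#\{\text{queries with an empty result set}\}}{\#\{\text{queries}\}}
```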
Variance Reduction and Analysis
- Describe CUPED or other variance-reduction strategies (variables, window, estimator) and how they apply at the chosen analysis unit; a CUPED sketch follows this list.
- Specify the statistical test, standard errors, and clustering (a user-clustered bootstrap sketch also follows this list).
- Justify your sequential monitoring approach and how it avoids inflating Type I error.
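A minimal CUPED sketch at the user level, assuming a pre-experiment covariate such as each user's success rate over the 2-4 weeks before exposure is available:

```python
import numpy as np

def cuped_adjust(y: np.ndarray, x: np.ndarray) -> np.ndarray:
    """CUPED adjustment: y is the in-experiment metric per analysis unit,
    x the pre-experiment covariate for the same unit.
    theta = Cov(x, y) / Var(x) is the variance-minimizing coefficient."""
    cov = np.cov(x, y)                 # 2x2 sample covariance matrix
    theta = cov[0, 1] / cov[0, 0]
    return y - theta * (x - x.mean())

# The difference in arm means of cuped_adjust(y, x) estimates the same
# treatment effect as the raw difference, with variance shrunk by roughly
# a factor of (1 - corr(x, y)^2).
```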
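And one user-clustered bootstrap sketch that handles both clustering and nonlinear statistics such as the p99, for which a plain z-test standard error is invalid; the data layout here, a mapping from user id to that user's observations, is an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

def cluster_bootstrap_ci(obs_by_user, stat, n_boot=2000, alpha=0.05):
    """Percentile CI for any statistic by resampling whole users,
    respecting a user-level randomization unit. `obs_by_user` maps
    user_id -> array of that user's observations; `stat` maps a pooled
    sample to a scalar (e.g., lambda s: np.quantile(s, 0.99))."""
    users = list(obs_by_user)
    estimates = []
    for _ in range(n_boot):
        sample = rng.choice(users, size=len(users), replace=True)
        pooled = np.concatenate([obs_by_user[u] for u in sample])
        estimates.append(stat(pooled))
    lo, hi = np.quantile(estimates, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```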
Sample Size
- Compute the required sample size to detect a 1.0 percentage-point absolute lift from a 60.0% baseline at alpha = 0.05 and power = 0.80. Show the formula and the numeric result (a worked computation follows). Discuss any design-effect and CUPED adjustments.
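As a reference point, the standard two-proportion z-test formula, n per arm = (z_{1-alpha/2} + z_{power})^2 * [p1(1-p1) + p2(1-p2)] / (p2 - p1)^2, gives roughly 37,500 users per arm for these inputs, before any design-effect inflation or CUPED deflation. A quick check:

```python
from math import ceil
from scipy.stats import norm

def n_per_arm(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Two-sided two-proportion z-test sample size, unpooled variance."""
    z_a = norm.ppf(1 - alpha / 2)   # 1.960
    z_b = norm.ppf(power)           # 0.842
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_a + z_b) ** 2 * var / (p2 - p1) ** 2)

print(n_per_arm(0.60, 0.61))
# ~37,510 per arm (unpooled; the pooled-variance version gives ~37,516).
# Multiply by the design effect for clustering; divide by (1 - corr^2)
# shrinkage if CUPED is applied.
```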
Operational Risks and Validity
- Explain how you will handle novelty and carryover effects, bots and spam traffic, missing logs, seasonality, and heterogeneous treatment effects (locale, device).
- Propose at least one falsification (placebo/negative control) check and an offline backtest plan; an A/A placebo sketch follows.
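A minimal sketch of one such negative-control check: re-split the same pre-launch population into placebo arms many times and verify the rejection rate is close to alpha. The helper name and the in-memory random split are assumptions; in production the split should reuse the actual hashing path so it exercises the real assignment code:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)

def aa_false_positive_rate(metric_by_user: np.ndarray,
                           n_sims: int = 1000, alpha: float = 0.05) -> float:
    """Placebo (A/A) check: with no treatment applied, roughly alpha of
    random splits should 'reject'. A materially higher rate signals
    bucketing, logging, or clustering problems."""
    rejections = 0
    for _ in range(n_sims):
        mask = rng.random(len(metric_by_user)) < 0.5
        _, p = ttest_ind(metric_by_user[mask], metric_by_user[~mask])
        rejections += p < alpha
    return rejections / n_sims
```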