Ranking/Retrieval A/B Testing And Online Metrics
Asked of: Data Scientist
Last updated

What's being tested
Ability to design, instrument, and interpret online experiments for ranking/retrieval systems: selecting the right online metrics, handling position and selection bias, choosing randomization and bucketing, and reasoning about statistical power and guardrails.
Core knowledge
- NDCG/DCG as offline relevance metrics; position-weighted and sensitive to graded relevance.
- CTR, dwell time, long-clicks, retention as online proxies; business vs engagement tradeoffs.
- Position bias and IPS (inverse propensity scoring) for debiasing click data.
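- User-level randomization to avoid contamination; impression-level randomization exposes one user to both arms and yields correlated observations that understate variance.
- Interleaving, A/B/n tests, and multi-armed bandits: interleaving and bandits give faster signal, with bandits adding an exploration-exploitation tradeoff.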
- Power calculations: baseline rate, MDE, variance, and required sample size per cohort.
- Instrumentation: consistent exposure logging (query, item-id, position, model-version, latency).
- Guardrail metrics: error rate, p95 latency, DAU/retention to catch harmful side-effects.
- Multiple testing and peeking risks: corrections, sequential testing frameworks (alpha spending).
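As a concrete reference for the first bullet, here is a minimal sketch of DCG and NDCG. It assumes the common exponential gain 2^rel - 1 and a log2 rank discount; other gain functions are also in use, and the function names `dcg` and `ndcg` are illustrative rather than taken from any particular library.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain with gain 2^rel - 1 and a log2 rank discount."""
    # enumerate() is 0-based, so the item at rank r (1-based) is discounted by log2(r + 1).
    return sum((2 ** rel - 1) / math.log2(index + 2)
               for index, rel in enumerate(relevances))


def ndcg(relevances, k=None):
    """DCG of the list as ranked, divided by the DCG of the ideal ordering, cut at k."""
    k = len(relevances) if k is None else k
    ideal_dcg = dcg(sorted(relevances, reverse=True)[:k])
    # If no item has positive relevance, NDCG is undefined; return 0.0 by convention.
    return dcg(relevances[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0
```

A list already sorted by graded relevance, such as `ndcg([3, 2, 1, 0])`, scores 1.0; moving the most relevant item down lowers the score more than the same move applied to a weakly relevant item, which is the position weighting the bullet refers to.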
Worked example — "Design an A/B test to evaluate a new ranking model for feed"
Start by defining a single primary online metric aligned to product goals (e.g., 7-day retained users or session length) and 2–3 guardrails (CTR, p95 latency, error rate). Randomize at user level with a keyed hash of a stable user ID so that assignment stays consistent across devices and over time; avoid impression-level buckets. Size traffic from the baseline metric, its variance, and the minimum detectable effect (MDE), and choose a duration that spans whole weekly cycles so that seasonality does not distort the comparison. Instrument complete exposure streams (query id, candidates, model version, rank, clicks, dwell) so you can re-aggregate by item and debias with IPS if needed. Plan ramping (small percent → safety monitors → full rollout) and predefine heterogeneity analyses (device, geography, new vs. active users).
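The assignment and sizing steps can be sketched as follows. This is an illustration under stated assumptions, not a production design: the primary metric is treated as a per-user binary outcome (retained or not), the sample size uses the standard two-sided normal approximation for two proportions, and the names `assign_arm`, `users_per_arm`, and the experiment key are hypothetical.

```python
import hashlib
import hmac
import math
from statistics import NormalDist


def assign_arm(user_id: str, experiment_key: bytes, treatment_share: float = 0.5) -> str:
    """Deterministic user-level assignment: same user and key always give the same arm."""
    # Keying the hash per experiment keeps bucket membership independent across experiments.
    digest = hmac.new(experiment_key, user_id.encode("utf-8"), hashlib.sha256).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # integer bucket in 0..9999
    return "treatment" if bucket < round(treatment_share * 10_000) else "control"


def users_per_arm(baseline_rate: float, relative_mde: float,
                  alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed per arm to detect a relative lift in a proportion."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# Hypothetical inputs: 40% baseline 7-day retention, 2% relative MDE.
arm = assign_arm("user-123", b"feed-ranker-v2")
needed = users_per_arm(baseline_rate=0.40, relative_mde=0.02)
```

With these hypothetical inputs the formula asks for roughly 59,000 users per arm. Halving the MDE roughly quadruples that number, which is why the MDE should be agreed on before traffic is allocated.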
A common pitfall
Focusing only on immediate CTR uplift and declaring victory is tempting but misleading: CTR gains can come from clickbait or novelty effects while harming long-term retention, latency, or content diversity. Similarly, randomizing by impression contaminates user-level metrics, and missing exposure logs make debiasing impossible; either one undermines causal claims. Always check guardrails, pre-specify the test duration, and ensure instrumentation captures the full interaction trace.
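To see why the logged position matters, here is a minimal sketch of an IPS-weighted click rate per item. It assumes a position-based click model in which a click requires the position to be examined, and it assumes the examination propensities per position are already known (for example, estimated from a separate randomization experiment). The function name `ips_click_rate`, the field names, and the numbers are all hypothetical.

```python
def ips_click_rate(log, propensity, clip=10.0):
    """Position-debiased click rate per item from logged exposures.

    Each click is weighted by 1 / P(position examined); weights are clipped
    to limit variance, which introduces some bias when clipping is active.
    """
    weighted_clicks = {}
    impressions = {}
    for row in log:
        item = row["item_id"]
        weight = min(1.0 / propensity[row["position"]], clip)
        weighted_clicks[item] = weighted_clicks.get(item, 0.0) + row["clicked"] * weight
        impressions[item] = impressions.get(item, 0) + 1
    return {item: weighted_clicks[item] / impressions[item] for item in impressions}


# Hypothetical log: item "a" always shown at position 1, item "b" always at position 3.
example_propensity = {1: 1.0, 3: 0.25}
example_log = (
    [{"item_id": "a", "position": 1, "clicked": 1}]
    + [{"item_id": "a", "position": 1, "clicked": 0}]
    + [{"item_id": "b", "position": 3, "clicked": 1}]
    + [{"item_id": "b", "position": 3, "clicked": 0}] * 7
)
estimates = ips_click_rate(example_log, example_propensity)
```

In this toy log the raw CTR of item "a" (1 click in 2 impressions) is four times that of item "b" (1 click in 8), yet after weighting by position the two estimates are equal. Without the position field in the exposure log, that correction cannot be computed at all.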
Further reading
- Kohavi, Tang, and Xu, "Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing" (practical guidance on experiments and pitfalls).
- Joachims et al., "Unbiased Learning-to-Rank with Biased Feedback" (techniques for debiasing click data).