A/B Testing for Retrieval and Ranking (Search/Feed)
Asked of: Data Scientist
Last updated

-
What it is
A/B testing for retrieval and ranking compares two or more candidate systems for fetching items (retrieval) and ordering them (ranking) by randomly allocating real traffic and measuring the impact on metrics that matter to users and the business. In search and feeds, it answers questions like “Does this new recall filter or ranker improve relevance without hurting session health or revenue?”
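As a rough illustration of random allocation, the sketch below assigns a randomization unit to an arm by hashing its ID together with an experiment name. The function name `assign_arm`, the salt format, and the default 50/50 split are assumptions made for illustration, not any particular platform's API.

```python
import hashlib


def assign_arm(unit_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a unit (user, session, query) to an arm.

    Hypothetical helper: the names and the salt format are illustrative only.
    """
    # Salting with the experiment name keeps assignments in different
    # experiments from being correlated with each other.
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode("utf-8")).hexdigest()
    # Interpret the first 8 hex digits as a number in [0, 1).
    bucket = int(digest[:8], 16) / 16**8
    return "treatment" if bucket < treatment_share else "control"
```

Because the mapping is deterministic, the same unit sees the same arm on every request, which is what keeps the comparison clean over a multi-day test.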
-
Why interviewers ask about it
Data Scientists at companies like Meta are expected to design trustworthy experiments for ranking-heavy products (Feed, Search, Reels) and translate results into product decisions. Interviewers look for signal on experiment design under interference, metric design (short- vs long-term), and techniques that detect improvements quickly without shipping regressions to millions of people. (engineering.fb.com)
-
Core ideas to know
- Define an Overall Evaluation Criterion (OEC) plus guardrails (e.g., quality, integrity, latency, ads) before you launch. (cambridge.org)
- Randomize at the right unit to avoid contamination: user-level for feeds; often query/session-level for search.
- Use variance reduction (pre-experiment covariates, as in CUPED) and size the sample for the minimum effect you need to detect; avoid peeking at interim results unless using anytime-valid methods.
- Retrieval metrics (Recall@K, MRR) differ from ranking metrics (NDCG, clicks, dwell); align them with your OEC.
- Interleaving (e.g., Team Draft) mixes two ranked lists per request to get an online preference signal faster and with lower variance than a full A/B test. (microsoft.com)
- Counterfactual/offline checks (propensity-weighted estimators) help screen rankers before risking online traffic. (arxiv.org)
- Health checks: Sample Ratio Mismatch (SRM), event-loss audits, and latency budgets catch instrumentation/platform issues early. (cambridge.org)
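To make the retrieval-versus-ranking metric distinction in the list above concrete, here is a minimal sketch of Recall@K, reciprocal rank (averaged over queries to get MRR), and NDCG@K. The function names are illustrative, and the NDCG variant uses linear gains with a log2 discount; other gain functions (such as 2^rel - 1) are also common.

```python
import math
from typing import Mapping, Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / position of the first relevant item; 0.0 if none is retrieved."""
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0


def ndcg_at_k(ranked: Sequence[str], gains: Mapping[str, float], k: int) -> float:
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    dcg = sum(
        gains.get(item, 0.0) / math.log2(position + 1)
        for position, item in enumerate(ranked[:k], start=1)
    )
    ideal_gains = sorted(gains.values(), reverse=True)[:k]
    ideal_dcg = sum(
        gain / math.log2(position + 1)
        for position, gain in enumerate(ideal_gains, start=1)
    )
    return dcg / ideal_dcg if ideal_dcg > 0 else 0.0
```

Recall@K ignores order within the top K, which suits the candidate-generation stage; NDCG rewards putting the highest-gain items first, which suits the final ranker.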
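The variance-reduction idea in the list above can be sketched with a CUPED-style adjustment: subtract the part of the metric that a pre-experiment covariate predicts. This is a simplified sketch assuming one covariate per unit and a coefficient estimated on the pooled data from both arms; `cuped_adjust` is an illustrative name.

```python
from statistics import fmean
from typing import List, Sequence


def cuped_adjust(metric: Sequence[float], covariate: Sequence[float]) -> List[float]:
    """Return metric values adjusted by a pre-experiment covariate.

    theta = cov(covariate, metric) / var(covariate), estimated on pooled data.
    """
    if len(metric) != len(covariate) or len(metric) < 2:
        raise ValueError("need two equal-length samples with at least 2 units")
    mean_x, mean_y = fmean(covariate), fmean(metric)
    var_x = sum((x - mean_x) ** 2 for x in covariate)
    if var_x == 0:
        # A constant covariate carries no information; leave the metric as is.
        return list(metric)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(covariate, metric))
    theta = cov_xy / var_x
    # Centering the covariate keeps the overall mean of the metric unchanged.
    return [y - theta * (x - mean_x) for x, y in zip(covariate, metric)]
```

The covariate must be measured before exposure (for example, the same metric in the weeks before the test) so that the treatment cannot influence it. The arm-level comparison is then run on the adjusted values.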
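For the interleaving idea in the list above, the following is a simplified sketch of Team Draft interleaving. It assumes item IDs are non-empty strings and credits each click to the ranker whose team contributed the clicked item; the function names and the fallback when one list runs out are illustrative choices.

```python
import random
from typing import Dict, List, Optional, Sequence, Tuple


def team_draft_interleave(
    list_a: Sequence[str],
    list_b: Sequence[str],
    length: int,
    rng: Optional[random.Random] = None,
) -> Tuple[List[str], Dict[str, str]]:
    """Build one interleaved list and record which team supplied each item."""
    if rng is None:
        rng = random.Random()
    sources = {"A": list_a, "B": list_b}
    picks = {"A": 0, "B": 0}
    interleaved: List[str] = []
    team_of: Dict[str, str] = {}

    def next_unseen(items: Sequence[str]) -> Optional[str]:
        return next((item for item in items if item not in team_of), None)

    while len(interleaved) < length:
        # The team with fewer picks goes next; ties are broken by a coin flip.
        if picks["A"] == picks["B"]:
            first = "A" if rng.random() < 0.5 else "B"
        else:
            first = "A" if picks["A"] < picks["B"] else "B"
        second = "B" if first == "A" else "A"
        for team in (first, second):
            item = next_unseen(sources[team])
            if item is not None:
                interleaved.append(item)
                team_of[item] = team
                picks[team] += 1
                break
        else:
            break  # both lists are exhausted
    return interleaved, team_of


def credit_clicks(clicked: Sequence[str], team_of: Dict[str, str]) -> Dict[str, int]:
    """Count clicks per team; the team with more clicks wins the impression."""
    wins = {"A": 0, "B": 0}
    for item in clicked:
        team = team_of.get(item)
        if team is not None:
            wins[team] += 1
    return wins
```

Aggregating per-impression wins across many requests gives a preference between the two rankers. Because each user sees both rankers in the same list, between-user variance largely cancels out.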
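The SRM health check in the list above can be sketched as a chi-square goodness-of-fit test on arm counts. This minimal version assumes two arms and uses only the standard library; `srm_p_value` and the alert threshold mentioned afterwards are illustrative, not a platform standard.

```python
import math


def srm_p_value(
    n_control: int, n_treatment: int, expected_treatment_share: float = 0.5
) -> float:
    """P-value of a two-arm chi-square test against the planned split."""
    total = n_control + n_treatment
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi2 = (
        (n_control - expected_control) ** 2 / expected_control
        + (n_treatment - expected_treatment) ** 2 / expected_treatment
    )
    # With 1 degree of freedom, the chi-square survival function
    # equals erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))
```

A very small p-value (teams often alert on a strict cutoff such as 0.001) suggests the assignment or logging pipeline is broken, in which case the metric results should not be trusted until the cause is found.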
-
A common pitfall
Candidates optimize for short-term click metrics and declare victory, only to learn a week later that the change degraded session quality or creator/advertiser outcomes. Novelty and carryover effects are real in feeds: a shiny ranker can spike engagement temporarily while increasing hide/report rates or depressing long-term retention. Without a clear OEC, guardrails, and post-experiment holdout validation, teams ship changes that “win” the experiment but harm ecosystem health. Mention how you’d detect and mitigate these: pre-registered metrics, ramp plans, holdbacks, and follow-up A/A tests. (cambridge.org)
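One simple way to look for a novelty effect is to track the treatment lift by day and check whether it decays. The sketch below assumes you already have per-day metric means for each arm; the helper names are illustrative, and a real analysis would also put confidence intervals around the daily lifts.

```python
from typing import List, Sequence


def relative_lifts(
    control_by_day: Sequence[float], treatment_by_day: Sequence[float]
) -> List[float]:
    """Per-day relative lift of treatment over control."""
    return [(t - c) / c for c, t in zip(control_by_day, treatment_by_day)]


def linear_slope(values: Sequence[float]) -> float:
    """Least-squares slope of the values against the day index 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two days")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator
```

A clearly negative slope on an engagement lift is only suggestive of novelty wearing off; it is a prompt to extend the test or keep a holdback, not proof on its own.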
-
Further reading
- Trustworthy Online Controlled Experiments (Kohavi, Tang, Xu): canonical playbook on OECs, SRM, novelty/carryover, and platform checks. (cambridge.org)
- Large-Scale Validation and Analysis of Interleaved Search Evaluation (Chapelle, Joachims, Radlinski, Yue): why interleaving detects ranking differences quickly and robustly. (microsoft.com)
- News Feed ranking, powered by machine learning (Meta Engineering): concrete view of large-scale feed ranking systems and constraints your experiments must respect. (engineering.fb.com)