A/B Testing for Retrieval and Ranking (Search/Feed) — Tech Interview Concept

What it is

A/B testing for retrieval and ranking compares two or more candidate systems for getting items (retrieval) and ordering them (ranking) by randomly allocating real traffic and measuring impact on user- and business-meaningful metrics. In search and feeds, it answers questions like “Does this new recall filter or ranker improve relevance without hurting session health or revenue?”

Why interviewers ask about it

Data Scientists at companies like Meta are expected to design trustworthy experiments for ranking-heavy products (Feed, Search, Reels) and translate results into product decisions. They look for signal on experiment design under interference, metric design (short- vs long-term), and techniques that detect improvements quickly without shipping regressions to millions of people. (engineering.fb.com)

Core ideas to know

Define an Overall Evaluation Criterion (OEC) plus guardrails (e.g., quality, integrity, latency, ads) before you launch. (cambridge.org)
Randomize at the right unit to avoid contamination: user-level for feeds; often query/session-level for search.
Use variance reduction with pre-experiment covariates (e.g., CUPED) and size the sample for the smallest effect you care about detecting; avoid peeking at interim results unless you use sequential or anytime-valid methods.
Retrieval metrics (Recall@K, MRR) differ from ranking metrics (NDCG, clicks, dwell); align them with your OEC.
Interleaving (e.g., Team Draft) mixes two ranked lists per request to get faster, lower-variance online preferences than full A/B. (microsoft.com)
Counterfactual/offline checks (propensity-weighted estimators) help screen rankers before risking online traffic. (arxiv.org)
Health checks: Sample Ratio Mismatch (SRM), event-loss audits, and latency budgets catch instrumentation/platform issues early. (cambridge.org)
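
The SRM health check above can be sketched as a chi-square goodness-of-fit test on assignment counts. This is a minimal sketch, not a platform implementation: the function name `srm_check`, the 50/50 expected split, and the strict 0.001 alert threshold are illustrative assumptions rather than anything prescribed by the sources cited here.

```python
import math


def srm_check(n_control, n_treatment, expected_treatment_share=0.5, alpha=0.001):
    """Chi-square goodness-of-fit test (1 degree of freedom) on assignment counts.

    Returns (chi2, p_value, flagged). The threshold alpha=0.001 is an assumed,
    deliberately strict default; tune it to your own alerting policy.
    """
    if not 0.0 < expected_treatment_share < 1.0:
        raise ValueError("expected_treatment_share must be strictly between 0 and 1")
    total = n_control + n_treatment
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi2 = (
        (n_control - expected_control) ** 2 / expected_control
        + (n_treatment - expected_treatment) ** 2 / expected_treatment
    )
    # With 1 degree of freedom, the chi-square survival function
    # reduces to erfc(sqrt(chi2 / 2)), so no SciPy dependency is needed.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value, p_value < alpha


# Hypothetical counts: a 50,000 vs 52,000 split under an intended 50/50 design.
chi2, p_value, flagged = srm_check(50_000, 52_000)
print(f"chi2={chi2:.2f}, p={p_value:.3g}, SRM flagged={flagged}")
```

If the check flags a mismatch, the usual reading is that assignment or logging is broken, so the metric results should not be trusted until the cause is found.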

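The Team Draft interleaving mentioned in the core ideas can be sketched as follows. This is a simplified illustration under stated assumptions: the function names, the document IDs, and the per-click credit rule are hypothetical, and production systems add deduplication, logging, and click-quality filters that are not shown.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, k, rng):
    """Build one interleaved list of up to k items from two rankings.

    Each round, the team with fewer picks so far goes next (coin flip on
    ties) and contributes its highest-ranked item not already shown.
    Returns the interleaved list and a map item -> "A" or "B".
    """
    interleaved = []
    team_of = {}
    picks = {"A": 0, "B": 0}

    def next_unused(ranking):
        return next((item for item in ranking if item not in team_of), None)

    while len(interleaved) < k:
        next_a = next_unused(ranking_a)
        next_b = next_unused(ranking_b)
        if next_a is None or next_b is None:
            break  # stop once either ranker has nothing new to offer
        a_goes = picks["A"] < picks["B"] or (
            picks["A"] == picks["B"] and rng.random() < 0.5
        )
        side, item = ("A", next_a) if a_goes else ("B", next_b)
        team_of[item] = side
        interleaved.append(item)
        picks[side] += 1
    return interleaved, team_of


def credit_clicks(clicked_items, team_of):
    """Credit each click to the team that contributed the clicked item."""
    wins = {"A": 0, "B": 0}
    for item in clicked_items:
        if item in team_of:
            wins[team_of[item]] += 1
    return wins


# Hypothetical rankings from a control ranker (A) and a candidate ranker (B).
rng = random.Random(0)
shown, team_of = team_draft_interleave(
    ["d1", "d2", "d3", "d4"], ["d3", "d1", "d5", "d6"], k=4, rng=rng
)
print(shown, credit_clicks([shown[0]], team_of))
```

Aggregating the per-request win counts across many requests gives the online preference between the two rankers. Because every user sees both rankers in the same list, between-user variance largely cancels out.
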
A common pitfall

Candidates optimize for short-term click metrics and declare victory, only to learn the change degraded session quality or creator/advertiser outcomes a week later. Novelty and carryover effects are real in feeds: a shiny ranker can spike engagement temporarily while increasing hide/report rates or depressing long-term retention. Without a clear OEC, guardrails, and post-experiment holdout validation, teams ship regressions that “win” online but harm ecosystem health. Mention how you’d detect and mitigate these (pre-registered metrics, ramp plans, holdbacks, and follow-up A/A tests). (cambridge.org)
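
One simple way to look for the novelty effect described above is to track the treatment lift day by day and check whether it decays. The sketch below is illustrative only: the function names and the daily metric values are hypothetical, and a real analysis would add confidence intervals and cohort users by first-exposure date.

```python
def daily_relative_lift(control_by_day, treatment_by_day):
    """Relative lift (treatment - control) / control for each experiment day."""
    return [(t - c) / c for c, t in zip(control_by_day, treatment_by_day)]


def lift_trend_slope(lifts):
    """Ordinary least-squares slope of lift against day index (0, 1, 2, ...).

    A clearly negative slope suggests the early lift is fading, which is
    one symptom of a novelty effect.
    """
    n = len(lifts)
    if n < 2:
        raise ValueError("need at least two days of data")
    days = range(n)
    mean_day = sum(days) / n
    mean_lift = sum(lifts) / n
    sxx = sum((d - mean_day) ** 2 for d in days)
    sxy = sum((d - mean_day) * (y - mean_lift) for d, y in zip(days, lifts))
    return sxy / sxx


# Hypothetical daily means of an engagement metric over a five-day test.
control = [10.0, 10.0, 10.0, 10.0, 10.0]
treatment = [11.0, 10.8, 10.6, 10.4, 10.2]
lifts = daily_relative_lift(control, treatment)
print([round(x, 3) for x in lifts], round(lift_trend_slope(lifts), 4))
```

A lift that starts high and shrinks toward zero is a reason to extend the test or keep a long-running holdback before declaring a win.
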
Further reading

Trustworthy Online Controlled Experiments (Kohavi, Tang, Xu): canonical playbook on OECs, SRM, novelty/carryover, and platform checks. (cambridge.org)
Large-Scale Validation and Analysis of Interleaved Search Evaluation (Chapelle, Joachims, Radlinski, Yue): why interleaving detects ranking differences quickly and robustly. (microsoft.com)
News Feed ranking, powered by machine learning (Meta Engineering): concrete view of large-scale feed ranking systems and constraints your experiments must respect. (engineering.fb.com)

Related concepts