Interleaving and Multileaving for Ranking Comparison at Scale
Asked of: Data Scientist

-
What it is
Interleaving blends the results of two ranking models into a single list for a user session and attributes each click to the model that supplied the clicked item. Because every user sees results from both rankers on the same page, position and user-mix biases largely cancel, which makes interleaving a fast, low-traffic way to tell which ranker is better. Multileaving generalizes this to compare more than two rankers at once. (microsoft.com)
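The blend-and-attribute idea can be sketched with team-draft interleaving, one of the methods listed under Core ideas. This is a minimal illustration rather than a production implementation: the function names (`team_draft_interleave`, `credit_clicks`), the `"A"`/`"B"` labels, and the document IDs are all made up for the example.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, k, rng=None):
    """Blend two rankings into one list of up to k items.

    Returns (interleaved, team), where team maps each shown item to the
    single ranker ("A" or "B") credited with it.
    """
    rng = rng or random.Random()
    rankings = {"A": ranking_a, "B": ranking_b}
    picks = {"A": 0, "B": 0}  # how many items each ranker has contributed
    interleaved, team = [], {}

    def next_unused(name):
        # Highest-ranked item from this ranker that is not already shown.
        return next((d for d in rankings[name] if d not in team), None)

    while len(interleaved) < k:
        # The ranker with fewer picks goes next; ties are broken by a coin flip.
        if picks["A"] != picks["B"]:
            order = ["A", "B"] if picks["A"] < picks["B"] else ["B", "A"]
        else:
            order = rng.sample(["A", "B"], 2)
        for name in order:
            item = next_unused(name)
            if item is not None:
                break
        else:
            break  # both rankings are exhausted
        interleaved.append(item)
        team[item] = name  # each shown item has exactly one owner
        picks[name] += 1
    return interleaved, team

def credit_clicks(clicked_items, team):
    """Count clicks per ranker; clicks on items not in the list are ignored."""
    credit = {"A": 0, "B": 0}
    for item in clicked_items:
        if item in team:
            credit[team[item]] += 1
    return credit

# Hypothetical session: "d1" and "d2" are proposed by both rankers.
shown, team = team_draft_interleave(
    ["d1", "d2", "d3", "d4"], ["d2", "d5", "d1", "d6"], k=4,
    rng=random.Random(0),
)
print(shown, team, credit_clicks(shown[:1], team))
```

Storing ownership in a dictionary keyed by item means a document proposed by both rankers is shown once and credited to whichever ranker picked it first.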
-
Why interviewers ask about it
At companies where experiments are costly and A/B bandwidth is limited (e.g., large e-commerce sites or social feeds), interleaving speeds up iteration on ranking quality. Teams report large speedups over traditional A/B tests when detecting small deltas (Airbnb cites up to 50× faster iteration), which lets them gate expensive launches or decide which candidates deserve a full-rollout test. (medium.com)
-
Core ideas to know
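- Team Draft Interleaving builds the list in rounds: a coin flip decides which ranker picks first, each ranker then adds its highest-ranked item not already shown, and every shown item is credited to the ranker that picked it. (microsoft.com)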
- Credit assignment aims for unbiasedness; variants include Balanced, Probabilistic, and Optimized Interleaving. (microsoft.com)
- Shared-page exposure reduces variance versus A/B, detecting smaller effects with fewer users/sessions. (microsoft.com)
- Multileaving compares k>2 rankers simultaneously; examples: Team Draft Multileave and Optimized Multileave. (cs.ox.ac.uk)
- Handle duplicates and presentation: dedupe identical items and adapt for non-list UIs (grids/maps, federated pages). (cs.ox.ac.uk)
- Use it when the outcome is click-level or short-horizon; for long-lag conversions, combine it with A/B testing or counterfactual evaluation. (medium.com)
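Per-session click credits still have to be rolled up into a verdict. One common choice, assumed here rather than taken from the sources above, is to name a winner per session and run a two-sided sign test on the non-tied sessions. The helper names and the session data below are hypothetical.

```python
from math import comb

def session_winner(credit):
    """Return "A", "B", or None (tie) from one session's click credits."""
    if credit["A"] > credit["B"]:
        return "A"
    if credit["B"] > credit["A"]:
        return "B"
    return None

def sign_test(wins_a, wins_b):
    """Two-sided exact binomial p-value for H0: a non-tied session is
    equally likely to favor A or B. Tied sessions are excluded."""
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    k = max(wins_a, wins_b)
    # Probability of k or more wins out of n under a fair coin.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical credits from twelve interleaved sessions.
sessions = (
    [{"A": 1, "B": 0}] * 9 + [{"A": 0, "B": 1}] * 1 + [{"A": 1, "B": 1}] * 2
)
winners = [session_winner(c) for c in sessions]
wins_a, wins_b = winners.count("A"), winners.count("B")
p_value = sign_test(wins_a, wins_b)
print(wins_a, wins_b, p_value)
```

Scoring at the session level keeps one heavy clicker from dominating the comparison. Other credit functions, such as weighting clicks by downstream conversions, fit the same structure.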
-
A common pitfall
Candidates often describe the merge but ignore attribution edge cases. If both rankers propose the same item, failing to assign it to exactly one source causes “credit leakage” and bias. Another miss is ignoring collisions in multileaving: when the number of rankers approaches the number of results per page, each ranker contributes only a slot or two, which weakens pairwise discriminability. Finally, people claim interleaving “fixes” all click biases; it only cancels biases shared across the page, and it still needs sensible click models and guardrails. (microsoft.com)
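Both attribution pitfalls show up in a small team-draft multileave sketch. The code below is illustrative only: `team_draft_multileave`, the ranker names, and the item IDs are invented for the example, and real systems differ in how they order picks and handle ties.

```python
import random
from collections import Counter

def team_draft_multileave(rankings, k, rng=None):
    """rankings: dict of ranker name -> ranked list of items.

    Returns (shown, owner), where owner maps each shown item to the one
    ranker credited with it.
    """
    rng = rng or random.Random()
    names = list(rankings)
    shown, owner = [], {}
    while len(shown) < k:
        added = False
        # Each round visits every ranker once, in a fresh random order.
        for name in rng.sample(names, len(names)):
            item = next((d for d in rankings[name] if d not in owner), None)
            if item is None:
                continue  # this ranker has nothing new to offer
            shown.append(item)
            owner[item] = name  # exactly one source per item: no credit leakage
            added = True
            if len(shown) == k:
                break
        if not added:
            break  # every ranking is exhausted
    return shown, owner

# Five hypothetical rankers that all propose "d1" first, on a five-slot page.
rankings = {
    f"r{i}": ["d1"] + [f"r{i}_doc{j}" for j in range(5)] for i in range(5)
}
shown, owner = team_draft_multileave(rankings, k=5, rng=random.Random(0))
slots = Counter(owner.values())
print(shown, dict(slots))
```

With five rankers and five slots, each ranker owns a single item, so any pairwise comparison rests on at most two positions per session. That is the discriminability loss described above.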
-
Further reading
- Radlinski & Craswell (WSDM 2013), Optimized Interleaving — Formalizes properties and an optimized algorithm; great for understanding bias/efficiency trade-offs. https://www.microsoft.com/en-us/research/wp-content/uploads/2013/02/Radlinski_Optimized_WSDM2013.pdf.pdf
- Schuth et al. (CIKM 2014), Multileaved Comparisons for Fast Online Evaluation — Introduces multileaving and analyzes accuracy, bias, and scalability. https://www.cs.ox.ac.uk/people/shimon.whiteson/pubs/schuthcikm14.pdf
- Airbnb Tech Blog (2022), Beyond A/B Test: Speeding up Airbnb Search Ranking Experimentation through Interleaving — Practical system details; reports up to 50× faster iteration. https://medium.com/airbnb-engineering/beyond-a-b-test-speeding-up-airbnb-search-ranking-experimentation-through-interleaving-7087afa09c8e