Ranking Metrics and Online Evaluation — Tech Interview Concept

What it is

Ranking metrics are offline measures of how well a system orders items for a query or user, computed from relevance labels or judgments (examples: NDCG, MAP, MRR, Precision@K). Online evaluation measures real user impact through controlled exposure (A/B tests) or faster head-to-head methods like interleaving, and it informs whether a new ranker should ship.

Why interviewers ask about it

Data scientists at product companies need to turn offline gains into business impact without hurting guardrail metrics. Interviewers want to see you choose the right metric, explain trade-offs, and design trustworthy experiments that detect real lifts for feeds, search, or recommendations under latency and risk constraints.

Core ideas to know

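NDCG@k compares the discounted gain of your top-k ordering to that of the ideal ordering; MAP averages precision at the rank of each relevant item, then averages across queries; MRR depends only on the rank of the first relevant hit.
DCG vs NDCG: DCG reflects absolute utility; NDCG divides by the ideal DCG, so queries with different numbers or grades of relevant items become comparable. Pick what matches user value.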
Offline ≠ online: define an Overall Evaluation Criterion (OEC) and guardrails before launching; treat offline metrics as gates, not goals.
Online methods: A/B tests estimate absolute lift; interleaving quickly detects preference between rankers with less traffic.
Click logs are biased (position bias, selection bias). Use counterfactual estimators such as inverse propensity scoring (IPS) or doubly robust (DR) estimation for safer offline comparisons.
Power, ramping, and stopping rules matter; avoid peeking and metric fishing to prevent false “wins.”
Watch for interference in social products; consider cluster randomization when users affect each other.
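
The metric definitions above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names are hypothetical, it assumes linear gain with a log2 discount (some variants use 2^rel - 1 as the gain), and it assumes the ranked list contains every judged item, so the ideal ordering and the count of relevant items are taken from the list itself.

```python
import math


def dcg_at_k(relevances, k):
    """DCG@k with linear gain: sum of rel / log2(rank + 1) over ranks 1..k."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k):
    """DCG@k divided by the DCG@k of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def reciprocal_rank(relevances):
    """1 / rank of the first relevant item; 0.0 if nothing is relevant."""
    for rank, rel in enumerate(relevances, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0


def average_precision(relevances):
    """Mean of precision@rank taken at each relevant item's rank.

    Assumption: every relevant item appears somewhere in the list.
    """
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(relevances, start=1):
        if rel > 0:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


if __name__ == "__main__":
    ranked = [3, 2, 0, 1]  # graded relevance labels in ranked order
    print("DCG@4:", dcg_at_k(ranked, 4))
    print("NDCG@4:", ndcg_at_k(ranked, 4))
    print("RR:", reciprocal_rank(ranked))
    print("AP:", average_precision(ranked))
```

MRR and MAP are the means of `reciprocal_rank` and `average_precision` over a set of queries. Note that NDCG is 1.0 for any list that is already in ideal order, which is why DCG can be the more informative number when absolute utility matters.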

A common pitfall

Candidates recite metric formulas but don’t connect them to product objectives. For example, they celebrate a higher NDCG without checking if it improves the OEC or hurts guardrails like session health or creator supply. Others A/B test on thin traffic, peek repeatedly, or choose a misaligned primary metric (e.g., CTR that increases accidental clicks). Strong answers tie metric choice to user utility, quantify detectable effect sizes, and explain how to mitigate bias and interference.

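
To make "quantify detectable effect sizes" concrete, a rough per-arm sample size for a two-sided test on a conversion-style rate can be estimated with the standard normal approximation for two proportions. This is a sketch under stated assumptions: the function name and the default alpha and power are illustrative choices, users are treated as independent (no interference), and traffic is split 50/50.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, relative_lift, alpha=0.05, power=0.8):
    """Approximate users needed per arm to detect a relative lift in a rate.

    Uses n = (z_{1-alpha/2} + z_{power})^2 * (p1(1-p1) + p2(1-p2)) / (p2-p1)^2.
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


if __name__ == "__main__":
    # Hypothetical numbers: 10% baseline CTR, 5% relative lift.
    print(sample_size_per_arm(0.10, 0.05))
```

Because the required sample grows roughly with the inverse square of the lift, halving the minimum detectable effect needs about four times the traffic. That is one reason thin-traffic tests so often produce unreliable "wins."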
Further reading

Evaluate vector search retrieval quality (Microsoft Learn/Databricks) — clear, up-to-date definitions of DCG/NDCG/MAP/MRR and why DCG can be preferable operationally. (learn.microsoft.com)
Trustworthy Online Controlled Experiments (Kohavi, Tang, Xu, Cambridge 2020) — industry-standard guide to OECs, guardrails, pitfalls, and platform practices. (cambridge.org)
Optimized Interleaving for Online Retrieval Evaluation (Radlinski & Craswell, WSDM 2013) — foundational paper on interleaving’s sensitivity and unbiased comparison of rankers. (microsoft.com)

Related concepts