Retrieval Quality And Offline Ranking Metrics

What's being tested

Interviewers are probing whether you can evaluate retrieval and ranking systems before launching them, especially when online A/B tests are expensive, risky, or slow. At Meta, many products depend on multi-stage ranking pipelines: search, Feed, Reels, ads, friend recommendations, Marketplace, notifications, and integrity classifiers all need to retrieve a manageable candidate set and order it well. The key skill is not reciting metric definitions, but choosing metrics that match the product goal, label quality, ranking stage, and business constraint. Strong candidates can explain where offline metrics are useful, where they fail, and how they connect to online outcomes like engagement, retention, revenue, creator health, or user trust.

Core knowledge

Retrieval and ranking are usually separate stages. Retrieval optimizes for finding a broad candidate set from millions or billions of items; ranking optimizes ordering among hundreds or thousands. Retrieval quality is often measured by recall@K, while final rankers are evaluated with NDCG, MRR, MAP, AUC, calibration, or task-specific utility.
Precision@K is $P@K = \frac{\text{relevant items in top }K}{K}.$ It is intuitive for surfaces with fixed slots, like top 10 search results, but ignores relevant items below K and treats all top-K positions equally unless paired with a position-aware metric.
Recall@K is $R@K = \frac{\text{relevant items retrieved in top }K}{\text{total relevant items}}.$ It is critical for candidate generation: if retrieval misses a good item, downstream ranking cannot recover it. In large-scale systems, retrieval recall is often computed against an expensive exact nearest-neighbor or full-corpus scoring baseline.
Mean Reciprocal Rank is $MRR = \frac{1}{|Q|}\sum_{q \in Q}\frac{1}{\text{rank}_q},$ where $\text{rank}_q$ is the position of the first relevant result. It fits navigational search or “find one good answer” tasks, but is weak when users value multiple relevant results, diversity, or long sessions.
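DCG uses graded relevance and positional discounting: $DCG@K = \sum_{i=1}^{K}\frac{2^{rel_i}-1}{\log_2(i+1)},$ where $rel_i$ is the graded relevance of the item at position $i$. NDCG normalizes by the ideal DCG, which is the DCG of the best possible ordering of the judged items: $NDCG@K = \frac{DCG@K}{IDCG@K}.$ It is useful when relevance labels are ordinal, such as bad / okay / good / excellent.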
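MAP is the mean over queries of average precision (AP), which averages precision at every relevant item’s rank: $AP = \frac{1}{R}\sum_{k=1}^{n} P@k \cdot \mathbf{1}\{item_k \text{ relevant}\},$ where $R$ is the number of relevant items for the query and $n$ is the length of the ranked list. MAP is common in information retrieval when many relevant documents exist, but it assumes binary relevance and can overweight queries with many judged positives.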
AUC measures pairwise ordering probability: $P(score^+ > score^-)$, the chance that a randomly chosen positive item is scored above a randomly chosen negative item. It is threshold-independent and useful for binary classifiers, but can be misleading for top-heavy ranking because errors at rank 1 and rank 10,000 may contribute similarly. For feeds and search, NDCG@K or recall@K is usually more aligned.
Large-scale retrieval commonly uses approximate nearest neighbor (ANN) indexes such as FAISS IVF-PQ, HNSW, ScaNN, or Annoy. Exact search may work up to roughly a few million to low tens of millions of vectors, depending on latency and hardware; at hundreds of millions or billions, ANN becomes necessary, trading some recall for lower latency and memory use, with index freshness as a further constraint.
Offline labels are often biased because they come from logged exposure data. Clicks, likes, hides, dwell time, and purchases reflect what the previous model showed, not all possible items. Corrective approaches include randomized logging buckets, inverse propensity scoring, counterfactual evaluation, or human relevance judgments.
Ranking metrics should match the user interaction model. For search, users scan from top to bottom, so NDCG@10, MRR, and success@K are natural. For infinite feeds, session-level value, long-term satisfaction, negative feedback, diversity, and creator ecosystem effects may matter more than item-level click prediction.
Always evaluate slices. Overall NDCG can improve while degrading new users, low-resource languages, cold-start creators, minority communities, or rare query classes. Meta-scale systems require metrics by geography, language, device, network quality, user tenure, content type, and safety-sensitive segments.
Offline metric movement does not guarantee online improvement. Distribution shift, feedback loops, exploration effects, latency regressions, and proxy-label gaming can break the relationship. A strong answer explains how to validate offline-online correlation using historical launches, interleaving tests, shadow deployments, or small A/B experiments.
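
The metric definitions above (P@K, R@K, reciprocal rank, DCG/NDCG, and AP) can be sketched in a few lines of Python. This is a minimal illustration, not a production evaluator: the function names and the toy document ids are made up for the example, and `ndcg_at_k` assumes that every judged item for the query appears in the list of gains it receives, so the ideal ordering can be obtained by sorting that list.

```python
import math
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Relevant items in the top k, divided by k."""
    return sum(1 for item in ranked[:k] if item in relevant) / k


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Relevant items in the top k, divided by all relevant items."""
    if not relevant:
        return 0.0
    return sum(1 for item in ranked[:k] if item in relevant) / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / position of the first relevant item; 0 if none is ranked.
    MRR is the mean of this value over all queries."""
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0


def dcg_at_k(gains: Sequence[int], k: int) -> float:
    """DCG@K with exponential gain (2^rel - 1) and log2(i + 1) discount."""
    return sum((2 ** g - 1) / math.log2(i + 1)
               for i, g in enumerate(gains[:k], start=1))


def ndcg_at_k(gains: Sequence[int], k: int) -> float:
    """DCG@K divided by the DCG@K of the ideal (descending) ordering.
    Assumption: `gains` holds every judged item for the query."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Mean of P@k taken at each rank k that holds a relevant item,
    normalized by the total number of relevant items R."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for k, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant)


# Toy query: five ranked results, three relevant documents (one never retrieved).
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d5"}

p_at_3 = precision_at_k(ranked, relevant, 3)   # 1 relevant in top 3 -> 1/3
r_at_5 = recall_at_k(ranked, relevant, 5)      # 2 of 3 relevant found -> 2/3
rr = reciprocal_rank(ranked, relevant)         # first relevant at rank 2 -> 0.5
ap = average_precision(ranked, relevant)       # (1/2 + 2/4) / 3 -> 1/3
```

Note how the unretrieved document `d5` lowers recall and AP but leaves P@3 and reciprocal rank untouched, which is one reason retrieval stages are usually judged on recall@K.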

Worked example

For “Evaluate Search Ranking Quality Offline”, a strong candidate would start by clarifying the search surface: is this people search, posts, groups, Marketplace, or general app search, and is the goal finding one exact result or browsing multiple relevant results? They would also ask what labels are available: human judgments, clicks, long clicks, follows, purchases, joins, hides, or query reformulations. The answer should be organized around four pillars: dataset construction, relevance labeling, metric choice, and validation against online outcomes. For dataset construction, they would sample representative queries, preserve production candidate sets, and include important slices such as head versus tail queries, languages, geography, and new versus returning users. For metric choice, they might propose MRR or success@1 for navigational queries, NDCG@10 for graded relevance, and recall@K if evaluating the retrieval stage. They would explicitly flag that click labels are position-biased: a top-ranked mediocre result may receive more clicks than a lower-ranked excellent result, so naive click-based NDCG can reward the old model’s exposure pattern. A concrete design decision is whether to use human relevance judgments, which are cleaner but expensive and may not capture personalization, or logged engagement labels, which are scalable but biased. They would close by saying that if they had more time, they would estimate offline-online correlation from past launches and run a small randomized or interleaving experiment to ensure the offline metric predicts user satisfaction.
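
The slice analysis described in this answer can be sketched as a per-slice comparison of two rankers. The field names (`slice`, `ndcg_baseline`, `ndcg_candidate`) and the per-query NDCG values are invented for illustration; in practice each row would come from an offline evaluation run over the sampled query set.

```python
from collections import defaultdict
from statistics import mean


def metric_by_slice(rows, metric_key, slice_key="slice"):
    """Average a per-query metric within each slice."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[slice_key]].append(row[metric_key])
    return {name: mean(values) for name, values in buckets.items()}


def slice_deltas(rows, baseline_key="ndcg_baseline",
                 candidate_key="ndcg_candidate", slice_key="slice"):
    """Candidate minus baseline, per slice."""
    base = metric_by_slice(rows, baseline_key, slice_key)
    cand = metric_by_slice(rows, candidate_key, slice_key)
    return {name: cand[name] - base[name] for name in base}


# Hypothetical per-query NDCG@10 values for a head/tail split.
rows = [
    {"slice": "head", "ndcg_baseline": 0.70, "ndcg_candidate": 0.80},
    {"slice": "head", "ndcg_baseline": 0.60, "ndcg_candidate": 0.72},
    {"slice": "head", "ndcg_baseline": 0.80, "ndcg_candidate": 0.86},
    {"slice": "tail", "ndcg_baseline": 0.50, "ndcg_candidate": 0.40},
]

overall_delta = (mean(r["ndcg_candidate"] for r in rows)
                 - mean(r["ndcg_baseline"] for r in rows))
deltas = slice_deltas(rows)

# Flag slices that regress even though the overall number improves.
regressed = sorted(name for name, delta in deltas.items() if delta < 0)
```

In this made-up data the overall NDCG rises while the tail slice drops, which is the pattern a candidate should call out before recommending a launch.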

A second angle

For “Compare Two Feed Ranking Models Using Offline Metrics”, the same evaluation ideas apply, but the framing changes because Feed is not a simple query-result page. Instead of one query with a small top-K list, the unit may be user-session, user-day, or impression, and relevance may include likes, comments, shares, dwell time, hides, reports, and downstream retention. NDCG can still help for top-of-feed ordering, but a candidate should discuss session-level utility and guardrails such as negative feedback, integrity violations, creator concentration, and latency. Logged-policy bias is even more severe because the existing ranker determines what users had the chance to engage with. A strong answer would therefore combine offline ranking metrics with counterfactual evaluation, slice analysis, and ultimately an online experiment.
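
One way to make the counterfactual evaluation step concrete is an inverse propensity scoring (IPS) estimate of the new ranker's expected reward from logs collected under the old ranker. The sketch below assumes the logs record the probability with which the logging policy showed each item (`logging_prob`) and that the candidate policy's probability for the same item (`target_prob`) can be computed offline. Both field names and all numbers are hypothetical, and real feed logs rarely have propensities this clean.

```python
def ips_estimates(logs, clip=None):
    """Return (IPS, self-normalized IPS) estimates of the target policy's
    mean reward from logged impressions.

    Each log entry needs: reward, logging_prob (> 0), target_prob.
    `clip` optionally caps importance weights to reduce variance,
    at the cost of some bias.
    """
    weights = []
    for entry in logs:
        weight = entry["target_prob"] / entry["logging_prob"]
        if clip is not None:
            weight = min(weight, clip)
        weights.append(weight)

    weighted_reward = sum(w * e["reward"] for w, e in zip(weights, logs))
    ips = weighted_reward / len(logs)
    # Self-normalized variant divides by the sum of weights instead of n.
    snips = weighted_reward / sum(weights) if sum(weights) > 0 else 0.0
    return ips, snips


# Hypothetical logged impressions; reward is 1 for a positive engagement.
logs = [
    {"reward": 1, "logging_prob": 0.5,  "target_prob": 0.25},  # weight 0.5
    {"reward": 1, "logging_prob": 0.25, "target_prob": 0.5},   # weight 2.0
    {"reward": 1, "logging_prob": 0.5,  "target_prob": 0.5},   # weight 1.0
    {"reward": 0, "logging_prob": 0.25, "target_prob": 0.25},  # weight 1.0
]

ips, snips = ips_estimates(logs)
ips_clipped, _ = ips_estimates(logs, clip=1.0)
```

The gap between the clipped and unclipped estimates is a useful warning sign: when a few large weights dominate, the offline number depends heavily on a handful of impressions and should be treated as weak evidence until an online experiment confirms it.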

Common pitfalls

A common analytical mistake is saying “use accuracy” for ranking. Accuracy assumes a classification threshold and ignores item order, so it does not answer whether the best items appear near the top. A better answer would choose NDCG@K, MRR, recall@K, or MAP depending on whether the task is graded relevance, first-good-result discovery, retrieval coverage, or multi-result relevance.
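
A small example makes the point about accuracy concrete. The two hypothetical models below have identical thresholded accuracy, yet one puts a relevant item at rank 1 and the other does not. Item names, scores, and the 0.5 threshold are invented for illustration.

```python
def accuracy(scores, labels, threshold=0.5):
    """Fraction of items whose thresholded prediction matches the label."""
    correct = sum(
        1 for item, score in scores.items()
        if (score >= threshold) == bool(labels[item])
    )
    return correct / len(scores)


def reciprocal_rank(scores, labels):
    """1 / rank of the first relevant item when sorted by descending score."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    for position, item in enumerate(ranked, start=1):
        if labels[item]:
            return 1.0 / position
    return 0.0


labels = {"a": 1, "b": 1, "c": 0, "d": 0}

# Model A ranks relevant item "a" first; model B ranks irrelevant "c" first.
model_a = {"a": 0.9, "c": 0.8, "b": 0.4, "d": 0.1}
model_b = {"c": 0.9, "a": 0.6, "b": 0.4, "d": 0.1}

acc_a, acc_b = accuracy(model_a, labels), accuracy(model_b, labels)
rr_a, rr_b = reciprocal_rank(model_a, labels), reciprocal_rank(model_b, labels)
# Both accuracies are 0.5, but the reciprocal ranks are 1.0 and 0.5.
```

Accuracy cannot distinguish the two models because it only checks which side of the threshold each item falls on, while the user's experience depends on what appears at the top.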

A common communication mistake is listing every metric without tying them to the product. Saying “we can use precision, recall, F1, AUC, NDCG, and MAP” sounds memorized. A stronger response says, for example, “for candidate retrieval I care most about recall@1000 under latency constraints; for the final search page I care about NDCG@10 and success@1 because users rarely scroll deeply.”

A common depth mistake is ignoring label bias. Many candidates treat clicks as ground truth, but clicks are affected by position, thumbnails, social proof, prior ranker exposure, and user intent. Interviewers expect you to mention position bias, missing-not-at-random labels, randomized exploration data, inverse propensity weighting, or human judgments as mitigation strategies.
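
As a rough sketch of how randomized exploration data feeds inverse propensity weighting, the code below estimates relative examination propensities per position from a bucket where result order was randomized, then turns them into click weights. It assumes randomization makes average relevance equal across positions, so click-through differences reflect position alone. The log format, the `floor` parameter, and the counts are hypothetical.

```python
from collections import Counter


def estimate_position_propensities(randomized_logs):
    """Relative examination propensity per position, normalized to position 1.

    `randomized_logs` is an iterable of (position, was_clicked) pairs from
    traffic where result order was randomized.
    """
    shown, clicked = Counter(), Counter()
    for position, was_clicked in randomized_logs:
        shown[position] += 1
        clicked[position] += int(was_clicked)

    ctr = {p: clicked[p] / shown[p] for p in shown}
    if ctr.get(1, 0) == 0:
        raise ValueError("need at least one click at position 1 to normalize")
    return {p: ctr[p] / ctr[1] for p in sorted(ctr)}


def debiased_click_weight(position, propensities, floor=0.05):
    """Inverse propensity weight for a click observed at `position`.
    The floor caps the weight so rarely examined positions do not dominate."""
    return 1.0 / max(propensities[position], floor)


# Hypothetical randomized-bucket logs: CTR 0.5, 0.25, 0.125 at positions 1-3.
randomized_logs = (
    [(1, True)] * 2 + [(1, False)] * 2
    + [(2, True)] * 1 + [(2, False)] * 3
    + [(3, True)] * 1 + [(3, False)] * 7
)

propensities = estimate_position_propensities(randomized_logs)
weight_at_3 = debiased_click_weight(3, propensities)
```

With these weights, a click at position 3 counts four times as much as a click at position 1 when building click-based relevance labels, which partly offsets the old ranker's exposure advantage for whatever it placed first.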

Connections

Interviewers may pivot from offline ranking metrics into experimentation, especially how to validate an offline metric with A/B tests or interleaving. They may also ask about recommender systems, ANN retrieval with embeddings, counterfactual evaluation, calibration, diversity/fairness constraints, or long-term metric tradeoffs such as engagement versus user well-being.