Retrieval, Candidate Generation, And Ranking Cases

What's being tested

Interviewers are probing whether you can design and evaluate large-scale recommendation/search systems where the core challenge is narrowing billions of possible items into a small, high-quality ranked set under latency, freshness, and business constraints. They are not looking for generic “use ML to rank posts” answers; they want to see whether you understand the retrieval → candidate generation → ranking → re-ranking pipeline and where different models, metrics, and tradeoffs belong. For Meta, this matters because products like Feed, Reels, Stories, Ads, Marketplace, Groups, Search, and People You May Know all depend on matching users to content efficiently and safely. A strong Data Scientist answer combines system design, metric selection, experimentation, bias awareness, and product judgment.

Core knowledge

Large-scale ranking systems are usually multi-stage: source retrieval creates thousands to millions of possible candidates, candidate generation reduces them to hundreds/thousands, ranking scores them with heavier models, and re-ranking applies business rules, diversity, freshness, or integrity constraints.
Retrieval optimizes recall under tight latency. Common sources include social graph edges, followed entities, collaborative filtering, embedding similarity, trending content, recent interactions, geographic proximity, and explicit query matching. The goal is not perfect ordering; it is avoiding false negatives early.
Approximate nearest neighbor (ANN) search is central for embedding-based retrieval. Libraries such as FAISS and ScaNN, and index techniques such as HNSW graphs, inverted file (IVF) indexes, and product quantization, trade exactness for speed. Brute-force dot product may work up to millions of vectors; at hundreds of millions or billions, ANN is usually required.
Two-tower models are common for candidate generation: one tower embeds users/queries, the other embeds items, and relevance is approximated by $s(u,i)=e_u^\top e_i$ or cosine similarity. They scale well because item embeddings can be precomputed, but they capture fewer cross-features than deep ranking models.
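As a minimal sketch of the scoring step, assuming the item tower has already been run offline and its outputs stored, the code below scores a user embedding against every item with $s(u,i)=e_u^\top e_i$ and keeps the top K. The embedding values, the `ITEM_EMBEDDINGS` table, and the function names are hypothetical; a production system would replace the exhaustive scan with an ANN index.

```python
import heapq

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def top_k_by_dot_product(user_embedding, item_embeddings, k):
    """Return the k (item_id, score) pairs with the highest dot-product score."""
    scored = (
        (item_id, dot(user_embedding, emb))
        for item_id, emb in item_embeddings.items()
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

# Hypothetical precomputed item-tower outputs (illustrative values only).
ITEM_EMBEDDINGS = {
    "post_a": [0.9, 0.1],
    "post_b": [0.2, 0.8],
    "post_c": [0.7, 0.7],
}

# Hypothetical user-tower output computed at request time.
user_embedding = [1.0, 0.0]

candidates = top_k_by_dot_product(user_embedding, ITEM_EMBEDDINGS, k=2)
# candidates holds post_a (score 0.9) followed by post_c (score 0.7)
```

Because only the user embedding is computed per request, the per-item cost is a single dot product, which is what makes this stage cheap enough to run over very large inventories.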
Ranking models can be much heavier because they score fewer items. Common choices include GBDT/XGBoost, logistic regression with feature crosses, Wide & Deep, DeepFM, DLRM-style architectures, transformers for text/video, or learning-to-rank methods such as LambdaMART.
Ranking objectives should reflect the product goal. For binary engagement, cross-entropy is common: $L=-\sum_i \left[y_i\log p_i+(1-y_i)\log(1-p_i)\right].$ For ordered lists, use pairwise/listwise losses and evaluate with NDCG, MAP, MRR, Recall@K, Precision@K, or expected utility.
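The cross-entropy loss can be checked numerically with a few lines. This is a sketch rather than a training implementation: the negative sign applies to both terms of each summand, and the `eps` clipping is an added assumption to avoid taking the logarithm of zero.

```python
import math

def binary_cross_entropy(labels, probs, eps=1e-12):
    """Summed binary cross-entropy: -sum(y*log(p) + (1-y)*log(1-p))."""
    total = 0.0
    for y, p in zip(labels, probs):
        # Clip predictions away from 0 and 1 so math.log stays finite (assumption).
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total

# A confident correct prediction costs less than a confident wrong one.
loss_good = binary_cross_entropy([1, 0], [0.9, 0.1])
loss_bad = binary_cross_entropy([1, 0], [0.1, 0.9])
```

Note that this returns the sum over examples, matching the formula; many libraries report the mean instead, which differs only by a factor of the batch size.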
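NDCG rewards putting the best items near the top: $DCG@K=\sum_{i=1}^K \frac{rel_i}{\log_2(i+1)}, \quad NDCG@K=\frac{DCG@K}{IDCG@K}.$ It is useful when relevance has graded labels derived from signals such as click, like, comment, share, or long watch; negative signals such as hide usually need to be mapped to a low or zero grade, since the gain is assumed to be non-negative.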
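A small sketch of the NDCG formula, assuming non-negative graded relevance values and assuming the ideal ordering (IDCG) is taken over the same list of items that was ranked. The function names are illustrative.

```python
import math

def dcg_at_k(relevances, k):
    """DCG@K with linear gain: sum of rel_i / log2(i + 1) for positions 1..K."""
    return sum(
        rel / math.log2(i + 1)
        for i, rel in enumerate(relevances[:k], start=1)
    )

def ndcg_at_k(relevances, k):
    """NDCG@K = DCG@K / IDCG@K, where IDCG sorts the same items by relevance."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Relevance grades listed in the order the ranker showed the items.
perfect_order = ndcg_at_k([3, 2, 1], k=3)   # 1.0, already ideally ordered
reversed_order = ndcg_at_k([1, 2, 3], k=3)  # below 1.0, best item is last
```

The logarithmic discount is what makes a mistake in slot 1 cost more than the same mistake in slot 10, which is the property AUC lacks.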
Offline metrics are necessary but insufficient. AUC or log loss may improve while session time, long-term retention, creator ecosystem health, or user satisfaction worsens. Meta-style answers should distinguish offline validation, online A/B tests, guardrail metrics, and long-term ecosystem effects.
Position bias and selection bias are major pitfalls. Training labels come from previously ranked impressions, so observed clicks are not unbiased relevance labels. Techniques include randomized exploration buckets, inverse propensity weighting, counterfactual learning-to-rank, debiasing by position, and interleaving experiments.
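To make inverse propensity weighting concrete, the sketch below assumes a simple position-based click model in which a click happens only if the slot is examined and the item is relevant, so $P(click)=P(examined \mid position)\cdot relevance$. The examination propensities are hypothetical numbers; in practice they would be estimated, for example from randomized exploration traffic.

```python
def naive_ctr(impressions):
    """Raw click-through rate over (position, clicked) impressions."""
    if not impressions:
        return 0.0
    return sum(clicked for _, clicked in impressions) / len(impressions)

def ipw_relevance_estimate(impressions, examination_propensity):
    """Average of clicked / P(examined | position) over the impressions."""
    if not impressions:
        return 0.0
    total = sum(
        clicked / examination_propensity[position]
        for position, clicked in impressions
    )
    return total / len(impressions)

# Hypothetical propensities: slot 2 is examined half as often as slot 1.
PROPENSITY = {1: 1.0, 2: 0.5}

# Item A was always shown in slot 1, item B always in slot 2.
item_a_logs = [(1, 1), (1, 1), (1, 0), (1, 0)]
item_b_logs = [(2, 1), (2, 0), (2, 0), (2, 0)]

naive_a, naive_b = naive_ctr(item_a_logs), naive_ctr(item_b_logs)  # 0.5 vs 0.25
ipw_a = ipw_relevance_estimate(item_a_logs, PROPENSITY)            # 0.5
ipw_b = ipw_relevance_estimate(item_b_logs, PROPENSITY)            # 0.5
```

Raw CTR makes item B look half as good as item A, while the weighted estimate attributes the gap to slot position. The usual cost is variance: small propensities produce large weights, so clipping or normalization is often added.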
Re-ranking often enforces constraints after ML scoring: remove policy-violating content, cap repeated authors, promote source diversity, insert ads, respect freshness, downrank clickbait, or satisfy inventory contracts. This can improve product health but may reduce short-term engagement metrics.
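One of the constraints above, capping repeated authors, can be sketched as a greedy pass over the model-ranked list. The item fields (`id`, `author`, `score`) and the cap value are hypothetical; real re-rankers typically combine several such rules.

```python
def rerank_with_author_cap(ranked_items, max_per_author, k):
    """Walk the score-sorted list and skip items once an author hits the cap."""
    shown_per_author = {}
    result = []
    for item in ranked_items:  # assumed already sorted by score, descending
        author = item["author"]
        if shown_per_author.get(author, 0) >= max_per_author:
            continue  # author already has the maximum number of slots
        shown_per_author[author] = shown_per_author.get(author, 0) + 1
        result.append(item)
        if len(result) == k:
            break
    return result

# Hypothetical ranker output, sorted by predicted score.
RANKED = [
    {"id": "p1", "author": "alice", "score": 0.9},
    {"id": "p2", "author": "alice", "score": 0.8},
    {"id": "p3", "author": "alice", "score": 0.7},
    {"id": "p4", "author": "bob", "score": 0.6},
    {"id": "p5", "author": "carol", "score": 0.5},
]

feed = rerank_with_author_cap(RANKED, max_per_author=2, k=4)
# feed contains p1, p2, p4, p5: p3 is skipped despite outscoring p4 and p5
```

The example shows the tradeoff directly: the final list has a lower total predicted score than the unconstrained top four, in exchange for more author diversity.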
Latency budgets shape architecture. Retrieval may need to finish in tens of milliseconds and ranking within roughly 100–300 ms, depending on the surface; caching, feature stores, batching, model distillation, and approximate computation help meet those budgets. Freshness-sensitive surfaces need streaming features, not only daily batch pipelines.
Candidate generation should be evaluated separately from ranking. Key metrics include candidate Recall@K against known positive items, source contribution, duplication rate, freshness distribution, coverage across creators/items, and incremental lift. A great ranker cannot recover items that retrieval never surfaced.
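Candidate Recall@K can be computed per user or per request as the share of known positive items that appear in the first K retrieved candidates. The sketch below assumes positives are a set of item IDs the user later engaged with; the IDs are made up.

```python
def recall_at_k(candidates, positives, k):
    """Fraction of known positive items found in the top-k candidates."""
    positives = set(positives)
    if not positives:
        return 0.0
    retrieved = set(candidates[:k])
    return len(retrieved & positives) / len(positives)

# Hypothetical retrieval output and held-out positives for one user.
retrieved_ids = ["a", "b", "c", "d"]
engaged_ids = {"b", "d", "z"}

recall_3 = recall_at_k(retrieved_ids, engaged_ids, k=3)  # only "b" is in the top 3
recall_4 = recall_at_k(retrieved_ids, engaged_ids, k=4)  # "b" and "d" are found
```

Item "z" was never retrieved, so no value of K recovers it; that missing third is the ceiling the downstream ranker inherits.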

Worked example

For “Design a ranking system for Facebook News Feed”, a strong candidate would first clarify the objective: are we optimizing meaningful social interactions, time well spent, session retention, ad revenue, or reducing negative feedback? They would also ask about constraints: latency budget, available labels, whether ranking is personalized, and what content types are included, such as friend posts, Groups, Pages, Reels, and ads. The answer should be organized around four pillars: candidate sourcing, feature/model design, evaluation, and re-ranking/constraints.

For candidate sourcing, they might describe multiple retrieval channels: recent friend posts, Groups activity, followed Pages, embedding-similar content, historically engaging authors, and fresh/trending content. For ranking, they would propose predicting multiple outcomes such as click, like, comment, share, dwell time, hide, report, and long-session retention, then combining them into a utility score like $U=w_1P(comment)+w_2P(share)+w_3E(dwell)-w_4P(hide)-w_5P(report)$. For evaluation, they would separate offline metrics like NDCG@K and calibration from online A/B metrics such as daily active usage, meaningful interactions, negative feedback, and retention.
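The utility score can be illustrated with a short sketch. The weights and the per-post predictions below are invented for illustration and are not values from any real system; in an interview, the point is to explain how such weights would be chosen and validated online.

```python
# Hypothetical weights w_1..w_5; negative outcomes are subtracted.
WEIGHTS = {"comment": 4.0, "share": 6.0, "dwell": 0.1, "hide": 20.0, "report": 100.0}

def utility(pred, weights=WEIGHTS):
    """U = w1*P(comment) + w2*P(share) + w3*E(dwell) - w4*P(hide) - w5*P(report)."""
    return (
        weights["comment"] * pred["p_comment"]
        + weights["share"] * pred["p_share"]
        + weights["dwell"] * pred["expected_dwell_seconds"]
        - weights["hide"] * pred["p_hide"]
        - weights["report"] * pred["p_report"]
    )

# Hypothetical model outputs for two posts.
friend_post = {"p_comment": 0.10, "p_share": 0.05, "expected_dwell_seconds": 10.0,
               "p_hide": 0.01, "p_report": 0.0}
clickbait_post = {"p_comment": 0.10, "p_share": 0.05, "expected_dwell_seconds": 10.0,
                  "p_hide": 0.05, "p_report": 0.01}

u_friend = utility(friend_post)        # 0.4 + 0.3 + 1.0 - 0.2 - 0.0
u_clickbait = utility(clickbait_post)  # same engagement, larger penalties
```

With identical positive predictions, the post with higher hide and report probabilities ranks lower, which is how negative feedback enters the ordering without a separate filter.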

One explicit tradeoff to flag is engagement versus user well-being: optimizing only click probability can over-rank sensational or low-quality content, so guardrails and integrity classifiers must be part of the design. They should also mention freshness and diversity, because a feed with ten posts from the same friend may score well individually but feel repetitive. A strong close would be: “If I had more time, I’d discuss counterfactual bias from logged rankings, exploration traffic to improve training data, and long-term creator/user ecosystem metrics.”

A second angle

For “Design candidate generation for People You May Know”, the same architecture applies, but the constraints shift from content ranking to graph-based retrieval and trust/safety. Candidate sources would include mutual friends, workplace/school networks, contact imports, location overlap, group co-membership, and embedding similarity over graph neighborhoods. The ranking model would optimize friend request send probability, accept probability, downstream interaction, and negative outcomes like blocks, reports, or privacy complaints. Unlike Feed, freshness is less central, while privacy, creepiness, and explanation quality are critical; “because you both know Alice” may be more acceptable than opaque location-based recommendations. Evaluation should include not only accepted friendships but also long-term interaction quality and guardrails around spam, harassment, and sensitive attribute leakage.

Common pitfalls

An analytical mistake is treating ranking as a single binary classification problem and stopping at AUC. AUC can improve while top-of-feed quality declines, because it weights ordering errors equally across the whole list while users see only the first few slots. Better answers discuss top-K metrics such as NDCG@K, Recall@K for retrieval, calibration, and online business/product metrics.

A communication mistake is jumping straight into “train a neural network” without defining the objective and unit of ranking. Interviewers want to know whether you are ranking posts, creators, ads, people, products, or search results; each has different labels, constraints, and failure modes. Start with the product goal, user action, inventory, and latency assumptions.

A depth mistake is ignoring bias in logged data. If historical rankers determined what users saw, then clicks are confounded by position, exposure, and prior model decisions. A stronger answer mentions randomized exploration, position debiasing, inverse propensity weighting, or at least the limitation that offline labels reflect exposure rather than true relevance.

Connections

Interviewers may pivot from this topic into experimentation, especially A/B testing ranking changes with network effects, novelty effects, and guardrail metrics. They may also ask about causal inference for biased logs, feature engineering and feature stores, cold start recommendations, marketplace/search quality, or responsible AI concerns such as fairness, privacy, and integrity enforcement.