Recommender, Ranking, and Ads Systems

What's being tested

Interviewers are probing whether you can reason about personalized ranking systems under real business, product, and statistical constraints. For Meta, this means understanding how feeds, recommendations, notifications, and ads choose from billions of candidate items while optimizing user value, advertiser value, integrity, and long-term ecosystem health. The goal is not to recite “collaborative filtering” or “CTR prediction,” but to show that you can define the objective, choose metrics, diagnose model/system failures, and reason about experimentation tradeoffs. Strong candidates connect modeling choices to product outcomes: engagement, retention, revenue, creator health, ad quality, fairness, and user trust.

Core knowledge

Large-scale ranking systems are usually multi-stage: candidate generation (retrieval), lightweight pre-ranking, heavy ranking, re-ranking, and policy filtering. Candidate generation may reduce billions of items to thousands; ranking reduces thousands to tens. Each stage makes a different latency-quality tradeoff: early stages must be cheap per item because they see many items, while later stages can afford heavier models because they see few.
Retrieval often uses approximate nearest neighbor (ANN) search over embeddings, built on index techniques such as HNSW graphs or product quantization and available in libraries such as FAISS and ScaNN. Exact nearest-neighbor search is feasible for small corpora, but for tens or hundreds of millions of items, approximate retrieval is typically needed to meet sub-100 ms latency budgets.
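As a point of reference, the exact search that ANN indexes approximate can be sketched in a few lines. This is a minimal illustration, assuming inner-product similarity and a small in-memory dictionary of item embeddings; the function name `exact_top_k` and the toy vectors are hypothetical, and a production system would replace the linear scan with an ANN index.

```python
import heapq

def exact_top_k(query, item_embeddings, k):
    """Brute-force retrieval: score every item by inner product, keep the top k.

    Cost grows linearly with the corpus, which is why large corpora
    switch to approximate indexes.
    """
    scored = (
        (sum(q * x for q, x in zip(query, vec)), item_id)
        for item_id, vec in item_embeddings.items()
    )
    return [item_id for _, item_id in heapq.nlargest(k, scored)]

# Toy corpus (assumed values, for illustration only).
items = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.7, 0.7]}
top_two = exact_top_k([1.0, 0.2], items, k=2)
```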
Collaborative filtering learns from user-item interactions. Matrix factorization approximates an interaction matrix $R$ as $R \approx U V^\top$, where $U$ and $V$ are user and item embeddings. It works well for dense behavioral data but struggles with cold-start users, new content, and shifting interests.
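The factorization $R \approx U V^\top$ can be fit in several ways; one common option is stochastic gradient descent on the observed entries only. The sketch below assumes explicit ratings, L2 regularization, and arbitrary hyperparameters; the names `factorize` and `mse` and the toy ratings are illustrative, not taken from any particular library.

```python
import random

def factorize(ratings, n_users, n_items, dim=2, lr=0.05, reg=0.01, epochs=200, seed=0):
    """Fit R ~ U V^T by SGD over observed (user, item, rating) triples."""
    rng = random.Random(seed)
    U = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n_users)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - sum(a * b for a, b in zip(U[u], V[i]))
            for k in range(dim):
                u_k, v_k = U[u][k], V[i][k]
                # Gradient step on squared error with L2 regularization.
                U[u][k] += lr * (err * v_k - reg * u_k)
                V[i][k] += lr * (err * u_k - reg * v_k)
    return U, V

def mse(U, V, ratings):
    """Mean squared error on the observed entries."""
    total = 0.0
    for u, i, r in ratings:
        total += (r - sum(a * b for a, b in zip(U[u], V[i]))) ** 2
    return total / len(ratings)

# Toy data: 3 users, 3 items, 6 observed ratings (assumed values).
ratings = [(0, 0, 5.0), (0, 1, 4.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 1.0), (2, 2, 5.0)]
U0, V0 = factorize(ratings, 3, 3, epochs=0)  # untrained baseline
U, V = factorize(ratings, 3, 3)
```

Note that unobserved entries never enter the loss, so a user or item with no interactions keeps its random initial embedding, which is the cold-start weakness in concrete form.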
Modern recommender systems use supervised learning to predict multiple outcomes: click, like, comment, share, hide, dwell time, conversion, or report. A simple score might be
$Score = w_1 P(click) + w_2 P(comment) + w_3 E(dwell) - w_4 P(hide) - w_5 P(report).$
The hard part is choosing weights aligned with long-term value.
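To make the weighted score concrete, here is a minimal sketch that applies it to two hypothetical posts. The weights and predicted values are invented for illustration; in practice they would come from experiments and product judgment, not from this example.

```python
def utility_score(preds, weights):
    """Weighted sum of predicted positive outcomes minus predicted negative ones."""
    positive = (
        weights["click"] * preds["p_click"]
        + weights["comment"] * preds["p_comment"]
        + weights["dwell"] * preds["e_dwell"]
    )
    negative = weights["hide"] * preds["p_hide"] + weights["report"] * preds["p_report"]
    return positive - negative

# Assumed weights: negative feedback is rare, so it carries large weights.
weights = {"click": 1.0, "comment": 4.0, "dwell": 0.05, "hide": 20.0, "report": 100.0}

posts = {
    "clickbait": {"p_click": 0.30, "p_comment": 0.01, "e_dwell": 5.0,
                  "p_hide": 0.05, "p_report": 0.01},
    "quality": {"p_click": 0.12, "p_comment": 0.03, "e_dwell": 30.0,
                "p_hide": 0.005, "p_report": 0.0005},
}

ranked = sorted(posts, key=lambda name: utility_score(posts[name], weights), reverse=True)
```

With these assumed numbers the post with the higher click probability ranks second, because its hide and report penalties outweigh its click term.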
Ads ranking often combines advertiser bid, estimated action rate, and an ad quality term that stands in for user value. A simplified auction score is
$AdRank = Bid \times P(conversion \mid user, ad, context) \times Quality.$
Optimizing only revenue can degrade user experience; optimizing only click-through rate can favor clickbait or low-quality ads.
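A small sketch of the simplified score above shows how the quality term can change the winner. The `AdCandidate` class and all numbers are hypothetical; real auctions use more elaborate value functions.

```python
from dataclasses import dataclass

@dataclass
class AdCandidate:
    ad_id: str
    bid: float           # advertiser bid per conversion
    p_conversion: float  # predicted P(conversion | user, ad, context)
    quality: float       # quality multiplier; lower means worse user experience

def ad_rank(ad):
    """Simplified score: Bid x P(conversion) x Quality."""
    return ad.bid * ad.p_conversion * ad.quality

candidates = [
    AdCandidate("A", bid=10.0, p_conversion=0.02, quality=0.5),
    AdCandidate("B", bid=4.0, p_conversion=0.03, quality=1.0),
]
winner = max(candidates, key=ad_rank)
```

Ad A bids 2.5 times as much as ad B, but its low quality multiplier drops its score to about 0.10 against B's 0.12, so B wins.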
Offline metrics include AUC, log loss, calibration error, precision@k, recall@k, NDCG, MAP, and hit rate. Ranking metrics matter because top positions dominate user attention. NDCG is common when relevance is graded:
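$DCG@k = \sum_{i=1}^k \frac{rel_i}{\log_2(i+1)}.$
NDCG normalizes this by the DCG of the ideal ordering, $NDCG@k = DCG@k / IDCG@k$, so that for non-negative relevance grades the score lies in $[0, 1]$ and is comparable across queries.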
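The DCG formula translates directly into code. The sketch below also computes NDCG by dividing by the DCG of the ideal (relevance-sorted) ordering; the function names are illustrative.

```python
import math

def dcg_at_k(rels, k):
    """DCG@k = sum over positions i = 1..k of rel_i / log2(i + 1)."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(rels[:k], start=1))

def ndcg_at_k(rels, k):
    """DCG@k divided by the DCG@k of the ideal ordering; 0.0 if nothing is relevant."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

# Graded relevance of items in ranked order (assumed labels).
best_order = ndcg_at_k([3, 2, 1], k=3)
worst_order = ndcg_at_k([1, 2, 3], k=3)
```

Because of the logarithmic discount, placing the most relevant item last costs far more than swapping two items lower in the list, which is why `worst_order` falls below 1 here.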
Offline performance does not guarantee online lift. A model with better AUC may hurt engagement if it improves ranking for low-value segments, is miscalibrated at the top of the distribution, or changes item diversity. Always validate with A/B tests on product metrics.
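One way to check for miscalibration at the top of the score distribution is to compare mean predicted probability with the observed rate in each score bucket. This is a minimal sketch with equal-width buckets and made-up data; the name `calibration_table` is hypothetical.

```python
def calibration_table(preds, labels, n_buckets=10):
    """Return (bucket, mean_prediction, observed_rate, count) for each non-empty bucket."""
    sums = [[0.0, 0.0, 0] for _ in range(n_buckets)]
    for p, y in zip(preds, labels):
        b = min(int(p * n_buckets), n_buckets - 1)  # keep p == 1.0 in the last bucket
        sums[b][0] += p
        sums[b][1] += y
        sums[b][2] += 1
    return [(b, sp / n, sy / n, n) for b, (sp, sy, n) in enumerate(sums) if n > 0]

# Toy data: the high-score bucket predicts 0.9 but only half the items convert.
table = calibration_table([0.1, 0.1, 0.9, 0.9], [0, 0, 1, 0], n_buckets=2)
```

A model can have a good AUC and still show a gap like this in its top bucket, and that bucket is where the ranker and any auction actually consume the scores.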
Position bias and selection bias are central. Users only interact with items they are shown, and they are more likely to interact with items shown near the top, so observed clicks are not unbiased labels. Common mitigations include inverse propensity scoring (IPS), randomized exploration buckets, interleaving experiments, and logging the propensity of each displayed item so that counterfactual estimates are possible later.
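Inverse propensity scoring can be sketched as follows: each logged reward is reweighted by the ratio of the target policy's probability of showing that item to the logging policy's probability. The log format, the uniform logging policy, and the policy function are assumptions made for this example.

```python
def ips_estimate(logs, target_prob):
    """Estimate a target policy's average reward from logged data.

    logs: list of (context, action, reward, logging_propensity).
    target_prob(context, action): probability the target policy shows `action`.
    """
    total = 0.0
    for context, action, reward, propensity in logs:
        total += (target_prob(context, action) / propensity) * reward
    return total / len(logs)

# Logged by a policy that showed item "a" or "b" uniformly at random (propensity 0.5).
logs = [("u1", "a", 1, 0.5), ("u2", "b", 0, 0.5), ("u3", "a", 0, 0.5), ("u4", "b", 1, 0.5)]

def always_a(context, action):
    """Hypothetical target policy that always shows item "a"."""
    return 1.0 if action == "a" else 0.0

estimate = ips_estimate(logs, always_a)
```

Here the estimate equals the average reward of the impressions where "a" was shown. The estimator only works when the logging policy gave every relevant item a nonzero, recorded probability, and small propensities make the estimate high-variance.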
Cold start requires separate strategies. For new users, use onboarding signals, geo/language/device context, popularity priors, and exploration. For new items, use content embeddings from text, image, video, creator metadata, or early engagement velocity. Avoid over-penalizing new content before it has exposure.
Exploration-exploitation tradeoffs are unavoidable. Pure exploitation over-serves known high-performing content and can create filter bubbles. Common approaches include $\epsilon$-greedy, Thompson sampling, contextual bandits, diversity constraints, and explicit exploration quotas for new items or creators.
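The simplest of these approaches, $\epsilon$-greedy, fits in a few lines. This sketch assumes a dictionary of current engagement estimates per item; the item names and values are invented.

```python
import random

def epsilon_greedy(estimated_ctr, epsilon, rng):
    """With probability epsilon pick a uniformly random item, otherwise the current best."""
    if rng.random() < epsilon:
        return rng.choice(list(estimated_ctr))        # explore
    return max(estimated_ctr, key=estimated_ctr.get)  # exploit

estimates = {"post_a": 0.12, "post_b": 0.08, "new_post": 0.0}
rng = random.Random(0)

exploit_pick = epsilon_greedy(estimates, epsilon=0.0, rng=rng)
explore_picks = [epsilon_greedy(estimates, epsilon=1.0, rng=rng) for _ in range(20)]
```

With `epsilon=0.0` the new post, whose estimate is still zero, is never shown and can never accumulate evidence; any positive epsilon gives it some exposure at a cost to short-term engagement.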
Multi-objective optimization is the norm. Meta-like systems balance engagement, session quality, retention, revenue, creator ecosystem health, integrity, and negative feedback. A sophisticated answer distinguishes guardrail metrics from primary objectives and discusses how metric weights are set through experiments and product judgment.
Recommender failures are often systemic, not just model errors: feedback loops, popularity bias, creator starvation, ad fatigue, duplicate content, distribution shift, delayed labels, bot activity, and policy violations. A strong diagnosis checks data, model calibration, ranking logic, serving latency, logging, and experiment validity.

Worked example

Design a News Feed ranking system

A strong candidate should start by clarifying the product goal: are we optimizing meaningful engagement, time spent, retention, content quality, or reducing negative experiences? They should also ask about constraints: latency budget, available signals, whether this is for new users or all users, and whether integrity filters are handled before or after ranking.

The answer can be organized around four pillars: candidate generation, ranking model, re-ranking/business rules, and evaluation. For candidate generation, they might combine social graph sources, followed pages, groups, recently popular posts, and embedding-based retrieval. For ranking, they would predict multiple outcomes such as click, comment, share, dwell time, hide, and report, then combine them into a utility function with penalties for low-quality or harmful content. For re-ranking, they should mention diversity, freshness, author repetition caps, policy enforcement, and exploration for new content.

A key tradeoff to flag is engagement versus long-term user value: optimizing short-term clicks may increase sensational content, while stronger quality penalties may reduce immediate engagement but improve retention and trust. They should close by saying that, with more time, they would discuss counterfactual logging, calibration, online experimentation, and how to monitor ecosystem effects on creators and content diversity.
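One of the re-ranking rules mentioned above, the author repetition cap, could be sketched as a greedy pass over the score-sorted list. The function name, the backfill behavior, and the sample feed are assumptions for illustration.

```python
def rerank_with_author_cap(ranked_posts, max_per_author, k):
    """Greedy re-rank: keep score order but cap how many posts each author gets.

    ranked_posts: list of (post_id, author_id), already sorted by score descending.
    Posts over the cap are deferred and used only to backfill if slots remain.
    """
    counts = {}
    selected, deferred = [], []
    for post_id, author in ranked_posts:
        if counts.get(author, 0) < max_per_author:
            selected.append(post_id)
            counts[author] = counts.get(author, 0) + 1
        else:
            deferred.append(post_id)
    return (selected + deferred)[:k]

feed = [("p1", "alice"), ("p2", "alice"), ("p3", "alice"), ("p4", "bob"), ("p5", "carol")]
top_four = rerank_with_author_cap(feed, max_per_author=2, k=4)
```

The third post from the same author is pushed behind lower-scored posts from other authors, trading a little predicted utility for diversity.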

A second angle

Design an ads ranking system

The same ranking principles apply, but the objective now includes advertiser value and auction mechanics in addition to user experience. Instead of ranking organic posts purely by predicted user utility, ads ranking often uses bid, estimated action rate, and quality adjustments to determine which ad wins and what price is charged. The candidate should distinguish CTR prediction from conversion prediction, since an advertiser optimizing purchases cares about downstream value rather than clicks. Constraints are also different: ad fatigue, budget pacing, targeting eligibility, privacy restrictions, and marketplace liquidity matter. A strong answer explicitly protects users with ad quality scores, frequency caps, negative feedback penalties, and guardrails on feed satisfaction or retention.
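The text above does not say which pricing rule is used, and real ad platforms differ. As one textbook illustration only, a generalized second-price rule charges the winner the smallest bid that would still have beaten the runner-up's score. The sketch below assumes that rule, the simplified score bid × P(conversion) × quality, and hypothetical ads; it also assumes the winner's conversion probability and quality are both nonzero.

```python
def run_second_price_auction(ads):
    """ads: list of dicts with keys ad_id, bid, p_conversion, quality.

    Returns (winner_id, price_per_conversion) under an assumed
    generalized second-price rule. This is an illustration, not a
    description of any specific platform's auction.
    """
    def score(ad):
        return ad["bid"] * ad["p_conversion"] * ad["quality"]

    ordered = sorted(ads, key=score, reverse=True)
    winner = ordered[0]
    runner_up_score = score(ordered[1]) if len(ordered) > 1 else 0.0
    # Smallest bid at which the winner's score still matches the runner-up's.
    price = runner_up_score / (winner["p_conversion"] * winner["quality"])
    return winner["ad_id"], price

ads = [
    {"ad_id": "A", "bid": 10.0, "p_conversion": 0.02, "quality": 0.5},
    {"ad_id": "B", "bid": 4.0, "p_conversion": 0.03, "quality": 1.0},
]
winner_id, price = run_second_price_auction(ads)
```

Under this rule the winner pays less than its bid (about 3.33 against a bid of 4.0 here), and a higher quality multiplier lowers the price an ad needs to pay to win.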

Common pitfalls

Analytical mistake: optimizing the wrong metric. A tempting answer is “rank by predicted CTR” because CTR is easy to measure and model. That is incomplete: CTR can reward clickbait, low-quality ads, or shallow engagement. A better answer defines a utility function with positive and negative outcomes, then validates it against long-term metrics such as retention, satisfaction surveys, conversion quality, and hide/report rates.

Communication mistake: jumping straight to algorithms. Candidates often start with “I would use deep learning embeddings and XGBoost” before defining the product goal or ranking surface. Interviewers want to see problem framing first: what is being ranked, for whom, under what constraints, and what success means. The model choice should follow from the objective and data, not lead the answer.

Depth mistake: ignoring bias in logged data. Recommender labels are not randomly observed; they are generated by the previous ranking system. Treating observed clicks as ground truth can amplify popularity bias and make offline evaluation misleading. A stronger answer mentions position bias, exploration data, propensity logging, counterfactual evaluation, and online A/B testing.

Connections

Interviewers may pivot from ranking design to experimentation, especially A/B testing, guardrail metrics, network effects, and novelty effects. They may also push into causal inference for recommender systems, marketplace dynamics for ads auctions, or machine learning system design topics such as feature freshness, model calibration, and serving latency.