Ranking, Recommender, and Personalization Systems

What's being tested

Interviewers are probing whether you can reason about ranking and personalization as an end-to-end product system, not just as a modeling problem. For a Meta Data Scientist, the core skill is translating ambiguous user/business goals into measurable objectives, understanding how recommender models are trained and evaluated, and anticipating feedback loops, bias, and experimentation issues. Strong answers show you can balance engagement, quality, integrity, creator ecosystem health, latency, and long-term retention. The interviewer is usually testing whether you can choose the right metric under constraints, diagnose ranking failures, and explain tradeoffs clearly to product, engineering, and ML partners.

Core knowledge

Most large-scale recommenders use a multi-stage architecture: candidate generation, lightweight filtering, ranking, re-ranking, and post-processing. Candidate generation narrows billions of items to hundreds or thousands using retrieval models; ranking applies heavier models; re-ranking enforces diversity, policy, freshness, or inventory constraints.
Candidate generation methods include collaborative filtering, item-to-item similarity, graph-based retrieval, two-tower neural networks, and approximate nearest neighbor search. A two-tower model embeds users and items separately and retrieves by dot product: $s(u, i) = e_u^\top e_i$. ANN systems like FAISS or ScaNN are used once exact search becomes too slow, often beyond millions of candidate vectors.
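As a minimal sketch of the retrieval step, the snippet below scores every item by dot product against a user embedding and keeps the top k. The embeddings, item names, and the `retrieve_top_k` helper are made up for illustration; a production system would replace the exhaustive scan with an ANN index.

```python
import heapq


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve_top_k(user_emb, item_embs, k):
    """Exact top-k retrieval by s(u, i) = e_u . e_i over all items."""
    scored = ((dot(user_emb, emb), item_id) for item_id, emb in item_embs.items())
    return [item_id for _, item_id in heapq.nlargest(k, scored)]


# Hypothetical 2-d embeddings; real towers output hundreds of dimensions.
user_emb = [0.9, 0.1]
item_embs = {
    "cooking_video": [1.0, 0.0],
    "soccer_clip": [0.0, 1.0],
    "recipe_post": [0.8, 0.3],
}
top = retrieve_top_k(user_emb, item_embs, k=2)
print(top)  # ['cooking_video', 'recipe_post']
```

The exhaustive scan costs time linear in the number of items per request, which is the cost ANN indexes are meant to avoid.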
Ranking models predict multiple outcomes, not just clicks. A Meta-style feed model may estimate $P(\text{like})$, $P(\text{comment})$, $P(\text{share})$, $P(\text{hide})$, $P(\text{report})$, dwell time, downstream retention, or creator quality. Final utility is often a weighted function:
$U(i,u)=w_1P(\text{like})+w_2P(\text{comment})-w_3P(\text{hide})-w_4P(\text{report})+w_5E[\text{long-term value}]$
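To make the weighted utility concrete, here is a small sketch that combines per-outcome predictions into one score. The weights and predicted probabilities are invented for illustration; in practice the weights would be tuned through experiments.

```python
def utility(preds, weights):
    """Weighted utility: positive outcomes add value, negative feedback subtracts."""
    return (
        weights["like"] * preds["like"]
        + weights["comment"] * preds["comment"]
        - weights["hide"] * preds["hide"]
        - weights["report"] * preds["report"]
        + weights["ltv"] * preds["ltv"]
    )


# Hypothetical weights: negative feedback is penalized far more than a like is rewarded.
weights = {"like": 1.0, "comment": 2.0, "hide": 5.0, "report": 20.0, "ltv": 1.0}

# Hypothetical model outputs for two posts.
clickbait = {"like": 0.30, "comment": 0.10, "hide": 0.05, "report": 0.02, "ltv": 0.0}
friend_post = {"like": 0.20, "comment": 0.05, "hide": 0.01, "report": 0.0, "ltv": 0.1}

u_clickbait = utility(clickbait, weights)
u_friend = utility(friend_post, weights)
print(u_friend > u_clickbait)  # True
```

With these assumed weights the clickbait post has higher predicted likes and comments but a lower utility, because its hide and report probabilities dominate the score.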
Offline ranking metrics differ from online product metrics. Common offline metrics include AUC, log loss, precision@k, recall@k, MAP, MRR, and NDCG. NDCG captures position-weighted relevance:
$DCG@k=\sum_{j=1}^{k}\frac{rel_j}{\log_2(j+1)}, \quad NDCG@k=\frac{DCG@k}{IDCG@k}$
Offline gains may not carry over online, because logged data reflects the previous ranking policy and user behavior shifts once the ranking changes.
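The NDCG definition can be checked with a few lines of code. This sketch uses the linear-gain form of DCG given above and assumes the relevance list contains every judged item for the query, so the ideal ordering is simply that list sorted in descending order.

```python
import math


def dcg_at_k(rels, k):
    """DCG@k with linear gain: sum of rel_j / log2(j + 1) for positions j = 1..k."""
    return sum(rel / math.log2(j + 1) for j, rel in enumerate(rels[:k], start=1))


def ndcg_at_k(rels, k):
    """NDCG@k = DCG@k / IDCG@k, where IDCG uses the best possible ordering."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0


# Graded relevance in the order the ranker returned the items (illustrative labels).
ranked_rels = [3, 2, 0, 1]
score = ndcg_at_k(ranked_rels, k=4)
perfect = ndcg_at_k([3, 2, 1, 0], k=4)
```

Here `perfect` is 1.0 and `score` falls slightly below 1 because the irrelevant item sits above a relevant one.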
Position bias is central. Users click top-ranked content more often because it is visible, not necessarily because it is better. Training directly on clicks can reinforce exposure bias. Mitigations include randomized buckets, inverse propensity scoring $w=1/P(\text{exposure})$, interleaving tests, exploration traffic, or counterfactual learning-to-rank.
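A rough sketch of inverse propensity scoring on click logs follows. The exposure probabilities per position are assumed values (in practice they would be estimated, for example from randomized traffic), and the log rows are invented.

```python
from collections import defaultdict


def ips_click_rates(logs, exposure_prob):
    """Per-item click rate where each click is weighted by 1 / P(exposure at its position)."""
    weighted_clicks = defaultdict(float)
    impressions = defaultdict(int)
    for item, position, clicked in logs:
        impressions[item] += 1
        if clicked:
            weighted_clicks[item] += 1.0 / exposure_prob[position]
    return {item: weighted_clicks[item] / n for item, n in impressions.items()}


# Assumed probability that a user actually looks at each position.
exposure_prob = {1: 1.0, 2: 0.5, 3: 0.25}

# Hypothetical logs: (item, position shown, clicked?)
logs = [
    ("A", 1, True), ("A", 1, True), ("A", 1, False), ("A", 1, False),
    ("B", 3, True), ("B", 3, False), ("B", 3, False), ("B", 3, False),
]
rates = ips_click_rates(logs, exposure_prob)
print(rates)  # {'A': 0.5, 'B': 1.0}
```

Item B has the lower raw click rate (0.25 versus 0.5) but the higher debiased estimate, since it was only shown where few users look. Small propensities inflate variance, so weights are commonly clipped in practice.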
Recommenders create feedback loops. If the system shows more videos, it observes more video engagement and may conclude users only want video. This can reduce content diversity, creator reach, or long-term satisfaction. Good systems add exploration, freshness boosts, diversity constraints, and long-term guardrail metrics.
Cold start has separate user and item variants. For new users, rely on onboarding signals, geography, demographics if allowed, device/language, social graph, and popularity priors. For new items, use creator reputation, content embeddings, text/image/video features, early engagement velocity, and controlled exploration.
Personalization should be evaluated across segments, not only globally. A treatment can improve average watch time while hurting new users, low-activity users, teens, creators, or users in smaller markets. Always check heterogeneous treatment effects, fairness, integrity slices, and whether gains are concentrated among already-heavy users.
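A minimal illustration of why segment cuts matter, using invented watch-time numbers: the overall lift is positive while one segment regresses. The segment names and the `lift_by_segment` helper are hypothetical, and the output is a point estimate only; a real analysis would add confidence intervals.

```python
from collections import defaultdict
from statistics import mean


def lift_by_segment(rows):
    """rows: (segment, arm, metric); returns treatment mean minus control mean per segment."""
    buckets = defaultdict(lambda: {"control": [], "treatment": []})
    for segment, arm, value in rows:
        buckets[segment][arm].append(value)
    return {
        seg: mean(arms["treatment"]) - mean(arms["control"])
        for seg, arms in buckets.items()
    }


# Hypothetical per-user watch time (minutes) from an A/B test.
rows = [
    ("heavy", "control", 60), ("heavy", "control", 62),
    ("heavy", "treatment", 70), ("heavy", "treatment", 72),
    ("new", "control", 10), ("new", "control", 12),
    ("new", "treatment", 8), ("new", "treatment", 10),
]
lifts = lift_by_segment(rows)
overall = mean(v for _, a, v in rows if a == "treatment") - mean(
    v for _, a, v in rows if a == "control"
)
print(lifts, overall)  # {'heavy': 10, 'new': -2} 4
```

The global average improves by 4 minutes, yet new users lose 2 minutes: the gain is concentrated among already-heavy users.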
Online experimentation is the gold standard for product impact. Ranking A/B tests should track primary metrics such as session time, meaningful interactions, retention, or revenue, plus guardrails like hides, reports, unfollows, latency, crash rate, diversity, and creator distribution. Network effects and interference are common because users, creators, and content are interconnected.
Latency and freshness are first-class constraints. A feed ranking system may need to return results in tens to hundreds of milliseconds. Heavy features may be precomputed; real-time features like recent clicks or impressions may be stored in low-latency feature stores. There is a tradeoff between model complexity, feature freshness, and serving reliability.
Diversity and exploration are often handled after scoring. Techniques include maximal marginal relevance, topic caps, source caps, deduplication, freshness boosts, and bandit exploration. MMR balances relevance and novelty:
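$\arg\max_{i \notin S} \left[\lambda \cdot \text{score}(i) - (1-\lambda)\max_{j \in S} \text{sim}(i,j)\right]$
where $S$ is the already-selected set and $\lambda \in [0,1]$ controls the tradeoff: $\lambda = 1$ ranks purely by score, while smaller values penalize similarity to items already chosen.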
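The MMR rule can be sketched as a greedy loop: at each step, pick the unselected item with the best relevance-minus-redundancy score. The scores, topic labels, and similarity function below are illustrative assumptions; the redundancy term is taken as 0 while nothing has been selected yet.

```python
def mmr_select(scores, sim, k, lam=0.7):
    """Greedy MMR: repeatedly add the item maximizing lam*score - (1-lam)*max sim to selected."""
    selected = []
    remaining = set(scores)
    while remaining and len(selected) < k:
        def mmr(i):
            # Redundancy is 0 when nothing has been selected yet.
            redundancy = max((sim(i, j) for j in selected), default=0.0)
            return lam * scores[i] - (1 - lam) * redundancy

        best = max(sorted(remaining), key=mmr)  # sorted() makes tie-breaking deterministic
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical ranker scores and topics; similarity is 1 for same topic, else 0.
scores = {"cat_video_1": 0.9, "cat_video_2": 0.85, "news_post": 0.6}
topic = {"cat_video_1": "cats", "cat_video_2": "cats", "news_post": "news"}


def same_topic(i, j):
    return 1.0 if topic[i] == topic[j] else 0.0


diverse = mmr_select(scores, same_topic, k=2, lam=0.5)
pure_relevance = mmr_select(scores, same_topic, k=2, lam=1.0)
print(diverse)         # ['cat_video_1', 'news_post']
print(pure_relevance)  # ['cat_video_1', 'cat_video_2']
```

With `lam=0.5` the second cat video is displaced by the lower-scored news post; with `lam=1.0` the selection reduces to plain score ordering.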
Long-term value is harder than short-term engagement. Clickbait can maximize immediate CTR while reducing trust or retention. Better objectives include survey-based quality labels, “meaningful social interactions,” repeat usage, negative feedback rates, and delayed outcomes. Strong candidates explicitly separate short-term proxy metrics from the true product goal.

Worked example

Design a ranking system for Facebook News Feed.

A strong candidate would start by clarifying the product goal: are we optimizing for meaningful engagement, time spent, friend connection, content discovery, or retention, and are there constraints around integrity, ads, or creator distribution? They would also ask what surface is being ranked, what inventory is eligible, and whether the system is for logged-in users with rich histories or includes cold-start users. The answer should then be organized around four pillars: candidate generation, ranking objective, evaluation, and iteration/monitoring. For candidate generation, they might mention retrieving posts from friends, followed pages, groups, recommendations, and fresh popular content, then filtering blocked, seen, policy-violating, or low-quality items. For ranking, they would propose predicting multiple outcomes such as comments, reactions, shares, hides, reports, dwell time, and survey quality, then combining them into a utility score aligned with product goals. For evaluation, they would separate offline metrics like log loss and NDCG from online A/B metrics like meaningful interactions per user, day-7 retention, negative feedback, latency, and content diversity. One explicit tradeoff to flag is engagement versus long-term quality: a model that boosts outrage or clickbait may improve comments but worsen hides, reports, or future retention. They should also mention feedback loops: if the feed overexposes one content type, future training data becomes biased toward that type. A strong close would be: “If I had more time, I’d go deeper on counterfactual evaluation, creator-side ecosystem metrics, and how to tune the utility weights through experiments rather than intuition.”

A second angle

Recommend Groups to users.

The same ranking concepts apply, but the unit being ranked is not a transient post; it is a semi-permanent community with long-term consequences for user experience. Candidate generation might come from friends’ memberships, location, interests, search behavior, similar-user embeddings, or group-topic embeddings. The objective should treat join probability as only one part of value; better metrics include active participation after joining and long-term retention, with muted notifications, exits, and reports as negative signals. Cold start is more prominent because new or niche groups may have sparse engagement data, so content/topic understanding and early quality signals matter more. The system also needs stronger guardrails around integrity, safety, and recommendation eligibility because recommending a harmful group can be worse than ranking one low-quality feed post.

Common pitfalls

Analytical mistake: optimizing only for CTR or watch time.
A tempting answer is “rank by predicted click probability,” but that ignores negative feedback, long-term retention, user trust, and ecosystem effects. A better answer defines a multi-objective utility function and names primary, secondary, and guardrail metrics before discussing models.

Communication mistake: jumping into algorithms before clarifying the goal.
Candidates often start with “I’d use collaborative filtering or a neural network” without asking what success means. In a Meta interview, framing matters: first clarify the surface, user population, inventory, constraints, and product objective, then choose an algorithm that fits those constraints.

Depth mistake: treating offline accuracy as proof of product impact.
Improving AUC or NDCG does not guarantee a better feed because logged data is biased by the previous ranking policy. Stronger answers mention online A/B tests, exploration data, segment analysis, counterfactual evaluation, and guardrails like reports, hides, latency, and diversity.

Connections

Interviewers may pivot from ranking into experimentation, especially how to design A/B tests when network effects or creator-side interference exist. They may also probe causal inference, metric design, fairness, integrity, ads auction dynamics, or ML system design topics such as feature stores, ANN retrieval, and model monitoring.