Recommender And Ranking Systems

What's being tested

Interviewers are probing whether you can reason about ranking quality as a Data Scientist: define the objective, choose labels and features, evaluate models offline and online, and diagnose tradeoffs between users, creators, businesses, and advertisers. For Meta, recommendations sit inside high-stakes surfaces like Feed, Ads, Shops, and local discovery, where small ranking changes can affect CTR, CVR, revenue, session quality, and long-term retention. A strong answer does not just say “train a model to predict clicks”; it explains what behavior the system should optimize, how bias enters the data, how to evaluate counterfactual changes, and how to guard against feedback loops. The interviewer is looking for structured product-statistical thinking, not low-level serving architecture.

Core knowledge

Two-stage recommender architecture is the default framing: candidate generation retrieves hundreds or thousands of plausible items, then ranking scores and orders them. As a DS, focus on whether each stage preserves recall for valuable items and whether offline evaluation matches online product goals.
Objective design should translate product intent into a scoring function, not blindly maximize clicks. For ads, a common utility is expected value:
$\text{score} = P(\text{click}) \times P(\text{conversion} \mid \text{click}) \times \text{bid} \times \text{quality adjustment}$
For Feed or local recommendations, utility may combine engagement, satisfaction, freshness, diversity, and negative feedback.
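The ads expected-value score above can be sketched in a few lines. This is a minimal illustration, not a production auction: the `AdCandidate` fields, the toy numbers, and the idea of a single multiplicative `quality` term are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class AdCandidate:
    ad_id: str
    p_click: float             # predicted P(click)
    p_conv_given_click: float  # predicted P(conversion | click)
    bid: float                 # advertiser bid per conversion (assumed unit)
    quality: float             # hypothetical multiplier: 1.0 neutral, < 1.0 penalized


def expected_value(ad: AdCandidate) -> float:
    """score = P(click) * P(conversion | click) * bid * quality adjustment."""
    return ad.p_click * ad.p_conv_given_click * ad.bid * ad.quality


def rank_ads(ads):
    """Order candidates by expected value, highest first."""
    return sorted(ads, key=expected_value, reverse=True)


# Toy candidates (made-up numbers): the highest-CTR ad does not win.
ads = [
    AdCandidate("clickbait", p_click=0.10, p_conv_given_click=0.01, bid=20.0, quality=0.5),
    AdCandidate("relevant", p_click=0.04, p_conv_given_click=0.10, bid=20.0, quality=1.0),
    AdCandidate("high_bid", p_click=0.02, p_conv_given_click=0.05, bid=50.0, quality=1.0),
]
ranked = rank_ads(ads)
print([a.ad_id for a in ranked])
```

In this toy setup the ad with the highest click probability ranks last, which is the point of scoring on expected value rather than CTR alone.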
Label choice changes the system’s incentives. Optimizing CTR favors curiosity and clickbait; optimizing dwell time can favor addictive or low-quality content; optimizing purchases can under-rank discovery content. Strong answers discuss primary labels, guardrail labels, delayed labels, and negative labels like hides, reports, skips, unsubscribes, and “not interested.”
Feature families usually include user features, item features, context features, and cross features. Examples: historical engagement rate, user-category affinity, item freshness, creator quality, geo distance, price bucket, social proof, and query/session context. For cold start, item metadata and population-level priors matter more than personalized history.
Collaborative filtering captures “users like you liked items like this,” using matrix factorization, nearest neighbors, or embeddings. It works well with dense interaction data but struggles with new users, new items, and popularity bias. A DS should name these limitations and pair it with content-based or contextual signals.
Learning-to-rank methods optimize ordered lists rather than independent predictions. Pointwise models predict outcomes like CTR; pairwise methods learn preferences between items; listwise methods optimize ranking losses closer to NDCG or MAP. In product interviews, explain why top-of-list quality often matters more than average prediction error.
Calibration matters when scores feed auctions, pacing, or multi-objective tradeoffs. A model with good AUC can still overpredict probabilities for some cohorts. Use calibration curves, expected calibration error, and segment-level reliability checks; for ads, miscalibration can overcharge advertisers or distort delivery.
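As a rough sketch of the expected calibration error check mentioned above, the function below uses equal-width probability bins and weights each bin's gap by its share of examples. The bin count and the toy inputs are assumptions; in practice the same function could be called once per cohort to get segment-level reliability.

```python
import numpy as np


def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Weighted average of |mean predicted prob - observed rate| over equal-width bins."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    # Assign each prediction to a bin on [0, 1]; a prediction of exactly 1.0 goes in the last bin.
    bin_ids = np.minimum((y_prob * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue  # empty bins contribute nothing
        gap = abs(y_prob[mask].mean() - y_true[mask].mean())
        ece += mask.mean() * gap  # mask.mean() is the fraction of examples in this bin
    return float(ece)


# Toy cohorts (made-up data): both have a 20% observed rate.
labels = [1, 1] + [0] * 8
well_calibrated = expected_calibration_error(labels, [0.2] * 10)
overpredicting = expected_calibration_error(labels, [0.8] * 10)
print(well_calibrated, overpredicting)
```

A constant prediction gives no discrimination at all, so the example isolates calibration: the second cohort's predictions are off by 0.6 on average even though ranking quality is identical.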
Offline metrics should match the task: AUC for discrimination, log_loss for probabilistic accuracy, NDCG@K or MRR for ranked lists, recall@K for candidate generation, and calibration metrics for probability estimates. Always segment by new users, new items, geography, device, language, advertiser size, and high/low activity users.
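To make two of the ranking metrics concrete, here is a small sketch of NDCG@K and recall@K. It assumes linear gain (relevance used directly rather than $2^{rel} - 1$) and a log-base-2 discount; other conventions exist, so treat the exact formula as one common choice rather than the only definition.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top k positions (linear gain, log2 discount)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG of the list as ranked, divided by the DCG of the ideal ordering of the same items."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant items that appear in the top k retrieved items."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant) / len(relevant)


# One relevant item ranked first versus third.
print(ndcg_at_k([1, 0, 0], k=3))
print(ndcg_at_k([0, 0, 1], k=3))
print(recall_at_k(["a", "b", "c"], ["a", "d"], k=2))
```

Moving the single relevant item from position one to position three halves NDCG in this example, which illustrates why top-of-list quality dominates ranked-list metrics.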
Online metrics need a metric hierarchy: primary success metric, guardrails, and diagnostics. For ads this may include revenue, CTR, CVR, advertiser ROAS, user negative feedback, ad hide rate, and session engagement. For organic recommendations, include retention, meaningful interactions, diversity, and integrity metrics.
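Selection bias and position bias are central. Logged data reflects what the old ranker chose to show and where it placed items. Corrective tools include randomized buckets, interleaving tests, and inverse propensity scoring (IPS). For a deterministic candidate policy $\pi$ evaluated on $n$ logged impressions, where $a_i$ is the logged action in context $x_i$, $r_i$ is the observed reward, and $p_i$ is the probability with which the logging policy chose $a_i$:
$\hat{V}_{IPS} = \frac{1}{n}\sum_i \frac{\mathbb{1}(a_i=\pi(x_i)) r_i}{p_i}$
Doubly robust estimators, which combine IPS with a reward model, help when propensities are noisy or IPS variance is high.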
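The IPS formula can be sketched directly in code. This is a minimal version for a deterministic candidate policy; the function name, argument names, and toy logs are assumptions for illustration, and real logs would need clipped or validated propensities.

```python
import numpy as np


def ips_estimate(logged_actions, target_actions, rewards, propensities):
    """Mean of 1(a_i == pi(x_i)) * r_i / p_i over logged impressions."""
    a = np.asarray(logged_actions)
    t = np.asarray(target_actions)  # pi(x_i): what the candidate policy would have shown
    r = np.asarray(rewards, dtype=float)
    p = np.asarray(propensities, dtype=float)  # logging policy's P(a_i | x_i)
    if np.any(p <= 0):
        raise ValueError("propensities must be strictly positive")
    match = (a == t).astype(float)
    return float(np.mean(match * r / p))


# Toy logs (made-up): the logging policy picked uniformly between two items, so p_i = 0.5.
value = ips_estimate(
    logged_actions=[0, 1, 0, 1],
    target_actions=[0, 0, 1, 1],
    rewards=[1, 0, 1, 1],
    propensities=[0.5, 0.5, 0.5, 0.5],
)
print(value)
```

Only impressions where the logged action matches the candidate policy contribute, and each is up-weighted by the inverse of how likely the logger was to take that action. Small propensities therefore inflate variance, which is one motivation for doubly robust variants.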
Exploration versus exploitation is unavoidable. A purely exploitative ranker overuses known winners and starves new items. Contextual bandits, epsilon-greedy exploration, Thompson sampling, or randomized traffic slices can collect evidence with known selection probabilities, which makes later debiasing possible, but the DS must quantify user-experience cost and monitor guardrails.
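One way to picture epsilon-greedy exploration is a single-slot picker that also records the probability of the item it chose, so the log can later support propensity-weighted evaluation. This is a sketch under simplifying assumptions (one slot, uniform exploration over all candidates, hypothetical function name), not a description of any production system.

```python
import random


def epsilon_greedy_pick(scores, epsilon, rng):
    """Return (chosen_index, propensity) for one slot under epsilon-greedy."""
    n = len(scores)
    greedy = max(range(n), key=lambda i: scores[i])
    if rng.random() < epsilon:
        chosen = rng.randrange(n)  # explore: uniform over all candidates
    else:
        chosen = greedy            # exploit: highest model score
    # Every item has probability epsilon / n from exploration;
    # the greedy item additionally gets the 1 - epsilon exploit mass.
    propensity = epsilon / n + (1.0 - epsilon if chosen == greedy else 0.0)
    return chosen, propensity


rng = random.Random(0)
scores = [0.1, 0.9, 0.3]
print(epsilon_greedy_pick(scores, epsilon=0.1, rng=rng))
```

The value of `epsilon` is the knob for the user-experience cost: a larger value gives new items more exposure and lower-variance estimates, at the price of more impressions spent on lower-scored items.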
Cold start and feedback loops require explicit mitigation. For new restaurants, hashtags, shops, or ads, use metadata, geo/category priors, creator/merchant reputation, image/text embeddings, and controlled exploration. Watch for rich-get-richer dynamics where early exposure creates engagement, which creates more exposure, independent of intrinsic quality.
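A simple way to apply a category or geo prior to a new item is to shrink its observed engagement rate toward the prior, with the prior counting as a fixed number of pseudo-impressions. The function below is a minimal sketch of that idea; the name `smoothed_rate`, the `prior_strength` parameter, and the numbers are assumptions for illustration.

```python
def smoothed_rate(clicks, impressions, prior_rate, prior_strength):
    """Shrink an observed rate toward a prior worth `prior_strength` pseudo-impressions."""
    if prior_strength <= 0:
        raise ValueError("prior_strength must be positive")
    if clicks < 0 or impressions < 0 or clicks > impressions:
        raise ValueError("need 0 <= clicks <= impressions")
    return (clicks + prior_rate * prior_strength) / (impressions + prior_strength)


# A brand-new item falls back to the category prior.
print(smoothed_rate(clicks=0, impressions=0, prior_rate=0.05, prior_strength=100))
# An item with a lucky early streak (30 clicks in 100 impressions) is pulled toward the prior.
print(smoothed_rate(clicks=30, impressions=100, prior_rate=0.05, prior_strength=100))
```

Shrinkage of this kind also dampens rich-get-richer dynamics, because a handful of early clicks cannot push an item's estimated rate far from the prior until enough impressions accumulate.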

Worked example

For “Design an ad recommendation and ranking system,” a strong first 30 seconds would clarify the surface, advertiser objective, auction constraints, and whether success is user engagement, conversions, revenue, or long-term ad quality. I would state an assumption: we are ranking eligible ads for a user impression and want to maximize expected platform value while protecting user experience and advertiser outcomes. I would organize the answer into four pillars: objective and metrics, candidate generation and ranking signals, model evaluation, and experimentation/diagnosis. The objective could be framed as expected utility using predicted click probability, conversion probability, bid, and quality penalties for hides, reports, or low landing-page quality. For evaluation, I would separate offline checks like AUC, log_loss, calibration, and segment performance from online A/B metrics like revenue per impression, CTR, CVR, advertiser ROAS, ad hide rate, and session-level engagement. I would explicitly flag the tradeoff between short-term revenue and long-term user trust: a model that increases ad clicks but also increases hides or reduces session depth may not be launchable. I would also discuss selection bias because the training data comes from the previous ad ranker, so offline replay can be misleading without randomized exploration or propensity-aware evaluation. I would close by saying that, with more time, I would drill into calibration by advertiser segment, fairness of delivery across small and large advertisers, and long-term incrementality rather than attributing every observed conversion to the ad.

A second angle

For “Design a restaurant recommender under cold start,” the same ranking framework applies, but the hardest constraint shifts from auction value to sparse signal quality. A new restaurant may have no clicks, ratings, saves, or visits, so the DS should lean on content and context: cuisine, price range, location, hours, photos, menu text, neighborhood popularity, and similarity to restaurants the user already likes. The objective should balance personalization with exploration because showing only established restaurants will prevent new restaurants from ever collecting evidence. Offline metrics like recall@K and NDCG@K are useful, but online evaluation should include saves, map clicks, calls, reservations, visit proxies, negative feedback, and geographic coverage. The key tradeoff is between relevance and diversity: the best list may not be ten nearly identical pizza places, even if each has high predicted engagement.
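The relevance versus diversity tradeoff can be illustrated with a greedy re-rank that discounts an item's score for each already-selected item in the same category. This is one hypothetical heuristic among several (it is not a specific published method), and the penalty value and restaurant data are made up.

```python
def diversify(candidates, k, penalty=0.3):
    """Greedy re-rank of (item_id, category, score) tuples with a per-category repeat penalty."""
    remaining = list(candidates)
    selected = []
    seen = {}  # category -> number of items already selected from it
    while remaining and len(selected) < k:
        # Adjusted score drops by `penalty` for each prior pick from the same category.
        best = max(remaining, key=lambda c: c[2] - penalty * seen.get(c[1], 0))
        selected.append(best)
        seen[best[1]] = seen.get(best[1], 0) + 1
        remaining.remove(best)
    return selected


restaurants = [
    ("pizza_a", "pizza", 0.90),
    ("pizza_b", "pizza", 0.88),
    ("pizza_c", "pizza", 0.86),
    ("thai_a", "thai", 0.70),
    ("taco_a", "mexican", 0.65),
]
top3 = diversify(restaurants, k=3)
print([item_id for item_id, _, _ in top3])
```

Sorting by raw score would return three pizza places; with the penalty, the list keeps the best pizza option and fills the remaining slots from other cuisines. The right penalty is an empirical question to settle with online metrics.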

Common pitfalls

Pitfall: Treating the problem as “maximize CTR with the most accurate model.”

This is the classic analytical mistake. Clicks are easy to observe but often misaligned with satisfaction, purchase quality, advertiser value, or long-term retention. A better answer starts with a utility function and a metric hierarchy, then explains why CTR is one component rather than the sole target.

Pitfall: Jumping into model architecture before defining labels, metrics, and bias.

Saying “I’d use a deep neural network with embeddings” sounds technical but can miss the DS role. Interviewers want to hear how you would construct labels, handle delayed conversions, measure calibration, segment results, and design the experiment. Model choice should serve the measurement problem, not replace it.

Pitfall: Ignoring logged-policy bias and feedback loops.

Offline data is not an unbiased sample of all possible recommendations; it is the outcome of previous ranking decisions. If you train only on exposed items, the model may learn that popular items are inherently better because unpopular items were never shown. Strong candidates mention exploration, randomized holdouts, propensity weighting, and segment-level monitoring.

Connections

Interviewers often pivot from recommender systems into experimentation, especially A/B test design, guardrail metrics, heterogeneous treatment effects, and launch decisions. They may also move into causal inference, including incrementality, attribution, counterfactual evaluation, and position-bias correction. Adjacent ML topics include calibration, learning-to-rank, multi-objective optimization, and cold-start analysis.
