Recommender, Ranking, And Ads ML Systems

What's being tested

Meta is testing whether you can reason about ranking systems as a Data Scientist: define the right objective, construct trustworthy labels, evaluate models offline and online, and diagnose why a recommendation change helped or hurt users, advertisers, or creators. The interviewer is not looking for a low-level serving architecture; they are probing whether you can choose between CTR, CVR, expected value, retention, marketplace quality, and long-term satisfaction under real constraints. Strong answers connect model choices to causal validity, selection bias, calibration, cold start, and experiment design. Meta cares because small ranking changes in Ads, Feed, Shops, or recommendations can shift billions of impressions and create second-order effects across users, advertisers, and content ecosystems.

Core knowledge

Objective design is the first decision. Ads ranking often optimizes expected value:
$\text{score} = P(\text{click}) \times P(\text{conversion} \mid \text{click}) \times \text{value} \times \text{quality adjustment}$
A pure CTR objective can over-rank clickbait, low-margin products, or ads that cannibalize organic actions.
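A minimal sketch of the difference between CTR ranking and expected-value ranking. The `AdCandidate` class, the ad names, and every number are hypothetical and chosen only to illustrate the formula above; real systems would take these probabilities from calibrated models.

```python
from dataclasses import dataclass


@dataclass
class AdCandidate:
    ad_id: str
    p_click: float             # predicted P(click)
    p_conv_given_click: float  # predicted P(conversion | click)
    value: float               # value of one conversion (illustrative units)
    quality: float = 1.0       # multiplicative quality adjustment


def expected_value(ad):
    # score = P(click) * P(conversion | click) * value * quality adjustment
    return ad.p_click * ad.p_conv_given_click * ad.value * ad.quality


def rank_by(candidates, key):
    # Highest score first; returns ad ids only.
    return [ad.ad_id for ad in sorted(candidates, key=key, reverse=True)]


# Hypothetical candidates: the "clickbait" ad has the highest CTR
# but converts poorly and carries a quality penalty.
candidates = [
    AdCandidate("clickbait", p_click=0.10, p_conv_given_click=0.01, value=20.0, quality=0.8),
    AdCandidate("relevant", p_click=0.04, p_conv_given_click=0.10, value=30.0),
    AdCandidate("niche", p_click=0.02, p_conv_given_click=0.15, value=25.0),
]

ctr_order = rank_by(candidates, key=lambda ad: ad.p_click)
ev_order = rank_by(candidates, key=expected_value)
print("CTR ranking:", ctr_order)
print("EV ranking: ", ev_order)
```

In this toy data the CTR ranking puts the clickbait ad first, while the expected-value ranking puts it last, which is the failure mode a pure CTR objective invites.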
Multi-stage ranking usually separates candidate retrieval from final ranking. Collaborative filtering, two-tower embeddings, or approximate nearest neighbor methods retrieve hundreds or thousands of candidates; learning-to-rank models such as XGBoost, LightGBM, DLRM-style models, or neural rankers rescore the shortlist.
Collaborative filtering works when user-item interactions are dense enough. Matrix factorization learns a user vector $u_i$ and an item vector $v_j$ and predicts $\hat r_{ij}=u_i^\top v_j$. Embeddings remain practical for catalogs with millions of items; under extreme sparsity or cold start, the model needs content, context, popularity, or exploration signals instead.
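To make the dot-product prediction concrete, here is a toy sketch of matrix factorization trained by stochastic gradient descent on a single observed rating. The learning rate, L2 penalty, vector size, and starting values are assumptions for illustration, not recommended settings.

```python
def predict(user_vec, item_vec):
    # r_hat = u^T v
    return sum(u * v for u, v in zip(user_vec, item_vec))


def sgd_step(user_vec, item_vec, rating, lr=0.05, reg=0.01):
    # One gradient step on squared error with an L2 penalty (assumed hyperparameters).
    err = rating - predict(user_vec, item_vec)
    new_user = [u + lr * (err * v - reg * u) for u, v in zip(user_vec, item_vec)]
    new_item = [v + lr * (err * u - reg * v) for u, v in zip(user_vec, item_vec)]
    return new_user, new_item


# Toy setup: one user, one item, one observed rating of 1.0.
user_vec, item_vec = [0.1, 0.1], [0.1, 0.1]
rating = 1.0

initial_error = abs(rating - predict(user_vec, item_vec))
for _ in range(500):
    user_vec, item_vec = sgd_step(user_vec, item_vec, rating)
final_error = abs(rating - predict(user_vec, item_vec))

print(f"error before: {initial_error:.3f}, after: {final_error:.3f}")
```

A user or item with no interactions never receives a gradient update, which is why cold start needs the other signals listed above.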
Label construction determines what the model really learns. A click label is fast but noisy; conversion or purchase labels are delayed and sparse; dwell time or saves may reflect deeper value. For ads, delayed attribution windows create censoring, so compare models using consistent attribution logic.
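One way to keep attribution logic consistent is to make censoring explicit when building labels. The sketch below assumes a 7-day click-through attribution window and a hypothetical `conversion_label` helper; the actual window and attribution rules depend on the product.

```python
from datetime import datetime, timedelta


def conversion_label(click_time, conversion_time, snapshot_time,
                     window=timedelta(days=7)):
    """Return 1 (converted), 0 (window closed, no conversion), or None (censored).

    The 7-day window is an assumption for illustration.
    """
    window_end = click_time + window
    # A conversion only counts if it was already observed at snapshot time.
    observed = conversion_time is not None and conversion_time <= snapshot_time
    if observed and click_time <= conversion_time <= window_end:
        return 1
    if snapshot_time < window_end:
        # The window is still open, so a 0 here would be a premature negative.
        return None
    return 0


snapshot = datetime(2024, 1, 10)
print(conversion_label(datetime(2024, 1, 1), datetime(2024, 1, 3), snapshot))  # converted
print(conversion_label(datetime(2024, 1, 1), None, snapshot))                  # closed window
print(conversion_label(datetime(2024, 1, 8), None, snapshot))                  # censored
```

Dropping or separately modeling the censored rows avoids training on recent clicks as if they were confirmed non-converters.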
Position bias is unavoidable in feeds and ads. Observed clicks are conditional on exposure and rank, not just relevance. Use randomized buckets, interleaving, swap experiments, or inverse propensity scoring:
$\hat V_{\text{IPS}} = \frac{1}{n}\sum_i \frac{\mathbb{1}(a_i=\pi(x_i))y_i}{p(a_i \mid x_i)}$
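when evaluating counterfactual ranking policies. Here $\pi$ is the candidate policy, $a_i$ is the logged action for context $x_i$, $y_i$ is the observed reward, and $p(a_i \mid x_i)$ is the logging policy's propensity for that action.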
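A minimal sketch of the IPS estimator for a deterministic target policy. The log format (context, logged action, reward, propensity) and the example contexts and ad names are assumptions for illustration.

```python
def ips_value(logs, target_policy):
    """Estimate the value of `target_policy` from logged data.

    logs: list of (context, logged_action, reward, propensity) tuples, where
    propensity is the logging policy's probability of the logged action.
    """
    if not logs:
        raise ValueError("logs must be non-empty")
    total = 0.0
    for context, action, reward, propensity in logs:
        if propensity <= 0:
            raise ValueError("propensities must be positive")
        # Only rows where the target policy agrees with the logged action count.
        if action == target_policy(context):
            total += reward / propensity
    return total / len(logs)


# Hypothetical logs from uniform randomization over two ads (propensity 0.5).
logs = [
    ("mobile", "ad_a", 1, 0.5),
    ("mobile", "ad_b", 0, 0.5),
    ("desktop", "ad_b", 1, 0.5),
    ("desktop", "ad_a", 0, 0.5),
]


def personalized(context):
    return "ad_a" if context == "mobile" else "ad_b"


def always_a(context):
    return "ad_a"


print(ips_value(logs, personalized))
print(ips_value(logs, always_a))
```

With small propensities the weights become large and the estimate noisy, so in practice the randomized slice needs enough traffic on every action the candidate policy might choose.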
Offline metrics should match the business question. Use AUC for discrimination, log loss for probabilistic accuracy, ECE or calibration plots for calibrated probabilities, NDCG@K or MAP@K for ranked relevance, and expected revenue or utility-weighted metrics for ads and commerce.
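For reference, two of these metrics fit in a few lines of plain Python. This sketch uses linear gain for NDCG (a common alternative is $2^{rel}-1$) and computes the ideal ordering from the same list, which assumes the list contains all relevant candidates.

```python
import math


def dcg_at_k(relevances, k):
    # Discounted cumulative gain with linear gain and a log2 position discount.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    # relevances are listed in the order the model ranked the items.
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def log_loss(labels, probs, eps=1e-15):
    # Mean binary cross-entropy; probabilities are clipped to avoid log(0).
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)


print(ndcg_at_k([3, 2, 1], k=3))   # already in ideal order
print(ndcg_at_k([0, 0, 1], k=3))   # the only relevant item is ranked last
print(log_loss([1, 0], [0.5, 0.5]))
```

NDCG depends only on ordering, while log loss penalizes miscalibrated probabilities, so the two can disagree about which model is better.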
Calibration matters because ranking scores often multiply probabilities by bids or values. A model with good AUC but poorly calibrated pCTR can misprice ads or distort auctions. Segment calibration by device, geography, advertiser size, new versus returning users, and high- versus low-frequency users.
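A sketch of expected calibration error (ECE) computed per segment, using equal-width probability bins. The bin count, segment names, and toy rows are assumptions for illustration.

```python
from collections import defaultdict


def expected_calibration_error(probs, labels, n_bins=10):
    # Weighted average of |mean predicted prob - observed rate| over equal-width bins.
    if not probs:
        raise ValueError("probs must be non-empty")
    bins = defaultdict(list)
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    ece = 0.0
    for rows in bins.values():
        avg_p = sum(p for p, _ in rows) / len(rows)
        avg_y = sum(y for _, y in rows) / len(rows)
        ece += len(rows) / len(probs) * abs(avg_p - avg_y)
    return ece


def ece_by_segment(rows, n_bins=10):
    # rows: iterable of (segment, predicted_prob, label)
    grouped = defaultdict(lambda: ([], []))
    for segment, prob, label in rows:
        probs, labels = grouped[segment]
        probs.append(prob)
        labels.append(label)
    return {seg: expected_calibration_error(p, y, n_bins)
            for seg, (p, y) in grouped.items()}


# Toy data: both segments have a 20% observed click rate,
# but the model predicts 0.2 on mobile and 0.8 on desktop.
rows = [("mobile", 0.2, y) for y in (1, 0, 0, 0, 0)]
rows += [("desktop", 0.8, y) for y in (1, 0, 0, 0, 0)]
segment_ece = ece_by_segment(rows)
print(segment_ece)
```

A pooled calibration plot could hide the desktop problem if mobile traffic dominates, which is the reason to segment.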
Cold start requires different evidence. For new users, rely on geography, language, device, onboarding interests, session context, and population priors. For new items, shops, restaurants, or hashtags, use metadata, text/image embeddings, creator/shop quality, category popularity, and controlled exploration.
Exploration versus exploitation is a product and measurement issue, not just an algorithmic one. Greedy ranking reinforces popularity and starves new candidates of impressions. Bandit-style exploration, randomized traffic slices, or exploration quotas help estimate item quality while limiting user-experience cost.
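The starvation effect can be shown with a small simulation. This sketch uses epsilon-greedy selection as one simple stand-in for bandit-style exploration; the item names, quality estimates, and the 10% exploration rate are hypothetical.

```python
import random
from collections import Counter


def choose_item(estimated_quality, epsilon, rng):
    # With probability epsilon show a uniformly random item; otherwise show the
    # item with the highest current quality estimate.
    if rng.random() < epsilon:
        return rng.choice(sorted(estimated_quality))
    return max(estimated_quality, key=estimated_quality.get)


def simulate_impressions(estimated_quality, epsilon, n_requests, seed=0):
    rng = random.Random(seed)
    return Counter(choose_item(estimated_quality, epsilon, rng)
                   for _ in range(n_requests))


# A new item starts with an estimate of 0.0 because it has no history yet.
quality = {"popular": 0.08, "established": 0.05, "new_item": 0.0}

greedy = simulate_impressions(quality, epsilon=0.0, n_requests=1000)
explore = simulate_impressions(quality, epsilon=0.1, n_requests=1000)
print("greedy: ", dict(greedy))
print("explore:", dict(explore))
```

Under greedy selection the new item receives no impressions, so its estimate can never improve; the exploration slice buys that evidence at a bounded cost to the user experience.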
Marketplace constraints often create objective tradeoffs. Ads and shop recommendations must balance user experience, advertiser value, seller diversity, policy quality, and revenue. A strong DS states the primary metric, guardrails like hide/report rate or low-quality clicks, and fairness or ecosystem diagnostics.
Online experiments are the decision standard. Launch decisions should examine primary metrics, guardrails, heterogeneous treatment effects, novelty effects, and interference. In Feed or Ads, one user’s treatment can affect advertisers or creators, so cluster-level or marketplace-level diagnostics may be necessary.
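As a baseline readout, the sketch below computes a difference in conversion rates with a normal-approximation confidence interval. It assumes independent user-level randomization, which is exactly the assumption that interference in Feed or Ads can break; the counts are made up.

```python
import math


def diff_in_rates(conv_treat, n_treat, conv_ctrl, n_ctrl, z=1.96):
    """Return (difference, (ci_low, ci_high)) for treatment minus control.

    Uses an unpooled standard error and z=1.96 for an approximate 95% interval.
    """
    p_t = conv_treat / n_treat
    p_c = conv_ctrl / n_ctrl
    diff = p_t - p_c
    se = math.sqrt(p_t * (1 - p_t) / n_treat + p_c * (1 - p_c) / n_ctrl)
    return diff, (diff - z * se, diff + z * se)


# Hypothetical counts: 11% vs 10% conversion on 10,000 users per arm.
diff, (low, high) = diff_in_rates(1100, 10_000, 1000, 10_000)
print(f"lift = {diff:.4f}, 95% CI = ({low:.4f}, {high:.4f})")
```

Running the same function per segment gives a first look at heterogeneous treatment effects, though many segment cuts raise the multiple-comparisons risk.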
Failure diagnosis should be segment-first. If CTR rises but CVR falls, suspect lower-intent clicks or clickbait. If revenue rises but user engagement falls, inspect ad load, frequency, relevance, and fatigue. If offline metrics improve but online metrics regress, suspect logging bias, calibration drift, or metric mismatch.
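A sketch of the segment-first check for the "CTR rises but CVR falls" pattern. The row schema (`segment`, `arm`, `impressions`, `clicks`, `conversions`) and the counts are hypothetical.

```python
def segment_rates(rows):
    # Aggregate counts per (segment, arm), then compute CTR and CVR.
    totals = {}
    for r in rows:
        agg = totals.setdefault((r["segment"], r["arm"]),
                                {"impressions": 0, "clicks": 0, "conversions": 0})
        for field in agg:
            agg[field] += r[field]
    return {
        key: {
            "ctr": v["clicks"] / v["impressions"] if v["impressions"] else 0.0,
            "cvr": v["conversions"] / v["clicks"] if v["clicks"] else 0.0,
        }
        for key, v in totals.items()
    }


def flag_low_intent(rates):
    # Segments where treatment CTR is higher but CVR is lower than control.
    flagged = []
    for seg in sorted({seg for seg, _ in rates}):
        ctrl = rates.get((seg, "control"))
        treat = rates.get((seg, "treatment"))
        if ctrl and treat and treat["ctr"] > ctrl["ctr"] and treat["cvr"] < ctrl["cvr"]:
            flagged.append(seg)
    return flagged


rows = [
    {"segment": "mobile", "arm": "control", "impressions": 1000, "clicks": 50, "conversions": 5},
    {"segment": "mobile", "arm": "treatment", "impressions": 1000, "clicks": 80, "conversions": 4},
    {"segment": "desktop", "arm": "control", "impressions": 1000, "clicks": 40, "conversions": 4},
    {"segment": "desktop", "arm": "treatment", "impressions": 1000, "clicks": 44, "conversions": 5},
]
flagged = flag_low_intent(segment_rates(rows))
print(flagged)
```

In this toy data only mobile is flagged, which would point the investigation at mobile creative or placement changes rather than at the model as a whole.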

Worked example

For “Propose an ads recommendation model for shop ads”, a strong answer starts by clarifying the business surface: are these ads in Feed, Marketplace, Search, or Shops, and is the goal purchases, shop visits, advertiser ROI, or total platform value? I would then state assumptions: we rank eligible shop ads for a user-session context, with labels for impressions, clicks, add-to-cart, purchases, and advertiser value. I would organize the answer around four pillars: objective definition, training data and labels, ranking and evaluation, and experimentation. For the objective, I would propose an expected utility score such as $\text{pCTR} \times \text{pCVR} \times \text{order value}$, adjusted by ad quality, user feedback, and advertiser constraints, rather than optimizing CTR alone. For modeling, I would describe user, shop, item, query/context, and historical interaction features, while noting that new shops need content/category priors and exploration traffic. For evaluation, I would use offline AUC, log loss, NDCG@K, calibration by segment, and counterfactual checks for position bias, then validate with an A/B test on purchase value, advertiser ROI, user engagement, and negative feedback. One tradeoff I would explicitly flag is short-term revenue versus long-term user trust: aggressive ranking can increase clicks while also increasing hides, low-quality purchases, or ad fatigue. I would close by saying that, with more time, I would add long-term value measurement, incrementality tests for conversions that would have happened anyway, and diagnostics comparing small and large advertisers.

A second angle

For “Design a hashtag recommender for News Feed”, the same ranking principles apply, but the objective is less directly monetized and more about content creation, discovery, and downstream engagement. The candidate set may come from post text, image/video understanding, trending hashtags, user interests, and creator history, while labels could include hashtag adoption, post reach, engagement quality, or creator retention. The main risk is not auction value but relevance, spam, trend chasing, and feedback loops where popular hashtags become even more dominant. Evaluation should include precision@K, adoption rate, post engagement, hide/report rates, and creator-level heterogeneity. Unlike shop ads, exploration may focus on emerging hashtags and niche communities rather than advertiser ROI.

Common pitfalls

Pitfall: Optimizing the easiest label instead of the right outcome.

A tempting answer is “train a model to maximize CTR and rank by predicted clicks.” That is incomplete for ads, shops, and recommendations because clicks can be low-quality, biased by position, or misaligned with conversions and satisfaction. A stronger answer distinguishes proxy metrics from North Star outcomes and adds guardrails like CVR, revenue, hide/report rate, retention, and calibration.

Pitfall: Turning a Data Scientist answer into an infrastructure design.

Avoid spending the answer on streaming systems, feature store plumbing, latency budgets, or deployment mechanics unless asked. For a DS interview, it lands better to say what signals you would analyze, how you would validate them, which offline and online metrics you would trust, and how you would diagnose bias or regression.

Pitfall: Naming algorithms without explaining evaluation and failure modes.

Saying “use collaborative filtering” or “use a neural network ranker” is not enough. The interviewer wants to hear when it fails: sparse interactions, cold start, popularity bias, non-stationary trends, delayed conversion labels, and offline-online mismatch. Pair every model suggestion with a measurement plan and at least one segment diagnostic.

Connections

Interviewers may pivot from ranking into causal inference, especially incrementality, counterfactual evaluation, inverse propensity weighting, or randomized holdouts. They may also test experimentation, including metric tradeoffs, heterogeneous treatment effects, guardrail design, and interference in marketplaces. Adjacent topics include calibration, cold-start analysis, long-term metric design, and bias/fairness in recommendation exposure.
