Feed And News Feed Ranking

What's being tested

Meta feed ranking interviews test whether a Data Scientist can reason about a personalized ranking system through metrics, causal inference, experimentation, and SQL-based product analytics. The interviewer is probing whether you can distinguish “users clicked more” from “the system caused better long-term user value,” especially when posts differ by source, position, user intent, and prior engagement. Meta cares because small ranking changes can shift billions of impressions across friends, groups, creators, ads, and unconnected recommendations, with large effects on `DAU`, retention, revenue, creator distribution, and user trust.

Core knowledge

Feed ranking is a multi-objective optimization problem: candidate posts are scored by a predicted value such as $E[\text{utility}] = w_1p(\text{click}) + w_2p(\text{comment}) + w_3p(\text{share}) - w_4p(\text{hide})$, where the weights $w_k$ encode how much each predicted action is worth. A DS should critique the objective, the calibration of each predicted probability, the tradeoffs implied by the weights, and alignment with online metrics.
Position bias is central. Higher-ranked posts get more impressions, attention, and clicks regardless of intrinsic quality. Comparing friend posts to unconnected posts without controlling for `feed_position`, user, session, freshness, and candidate pool quality will usually overstate the winner.
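Counterfactual evaluation asks: what would the user have done if a different item had been ranked there? Useful approaches include randomized experiments, interleaving, holdouts, inverse propensity weighting $E[Y] \approx \frac{1}{n}\sum_i \frac{T_iY_i}{p_i}$ (where $T_i$ indicates that the item or policy of interest was actually shown and $p_i$ is the logged probability of showing it), and regression adjustment. Each relies on assumptions about logging and overlap: the weighting estimate, for example, needs accurate logged propensities and $p_i > 0$ wherever the target policy would show the item.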
Experiment design for ranking changes usually randomizes at the user level, not impression level, to avoid a user seeing inconsistent ranking policies within one session. Watch for network interference: one user’s feed treatment can affect friends’ posting, reactions, notifications, and downstream engagement.
Primary metrics should match the product hypothesis. For feed quality, good candidates include `D1/D7 retention`, sessions per user, meaningful interactions, hides, reports, negative feedback, long-clicks, comments, reshares, survey quality, and “socialness” measures such as friend interactions per session.
Guardrail metrics prevent local optimization. For unconnected content or ads, track friend content consumption, creator ecosystem distribution, `CTR`, `CVR`, ad load, hide/report rate, integrity prevalence, latency proxies if available, and revenue metrics like `ARPDAU` without making revenue the only objective.
Short-term engagement can conflict with long-term retention. A sensational post may increase clicks today but reduce feed trust. Meta-style DS answers should separate immediate actions, session-level satisfaction, and longer-term outcomes like `D7`, `D28`, return sessions, or survey-based quality.
Heterogeneous treatment effects matter more than global averages. Segment by new versus tenured users, heavy versus light feed consumers, friend graph density, market, age bucket if appropriate, content inventory, session depth, and prior tolerance for recommendations or ads.
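User fixed effects are useful in observational analyses because they compare outcomes within the same user across exposure conditions: $Y_{ui} = \alpha_u + \beta T_{ui} + \gamma X_{ui} + \epsilon_{ui}$, where $\alpha_u$ is a per-user intercept, $T_{ui}$ is the exposure indicator for user $u$ on observation $i$, and $X_{ui}$ are observation-level controls such as position or freshness. They reduce stable user-level confounding but do not fix time-varying selection or ranker-driven exposure bias.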
SQL analytics for feed questions often requires sessionization, impression-click joins, deduplication, and cohort retention. Common patterns include `COUNT(DISTINCT post_id)`, position bins via `CASE`, `ROW_NUMBER()` for first impression or first click, and cohort joins from install/exposure date to later activity dates.
Offline ranking metrics such as `AUC`, `NDCG@K`, `MAP`, calibration error, and log loss are useful diagnostics, not launch criteria. They can be biased by historical ranking logs because the model only observes labels for content the old policy chose to show.
Sample ratio mismatch, novelty effects, and multiple testing are common in feed experiments. Before interpreting lifts, check assignment balance, exposure rates, logging consistency, pre-period covariates, triggered-user definitions, and whether many slices were searched without correction such as Benjamini-Hochberg or Bonferroni.
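The inverse propensity weighting estimate listed above, $\frac{1}{n}\sum_i T_iY_i/p_i$, can be sketched in a few lines. This is a minimal illustration, assuming each logged record carries a shown flag, an outcome, and the logging propensity; the function name, the record layout, and the `min_propensity` clipping floor are assumptions made for this example, not part of any real logging schema.

```python
def ipw_estimate(records, min_propensity=0.01):
    """Estimate the mean outcome had the item always been shown.

    records: list of (shown, outcome, propensity) tuples, where shown is 0 or 1
    and propensity is the logged probability that the item was shown.
    """
    if not records:
        raise ValueError("need at least one logged record")
    total = 0.0
    for shown, outcome, propensity in records:
        if not 0.0 < propensity <= 1.0:
            raise ValueError("propensity must be in (0, 1]")
        # Clipping tiny propensities trades some bias for lower variance.
        clipped = max(propensity, min_propensity)
        total += shown * outcome / clipped
    return total / len(records)


# Toy log, invented for illustration: (shown, outcome, propensity).
toy_log = [(1, 1, 0.5), (0, 0, 0.5), (1, 0, 0.25), (0, 0, 0.25)]
estimate = ipw_estimate(toy_log)
```

Only shown records contribute to the sum, and each is up-weighted by how unlikely it was to be shown. In an interview it is worth saying aloud that the estimate is only as good as the logged propensities and the overlap between the logging policy and the target policy.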
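The user fixed-effects idea above can be illustrated with a within-user (demeaned) estimate of $\beta$. This sketch assumes a single binary exposure and deliberately omits the controls $X_{ui}$ to stay short; the row layout and the toy numbers are invented for illustration.

```python
from collections import defaultdict


def naive_difference(rows):
    """Pooled mean outcome of exposed minus unexposed observations."""
    exposed = [y for _, t, y in rows if t == 1]
    unexposed = [y for _, t, y in rows if t == 0]
    return sum(exposed) / len(exposed) - sum(unexposed) / len(unexposed)


def within_user_effect(rows):
    """Slope of outcome on exposure after subtracting each user's own means."""
    by_user = defaultdict(list)
    for user_id, treated, outcome in rows:
        by_user[user_id].append((treated, outcome))
    numerator = denominator = 0.0
    for observations in by_user.values():
        t_mean = sum(t for t, _ in observations) / len(observations)
        y_mean = sum(y for _, y in observations) / len(observations)
        for t, y in observations:
            numerator += (t - t_mean) * (y - y_mean)
            denominator += (t - t_mean) ** 2
    if denominator == 0:
        raise ValueError("no within-user variation in exposure")
    return numerator / denominator


# Toy rows (user_id, exposed, outcome): user "a" is a heavy engager who is
# also exposed more often, which inflates the pooled comparison.
TOY_ROWS = [
    ("a", 1, 10), ("a", 1, 10), ("a", 0, 9),
    ("b", 0, 2), ("b", 0, 2), ("b", 1, 3),
]
naive = naive_difference(TOY_ROWS)
within = within_user_effect(TOY_ROWS)
```

On this toy data the pooled difference is much larger than the within-user estimate, because the pooled comparison mixes "who gets exposed" with "what exposure does". The within-user estimate still inherits any time-varying selection, as noted above.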
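Two of the checks named above, sample ratio mismatch and multiple-testing correction, are simple enough to sketch directly. The block below assumes a two-arm test with a planned 50/50 split (the `expected_treatment_share` parameter is an assumption to adjust) and uses a one-degree-of-freedom chi-square test plus the Benjamini-Hochberg step-up procedure; the function names are illustrative.

```python
import math


def srm_p_value(n_control, n_treatment, expected_treatment_share=0.5):
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for assignment counts."""
    total = n_control + n_treatment
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi2 = ((n_control - expected_control) ** 2 / expected_control
            + (n_treatment - expected_treatment) ** 2 / expected_treatment)
    # For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per hypothesis: True if rejected at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_passing_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * alpha:
            largest_passing_rank = rank
    rejected = [False] * m
    # Reject every hypothesis ranked at or below the largest passing rank.
    for rank, index in enumerate(order, start=1):
        if rank <= largest_passing_rank:
            rejected[index] = True
    return rejected


balanced = srm_p_value(10000, 10050)      # small imbalance, plausible by chance
suspicious = srm_p_value(50000, 51000)    # imbalance worth investigating
slice_decisions = benjamini_hochberg([0.001, 0.04, 0.03, 0.20])
```

A very small SRM p-value suggests an assignment or logging problem, so lifts should not be interpreted until it is explained. The Benjamini-Hochberg result shows how slice-level findings that look significant one at a time can fail to survive correction.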

Worked example

For “Prove friends outperform unconnected; design experiments and metrics”, a strong candidate would first clarify what “outperform” means: higher engagement, more meaningful social interactions, better retention, fewer hides, or higher survey quality. They would also ask whether the comparison is between content sources holding rank position fixed, or between full ranking policies that allocate more feed slots to friends versus unconnected content. The answer should be organized around four pillars: define metrics, establish an observational baseline, design a randomized experiment, and interpret heterogeneous effects with guardrails.

The candidate might propose a user-level A/B test where treatment increases the share or score weight of friend posts, while control keeps the existing ranking policy. The primary metric could be `D7 retention` or meaningful friend interactions per feed session, with secondary metrics like `CTR`, comments, hides, reports, session depth, and creator/unconnected engagement loss. A key tradeoff to flag is that friend posts and unconnected posts differ in inventory: some users have sparse friend graphs, so a blanket boost may help highly connected users but harm users with little fresh friend content. The candidate should also explain why a naïve observational comparison of average reactions per post is confounded by position, personalization, and selection into exposure. A good close would be: “If I had more time, I’d add long-term holdouts, survey quality, and segment analysis by friend graph density to avoid overgeneralizing the average treatment effect.”
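Because the proposed test randomizes at the user level and `D7 retention` is one binary outcome per user, the readout can be sketched as a two-proportion z-test. This is a minimal sketch with invented counts; the function name is hypothetical, and ratio metrics such as interactions per session would need a different variance treatment (for example, aggregating to the user level first).

```python
import math


def retention_lift(retained_control, n_control, retained_treatment, n_treatment):
    """Return (absolute lift, z statistic, two-sided p-value) for a retention rate."""
    rate_control = retained_control / n_control
    rate_treatment = retained_treatment / n_treatment
    # Pooled rate under the null hypothesis of no difference between arms.
    pooled = (retained_control + retained_treatment) / (n_control + n_treatment)
    standard_error = math.sqrt(
        pooled * (1 - pooled) * (1 / n_control + 1 / n_treatment)
    )
    z = (rate_treatment - rate_control) / standard_error
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return rate_treatment - rate_control, z, p_value


# Invented counts: 10,000 users per arm, 40.0% vs 42.0% retained on day 7.
lift, z_stat, p_val = retention_lift(4000, 10000, 4200, 10000)
```

A statistically clear lift on the primary metric is only the start of the launch discussion; the guardrails and segment cuts described above still decide whether the boost ships.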

A second angle

For “Compute feed ad frequency and retention in SQL”, the same core concept becomes metric construction rather than experiment design. The candidate still needs to connect feed exposure to downstream outcomes, but now the task is to correctly aggregate impressions, sessions, clicks, and retained users before making any causal claim. The key framing is: define ad frequency per user-session, join clicks to impressions without double counting, bin by feed position, and compute cohort retention from a clean exposure date. The statistical issue remains confounding: heavier users naturally see more ads and are also more likely to return, so the candidate should mention stratification, user fixed effects, or experiment assignment if asked to interpret frequency effects causally. This shows that even SQL-heavy feed questions are testing product analytics judgment, not just syntax.
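The metric construction described above can be sketched end to end with an in-memory SQLite database. Everything here is an assumption made for illustration: the table and column names (`impressions`, `clicks`, `activity`), the toy rows, the position bins, and the definition of D7 retention as "active exactly seven days after first exposure".

```python
import sqlite3

# Hypothetical schema and toy rows, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE impressions (impression_id INTEGER, user_id INTEGER, session_id TEXT,
                          is_ad INTEGER, feed_position INTEGER, event_date TEXT);
CREATE TABLE clicks (impression_id INTEGER);
CREATE TABLE activity (user_id INTEGER, activity_date TEXT);
INSERT INTO impressions VALUES
  (1, 1, 's1', 0, 1, '2024-01-01'), (2, 1, 's1', 1, 2, '2024-01-01'),
  (3, 1, 's1', 0, 3, '2024-01-01'), (4, 1, 's1', 1, 4, '2024-01-01'),
  (5, 2, 's2', 0, 1, '2024-01-01'), (6, 2, 's2', 1, 2, '2024-01-01'),
  (7, 2, 's2', 0, 3, '2024-01-01'), (8, 2, 's2', 0, 4, '2024-01-01');
INSERT INTO clicks VALUES (2), (2), (6);  -- the click on impression 2 is logged twice
INSERT INTO activity VALUES (1, '2024-01-08'), (2, '2024-01-03');
""")

# 1) Ad frequency per user-session: ad impressions over all impressions.
ad_frequency = conn.execute("""
SELECT user_id, session_id,
       SUM(is_ad) AS ad_impressions,
       COUNT(*) AS impressions,
       1.0 * SUM(is_ad) / COUNT(*) AS ad_share
FROM impressions
GROUP BY user_id, session_id
ORDER BY user_id
""").fetchall()

# 2) Clicked ad impressions by position bin; DISTINCT removes duplicate click logs
#    so the join cannot inflate the impression count.
ad_clicks_by_position = conn.execute("""
WITH clicked AS (SELECT DISTINCT impression_id FROM clicks)
SELECT CASE WHEN i.feed_position <= 2 THEN 'top_1_2' ELSE 'pos_3_plus' END AS position_bin,
       COUNT(*) AS ad_impressions,
       COUNT(c.impression_id) AS clicked_impressions
FROM impressions i
LEFT JOIN clicked c ON c.impression_id = i.impression_id
WHERE i.is_ad = 1
GROUP BY position_bin
ORDER BY position_bin
""").fetchall()

# 3) D7 retention from each user's first exposure date.
d7_retention = conn.execute("""
WITH first_exposure AS (
  SELECT user_id, MIN(event_date) AS exposure_date
  FROM impressions
  GROUP BY user_id
)
SELECT COUNT(DISTINCT f.user_id) AS cohort_users,
       COUNT(DISTINCT a.user_id) AS d7_retained
FROM first_exposure f
LEFT JOIN activity a
  ON a.user_id = f.user_id
 AND a.activity_date = date(f.exposure_date, '+7 days')
""").fetchone()
```

Joining `impressions` directly to the raw `clicks` table would count impression 2 twice, which is the double-counting trap the prompt is testing. These queries only describe exposure and retention; reading the relationship between them causally still requires the stratification, fixed effects, or experiment assignment mentioned above.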

Common pitfalls

Pitfall: Treating engagement as an unbiased measure of content quality.

A tempting answer is “friend posts are better because they get more comments” or “unconnected posts are better because they get more clicks.” That misses position bias, source selection, and intent differences; a stronger answer says what controls or randomization are needed before interpreting the metric.
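A small made-up example shows how position alone can flip this conclusion. The counts below are invented so that friend posts are mostly shown in top slots; the sketch compares the pooled engagement rate per source with the rate within each position bin.

```python
from collections import defaultdict

# Made-up aggregates: (source, position_bin, impressions, engagements).
ROWS = [
    ("friend", "top", 900, 90),
    ("friend", "lower", 100, 2),
    ("unconnected", "top", 100, 12),
    ("unconnected", "lower", 900, 27),
]


def rate_by(rows, key_fn):
    """Engagement rate grouped by whatever key key_fn builds from a row."""
    impressions = defaultdict(int)
    engagements = defaultdict(int)
    for source, position_bin, n_impressions, n_engagements in rows:
        key = key_fn(source, position_bin)
        impressions[key] += n_impressions
        engagements[key] += n_engagements
    return {key: engagements[key] / impressions[key] for key in impressions}


# Pooled comparison ignores where each source was shown.
naive = rate_by(ROWS, lambda source, position_bin: source)
# Stratified comparison holds the position bin fixed.
stratified = rate_by(ROWS, lambda source, position_bin: (source, position_bin))
```

In these toy numbers friend posts win the pooled comparison, yet unconnected posts have the higher rate within both position bins. Stratifying is only a partial control, since user, session, and freshness still differ, which is why randomization remains the stronger evidence.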

Pitfall: Jumping straight to model features or ranking architecture.

For a Data Scientist interview, do not spend most of the answer on feature stores, serving latency, candidate generation infrastructure, or deep model architecture. It is fine to mention predicted click or predicted hide probabilities, but the emphasis should be on metric validity, experiment design, causal interpretation, and launch decisioning.

Pitfall: Reporting only the average treatment effect.

Feed changes often redistribute value across users and content types. A +0.3% lift in `DAU` can hide harm to new users, users with sparse friend graphs, or users exposed to more low-quality recommendations; always include segmentation, guardrails, and a plan for diagnosing tradeoffs.
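The point about averages hiding harm can be made concrete with a toy readout. The segment names, arm sizes, and counts below are invented for illustration; the sketch computes the lift in active rate overall and per segment.

```python
# Made-up counts per (segment, arm): (users, active_users). Illustrative only.
COUNTS = {
    ("dense_graph", "control"): (8000, 6400),
    ("dense_graph", "treatment"): (8000, 6520),
    ("sparse_graph", "control"): (2000, 1000),
    ("sparse_graph", "treatment"): (2000, 940),
}


def active_rate(counts, arm, segment=None):
    """Share of active users in one arm, optionally restricted to one segment."""
    users = active = 0
    for (row_segment, row_arm), (n_users, n_active) in counts.items():
        if row_arm == arm and (segment is None or row_segment == segment):
            users += n_users
            active += n_active
    return active / users


def lift(counts, segment=None):
    """Treatment minus control active rate, overall or within a segment."""
    return (active_rate(counts, "treatment", segment)
            - active_rate(counts, "control", segment))


overall_lift = lift(COUNTS)
dense_lift = lift(COUNTS, "dense_graph")
sparse_lift = lift(COUNTS, "sparse_graph")
```

Here the overall lift is positive while the sparse-graph segment declines, because the larger segment dominates the average. Segment cuts like this should be planned before the readout so they are not mistaken for post hoc slice hunting.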

Connections

Interviewers often pivot from feed ranking into A/B testing, causal inference, SQL cohort analysis, metric design, or recommender-system evaluation. Be prepared to discuss observational versus experimental evidence, position-bias correction, retention cohorts, and why offline ranking gains may fail to translate into online product impact.
