Candidate Generation, Ranking, And Feature Stores

What's being tested

These interviews test whether you can design a large-scale recommender ML system as an MLE: candidate generation, ranking, feature computation, model serving, evaluation, and deployment under latency constraints. Meta cares because surfaces like notifications, local recommendations, feed units, and nearby places require fast personalization from billions of possible items while preserving user experience and trust. The interviewer is probing whether you can separate retrieval from ranking, reason about online/offline feature parity, handle real-time context, and describe how you would measure and safely ship model improvements. Strong answers are not just “train a model”; they explain where features come from, how candidates are narrowed, how rankers are served, and how the system degrades when signals are missing or stale.

Core knowledge

Two-stage recommendation architecture is the default pattern: first retrieve hundreds or thousands of candidates cheaply, then score tens to hundreds with a heavier ranker. Retrieval optimizes recall under tight latency; ranking optimizes calibrated utility such as $P(\text{click})$, $P(\text{save})$, or expected long-term value.
Candidate generation usually combines multiple sources: collaborative filtering, embedding retrieval, popularity/trending lists, graph-based neighbors, geographic radius filters, and business-rule-safe fallbacks. For nearby places, sources might include “places within 2 km,” “friends visited,” “similar to places user saved,” and “currently popular nearby.”
Approximate nearest neighbor (ANN) search, built on techniques like HNSW graphs, IVF indexes, or product quantization, enables embedding retrieval at large scale. Exact nearest-neighbor search can be fine up to roughly millions of vectors in controlled settings, but at hundreds of millions or billions, libraries like `FAISS`, `ScaNN`, or `Annoy` are used to trade a small recall loss for major latency savings.
Ranking models should match serving constraints. `XGBoost`/GBDT-style models are strong baselines for tabular ranking; neural rankers can model richer user-item interactions but cost more at serving time. A common production pattern is: lightweight first-pass ranker, heavier second-pass ranker, then rule-based re-ranking for diversity, integrity, freshness, or notification fatigue.
Feature stores solve consistency between training and serving. The offline store supports point-in-time training examples; the online store serves low-latency feature lookups for inference. The MLE's responsibility is less pipeline plumbing than defining feature semantics, freshness requirements, defaults, and transformations, and ensuring that online values match the offline training view.
Point-in-time correctness prevents leakage. A training row for user $u$, item $i$, and timestamp $t$ must only use features known before $t$. A common bug is using “total clicks in the next 7 days” or a post-impression aggregate as a feature, which inflates offline `AUC` but collapses once the model is served online, where that information does not yet exist.
Feature freshness depends on signal type. Static features like category, language, and long-term user embeddings can refresh daily; session context, location, notification history, and recent interactions may need minute-level or request-time computation. Interviewers expect you to explicitly separate batch, near-real-time, and request-time features.
The online inference latency budget should be decomposed by stage. For example, candidate generation might get 30–50 ms, feature lookup 20 ms, ranking 20–40 ms, and re-ranking/filtering 5–10 ms under a `p99` target; run sequentially, those example budgets add up to roughly 75–120 ms end to end. If the model is too slow, use caching, precomputed embeddings, model distillation, feature pruning, or staged ranking.
Objective design often requires multi-task learning or weighted utility. A notification ranker may combine open probability, downstream engagement, hide/report probability, and fatigue cost:
$\text{score} = w_1 P(\text{open}) + w_2 E(\text{value}) - w_3 P(\text{negative feedback}) - w_4 \text{fatigue}.$
The exact weights are usually tuned through offline evaluation and online experiments.
Evaluation should include both offline and online layers. Offline metrics include `AUC`, `log loss`, `NDCG@K`, `Recall@K`, calibration plots, and slice metrics by geography/device/new users. Online metrics include click-through rate, saves, hides, notification disables, session depth, latency, and guardrails like crashes or integrity violations.
Cold start needs explicit handling. New users rely more on context, location, demographics if allowed, onboarding choices, and global popularity. New places/items rely on metadata, category, geospatial priors, creator quality, and exploration buckets rather than historical engagement.
Monitoring should cover model and data behavior. Track feature missingness, value distributions, embedding norms, candidate-source mix, score calibration, serving latency, fallback rate, and business guardrails. Feature drift and label drift are especially important when location patterns, seasonality, or notification policies change.
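
As one illustration of the monitoring item above, the sketch below compares a feature's training-time distribution with its serving-time distribution using the Population Stability Index (PSI). PSI is one common drift statistic chosen here for illustration, not something the guide prescribes; the function name `psi`, the equal-width binning, and the sample data are all assumptions.

```python
import math


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between a reference and a live sample.

    Assumes both samples are non-empty lists of numbers. Bins are
    equal-width over the range of the reference (training) sample.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant features

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)  # clamp out-of-range values
            counts[idx] += 1
        # Floor each fraction at eps so the log below is always defined.
        return [max(c / len(values), eps) for c in counts]

    e = bin_fractions(expected)
    a = bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical data: the serving distribution is shifted by +0.5.
training_sample = [i / 100 for i in range(100)]
serving_sample = [v + 0.5 for v in training_sample]
drift = psi(training_sample, serving_sample)
```

A value near zero means the two distributions match; the alert threshold to use is a product decision and would be tuned per feature.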

Worked example

For the prompt “Design Nearby and Notification Ranking,” a strong candidate would start by clarifying whether “nearby” means physical location recommendations, people nearby, or local events, and whether notifications are push notifications, in-app notifications, or both. They would ask about the primary objective: opens, meaningful actions after open, long-term engagement, or minimizing notification fatigue. Then they would declare assumptions: the user has granted location permission, ranking must run under a tight `p99` latency budget, and negative feedback such as hides or notification disables must be modeled.

The answer skeleton should have four pillars: candidate generation, feature design, ranking/serving, and evaluation/monitoring. Candidate generation would combine geospatial filters, user-place embeddings, social graph signals, and popularity/freshness candidates. Features would include distance, time of day, historical engagement with similar places, recent notification count, friend activity, place quality, and contextual availability. Ranking could use a multi-task model predicting open probability, downstream engagement, and negative feedback, with a final utility score penalizing fatigue and low-quality notifications.
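
The final utility score described above can be sketched as a weighted combination of the multi-task outputs. This is a minimal sketch under stated assumptions: the `Predictions` fields, the `WEIGHTS` values, and the unitless `fatigue` cost are hypothetical placeholders, and in practice the weights would come from offline evaluation and online experiments.

```python
from dataclasses import dataclass


@dataclass
class Predictions:
    p_open: float          # predicted probability of an open
    expected_value: float  # predicted downstream engagement value
    p_negative: float      # predicted probability of hide/report/disable
    fatigue: float         # hypothetical fatigue cost, e.g. from recent sends


# Hypothetical weights, for illustration only.
WEIGHTS = {"open": 1.0, "value": 0.5, "negative": 4.0, "fatigue": 0.3}


def utility(pred, weights=WEIGHTS):
    """score = w1*P(open) + w2*E(value) - w3*P(negative) - w4*fatigue."""
    return (
        weights["open"] * pred.p_open
        + weights["value"] * pred.expected_value
        - weights["negative"] * pred.p_negative
        - weights["fatigue"] * pred.fatigue
    )


def rank(candidates, weights=WEIGHTS):
    """Return candidate ids sorted by utility, best first."""
    return sorted(
        candidates,
        key=lambda cid: utility(candidates[cid], weights),
        reverse=True,
    )


candidates = {
    "friend_checkin": Predictions(0.30, 0.2, 0.01, 0.0),
    "generic_promo": Predictions(0.40, 0.2, 0.10, 1.0),
}
ordering = rank(candidates)
```

With these made-up numbers, the promo has the higher open probability but ranks second, because its negative-feedback and fatigue penalties outweigh the extra opens.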

One explicit tradeoff to flag is precompute versus request-time personalization: precomputing nearby candidates improves latency, but request-time location and freshness improve relevance. A good compromise is to precompute stable user/place embeddings and retrieve candidates online using current location and context. Close by saying: if there were more time, you would discuss exploration for underexposed places, privacy constraints around location, and how to run a staged rollout with calibration and fatigue guardrails.
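
The compromise above (precomputed embeddings, request-time location) might look roughly like the following. It is a brute-force sketch for readability: a production system would likely use a geospatial index and an ANN library rather than a linear scan, and the `places` record layout, the 2 km radius, and the function names are assumptions.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def retrieve(user_vec, user_lat, user_lon, places, radius_km=2.0, k=3):
    """Filter by request-time distance, then rank by precomputed embeddings."""
    scored = []
    for place in places:
        dist = haversine_km(user_lat, user_lon, place["lat"], place["lon"])
        if dist <= radius_km:
            # Dot product between precomputed user and place embeddings.
            sim = sum(u * v for u, v in zip(user_vec, place["vec"]))
            scored.append((sim, place["id"]))
    scored.sort(reverse=True)
    return [place_id for _, place_id in scored[:k]]


# Hypothetical places; "C" matches the user best but is about 11 km away.
places = [
    {"id": "A", "lat": 37.005, "lon": -122.0, "vec": [1.0, 0.0]},
    {"id": "B", "lat": 37.010, "lon": -122.0, "vec": [0.0, 1.0]},
    {"id": "C", "lat": 37.100, "lon": -122.0, "vec": [1.0, 0.0]},
]
nearby = retrieve([0.9, 0.1], 37.0, -122.0, places)
```

The embeddings can refresh in batch, while the distance filter always reflects the location sent with the request.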

A second angle

For the prompt “Design nearby place recommendations,” the same concepts apply, but the emphasis shifts from notification interruption cost to local relevance and geospatial retrieval. Candidate generation becomes more constrained by distance, open hours, category, safety filters, and place availability, while ranking can optimize saves, directions, check-ins, or profile views. Real-time context matters more: current location, day of week, weather or similar context if available, and whether the user is traveling versus near home. The feature store discussion should emphasize point-in-time historical place engagement and request-time features like distance and open status. Unlike notification ranking, the system may tolerate slightly more latency in an in-app surface, but it needs stronger diversity so the top results are not ten identical restaurants from the same chain.
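
One simple way to express the diversity requirement is a per-chain cap applied after ranking. The sketch below is a greedy rule-based re-ranker, not a claim about how any production system does it; the `chain` field, the cap of two, and the backfill behavior are assumptions.

```python
from collections import Counter


def diversify(ranked, max_per_chain=2, k=10):
    """Keep ranker order but cap how many results one chain can occupy.

    Items over the cap are demoted to the end rather than dropped, so
    the list can still be filled when there are few distinct chains.
    """
    seen = Counter()
    kept, overflow = [], []
    for item in ranked:
        # Independent places (no chain) are treated as their own chain.
        chain = item.get("chain") or item["id"]
        if seen[chain] < max_per_chain:
            seen[chain] += 1
            kept.append(item)
        else:
            overflow.append(item)
    return (kept + overflow)[:k]


# Hypothetical ranker output, best first.
ranked = [
    {"id": "x1", "chain": "X"},
    {"id": "x2", "chain": "X"},
    {"id": "x3", "chain": "X"},
    {"id": "x4", "chain": "X"},
    {"id": "indie", "chain": None},
    {"id": "y1", "chain": "Y"},
]
top4 = [item["id"] for item in diversify(ranked, max_per_chain=2, k=4)]
```

A cap is easy to explain and cheap to serve; score-based approaches that trade relevance against similarity are an alternative when a hard cap is too blunt.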

Common pitfalls

Pitfall: Treating ranking as a single model over all possible items.

A tempting but weak answer is “score every place or notification with a neural network and sort.” At Meta scale, the candidate space is too large, so you need retrieval layers, approximate search, source blending, filtering, and then ranking. Say explicitly how you reduce billions of possible items to thousands, then hundreds, then the final top `K`.
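
The narrowing funnel described above can be sketched as source blending followed by two ranking passes. Everything here is illustrative: the source names, the per-source quota, and the two scoring callables stand in for real retrieval sources and models.

```python
def blend_sources(sources, per_source_quota):
    """Merge candidate lists, de-duplicating and capping each source."""
    seen, merged = set(), []
    for name, items in sources.items():
        taken = 0
        for item in items:
            if taken >= per_source_quota:
                break
            if item not in seen:  # duplicates do not count against the quota
                seen.add(item)
                merged.append((item, name))
                taken += 1
    return merged


def funnel(sources, light_score, heavy_score, quota, n_light, k):
    """Blend sources, prune with a cheap scorer, then rank with a heavy one."""
    pool = [item for item, _ in blend_sources(sources, quota)]
    shortlist = sorted(pool, key=light_score, reverse=True)[:n_light]
    return sorted(shortlist, key=heavy_score, reverse=True)[:k]


# Hypothetical candidate ids from three retrieval sources.
sources = {
    "geo": [1, 2, 3, 4],
    "social": [3, 5, 6],
    "popular": [7, 1, 8],
}
top = funnel(
    sources,
    light_score=lambda item: item,              # stand-in for a cheap model
    heavy_score=lambda item: -abs(item - 6.4),  # stand-in for a heavy ranker
    quota=3,
    n_light=4,
    k=2,
)
```

Keeping the source name alongside each candidate makes it possible to monitor the candidate-source mix later.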

Pitfall: Ignoring online/offline feature mismatch.

Many candidates describe rich training features but never explain whether those values are available at serving time. A better answer names feature freshness, default values, point-in-time joins, and monitoring for missingness or drift. If a feature cannot be computed before the ranking request, it should not be in the online model.
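
A point-in-time join with an explicit default can be sketched as follows. The idea is that a training row at time $t$ only counts events strictly before $t$, and that a user with no history gets the same default the online store would serve. The tuple layout, the `recent_clicks` feature name, and the integer timestamps (in days) are assumptions.

```python
from bisect import bisect_left


def clicks_before(click_times, t, window):
    """Count clicks in [t - window, t); click_times must be sorted."""
    lo = bisect_left(click_times, t - window)
    hi = bisect_left(click_times, t)  # excludes clicks at or after t
    return hi - lo


def build_training_rows(impressions, clicks_by_user, window=7, default=0):
    """Attach a leakage-free recent-click count to each impression."""
    rows = []
    for user, t, label in impressions:
        times = clicks_by_user.get(user)
        # Use the same default offline that serving uses for missing users.
        feature = clicks_before(times, t, window) if times is not None else default
        rows.append({"user": user, "t": t, "recent_clicks": feature, "label": label})
    return rows


# Hypothetical logs: click timestamps per user, and (user, time, label) rows.
clicks_by_user = {"u1": [1, 3, 10, 12]}
impressions = [("u1", 10, 1), ("u1", 4, 0), ("u2", 5, 0)]
rows = build_training_rows(impressions, clicks_by_user)
```

For the impression at time 10, the click at time 10 is excluded because it may be the label itself, and the click at time 12 is in the future.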

Pitfall: Optimizing only click-through rate.

For recommendations and notifications, higher `CTR` can harm users if it increases spam, fatigue, hides, or low-quality engagement. A stronger answer frames ranking as utility optimization with guardrails: engagement, negative feedback, calibration, latency, fairness/slice performance, and long-term retention proxies where appropriate.

Connections

Interviewers may pivot from this topic into embedding-based retrieval, learning-to-rank, real-time feature serving, A/B testing for recommender systems, or model monitoring and drift detection. They may also ask how you would debug a drop in `NDCG@K`, a spike in feature missingness, or a mismatch between offline `AUC` gains and flat online metrics.
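
When debugging a drop in `NDCG@K`, it helps to be able to recompute the metric by hand on a slice. The sketch below uses the linear-gain form of DCG; another common variant uses $2^{rel} - 1$ as the gain, so which one applies depends on the evaluation pipeline in use. Function names are illustrative.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k positions (linear gain)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance labels in the order the system ranked the items.
perfect = ndcg_at_k([3, 2, 1], k=3)
inverted = ndcg_at_k([1, 2, 3], k=3)
```

Comparing the metric across slices such as geography, device, or new users usually narrows down where a regression comes from.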
