Recommendation Systems And Ranking
Asked of: Machine Learning Engineer
What's being tested
Candidates must demonstrate practical design and engineering judgment for building low-latency, scalable recommendation and ranking pipelines: structuring candidate generation and reranking, ensuring online/offline feature parity, and handling feedback-driven learning (exploration/exploitation). Interviewers probe your ability to balance latency, freshness, and model quality while describing training/serving pipelines, evaluation metrics, and deployment/monitoring for iterative improvement. Expect to justify tradeoffs (batch vs. online learning, ANN memory vs. recall, exploration budget) from an MLE operational viewpoint.
Core knowledge
- Two-stage architecture: candidate generation narrows the corpus from millions or billions of items to hundreds or thousands using heuristics and embedding nearest-neighbor search; the reranker then scores those candidates with rich features to produce the final, personalized ordering.
- Feature freshness & consistency: use a feature store with both online (low-latency) and offline (batch) views; ensure training uses the same feature transformations as serving to avoid training–serving skew.
- Embeddings & ANN: learn user/item embeddings (e.g., matrix factorization or a two-tower model trained with SGD); use Faiss with product quantization for approximate nearest-neighbor (ANN) search at scale. Quantized indexes can serve hundreds of millions to billions of vectors, trading some recall for memory.
- Latency budgets: set a p99 SLO (e.g., <100ms for web, <20ms for mobile SDKs); push heavy computation offline or into the candidate stage (for example, precomputed embeddings); the reranker should cost microseconds to low milliseconds per candidate.
- Losses & objectives: choose an objective aligned with the business metric: binary cross-entropy for CTR prediction, pairwise losses or LambdaRank for ranking (LambdaRank targets NDCG). For multiple objectives, scalarize them or use constrained optimization (Lagrangian) to trade off watch time against CTR.
- Cold-start: combine content-based features (metadata, category, text embeddings) with popularity/recency heuristics; use meta-learning or warm-start embeddings from side features.
- Feedback-driven learning: apply contextual bandits to balance exploration and exploitation; evaluate policies offline with Inverse Propensity Scoring (IPS), where the weight is w = π(a|x)/π0(a|x) for target policy π and logging policy π0, and prefer doubly robust estimators to reduce variance.
- Offline evaluation vs. online metrics: use NDCG@k, AUC, and calibration checks offline; validate with online metrics such as CTR, session duration, and retention. Be explicit about any mismatch between the training loss and the metric you actually care about.
- Negative sampling & label delay: for implicit feedback, design negative sampling carefully and account for delayed labels (e.g., conversions) so that examples whose label has not arrived yet are not treated as negatives; consider censoring or survival analysis if delays are long.
- Online learning & deployment: support incremental updates, periodic full retrains, and online updates for embeddings or shallow layers; use shadow serving and canary rollouts to validate model behavior before ramp.
- Drift detection & monitoring: monitor feature distributions (e.g., KL divergence against a reference window), the model's output distribution, and online metric shifts; automate alerts for upstream feed changes and holiday or seasonal shifts in features.
- Privacy & fairness constraints: incorporate constraints (e.g., exposure caps) into ranking via post-processing (re-ranking) or constrained optimization, and log provenance for audits.
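As a concrete reference for the NDCG@k metric named in the list above, here is a minimal sketch using the common exponential-gain form (gain 2^rel - 1, discount log2(rank + 1)). The graded labels (0 = skip, 1 = click, 2 = long watch) are hypothetical and only illustrate the calculation.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k graded relevances."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG of the given ordering divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical graded labels, listed in the order the model ranked the items.
ranked_labels = [2, 0, 1, 0, 0]
observed = ndcg_at_k(ranked_labels, k=5)
perfect = ndcg_at_k(sorted(ranked_labels, reverse=True), k=5)
print(f"NDCG@5 observed={observed:.3f} perfect={perfect:.3f}")
```

Because the metric depends only on the ordering, a model can lower its cross-entropy loss without changing NDCG@k at all, which is one face of the loss/metric mismatch mentioned above.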
Worked example — "Design a real-time recommendation system"
First 30s: ask about traffic, latency, and memory SLOs, scale (DAU and item count), the acceptable exploration budget, and the business objective (CTR, watch time, retention). State assumptions: 100M items, 50M DAU, p99 latency of 100ms.
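The stated assumptions can be turned into rough sizing numbers before any design work. The sketch below is back-of-envelope arithmetic only: the embedding dimension, product-quantization code size, requests per user, and peak factor are illustrative assumptions, not figures from the prompt.

```python
def index_memory_gb(num_vectors, bytes_per_vector):
    """Raw vector storage in GB (1 GB = 1e9 bytes), ignoring index overhead."""
    return num_vectors * bytes_per_vector / 1e9

NUM_ITEMS = 100_000_000          # from the stated assumptions
DAU = 50_000_000                 # from the stated assumptions
EMBED_DIM = 128                  # assumption: embedding dimension
PQ_CODE_BYTES = 64               # assumption: bytes per quantized vector
REQUESTS_PER_USER_PER_DAY = 20   # assumption
PEAK_TO_AVERAGE = 3              # assumption: peak traffic multiplier

float32_gb = index_memory_gb(NUM_ITEMS, EMBED_DIM * 4)
pq_gb = index_memory_gb(NUM_ITEMS, PQ_CODE_BYTES)
avg_qps = DAU * REQUESTS_PER_USER_PER_DAY / 86_400
peak_qps = avg_qps * PEAK_TO_AVERAGE

print(f"float32 index: {float32_gb:.1f} GB, PQ index: {pq_gb:.1f} GB")
print(f"average QPS: {avg_qps:,.0f}, peak QPS: {peak_qps:,.0f}")
```

Under these assumptions the uncompressed embeddings need about 51 GB while the quantized codes need about 6 GB, which is the kind of number that motivates the ANN memory vs. recall discussion.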
Skeleton answer pillars: (1) candidate generation (ANN on learned embeddings + time-decayed popularity and content filters), (2) feature-enriched reranker (gradient-boosted trees or small transformer using user/item/context features, cross-features), (3) training & feature pipeline (offline feature store, periodic retrain, online features via fast key-value store), (4) serving & rollout (low-latency microservice, shadow testing, canary).
Flag a tradeoff: ANN recall vs. latency. Raising recall (e.g., a larger probe count) surfaces more of the truly relevant candidates but increases p99 latency; a hybrid of static popularity candidates plus ANN top-K is a reasonable default. Close with next steps: if more time, detail data schemas for features, show an offline simulation of the policy with IPS, and design experiment allocation for safe exploration.
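The first two pillars can be sketched end to end in a few lines. This is a toy illustration under stated assumptions: exact dot-product search stands in for an ANN index, the data is random, and `toy_score` is a hypothetical hand-written scorer standing in for a trained reranker.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def generate_candidates(user_vec, item_vecs, popularity, k_ann, k_pop):
    """Union of top-k by embedding similarity and top-k by popularity.

    Exact search is used here; a production system would query an ANN index.
    """
    by_similarity = sorted(item_vecs, key=lambda i: dot(user_vec, item_vecs[i]), reverse=True)
    by_popularity = sorted(popularity, key=popularity.get, reverse=True)
    candidates = []
    for item_id in by_similarity[:k_ann] + by_popularity[:k_pop]:
        if item_id not in candidates:  # de-duplicate, keep first occurrence
            candidates.append(item_id)
    return candidates

def rerank(candidates, score_fn, top_n):
    """Score every candidate with the (richer) scorer and keep the best top_n."""
    return sorted(candidates, key=score_fn, reverse=True)[:top_n]

# Synthetic data, for illustration only.
rng = random.Random(0)
item_vecs = {i: [rng.gauss(0, 1) for _ in range(8)] for i in range(1000)}
popularity = {i: rng.random() for i in item_vecs}
age_hours = {i: rng.uniform(0, 72) for i in item_vecs}
user_vec = [rng.gauss(0, 1) for _ in range(8)]

def toy_score(item_id):
    # Hypothetical scorer: similarity plus popularity, minus a staleness penalty.
    return dot(user_vec, item_vecs[item_id]) + 0.5 * popularity[item_id] - 0.01 * age_hours[item_id]

candidates = generate_candidates(user_vec, item_vecs, popularity, k_ann=50, k_pop=10)
final = rerank(candidates, toy_score, top_n=10)
print(len(candidates), final)
```

The point of the split is cost: the cheap stage touches every item, while the expensive scorer only runs on the few dozen candidates that survive.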
A second angle — "Design feedback-driven recommender"
This framing emphasizes online learning and exploration. Start by specifying the feedback-loop latency and what counts as reward (click, normalized watch time). Propose a contextual bandit layer on top of the baseline recommender to allocate the exploration budget, instrument propensity logging for IPS/DR evaluation, and use Thompson Sampling or ε-greedy with a decaying rate for initial exploration. Operational concerns: log full contexts and chosen-action propensities to enable unbiased offline evaluation; avoid catastrophic policy updates by constraining how much the policy can change per rollout. The core concepts (candidate/reranker separation at serving time, feature parity, monitoring) are the same, but the priority shifts to safe online experimentation and reliable propensity bookkeeping.
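To make the propensity bookkeeping concrete, the following sketch simulates an ε-greedy logging policy that records the probability of each chosen action, then uses IPS to estimate the value of a different target policy from those logs. Everything here is synthetic: the click probabilities in `TRUE_CTR`, the two-context setup, and both policies are assumptions chosen so the estimate can be compared with a known true value.

```python
import random

ACTIONS = [0, 1, 2]
EPSILON = 0.3
# Hypothetical ground-truth click probabilities, indexed [context][action].
TRUE_CTR = {0: [0.10, 0.30, 0.05], 1: [0.20, 0.10, 0.40]}

def baseline_action(context):
    """Production policy that the logging policy explores around."""
    return 0

def logging_propensity(context, action):
    """Probability that epsilon-greedy logging picks `action` in `context`."""
    greedy = baseline_action(context)
    return (1 - EPSILON if action == greedy else 0.0) + EPSILON / len(ACTIONS)

def log_interactions(n, rng):
    """Simulate n logged rounds as (context, action, reward, propensity)."""
    logs = []
    for _ in range(n):
        context = rng.choice([0, 1])
        if rng.random() < EPSILON:
            action = rng.choice(ACTIONS)
        else:
            action = baseline_action(context)
        reward = 1.0 if rng.random() < TRUE_CTR[context][action] else 0.0
        logs.append((context, action, reward, logging_propensity(context, action)))
    return logs

def target_policy(context):
    """Deterministic candidate policy to evaluate offline."""
    return 1 if context == 0 else 2

def ips_estimate(logs, policy):
    """Average of w * reward with w = pi(a|x) / pi0(a|x) for a deterministic pi."""
    total = 0.0
    for context, action, reward, propensity in logs:
        weight = (1.0 if action == policy(context) else 0.0) / propensity
        total += weight * reward
    return total / len(logs)

logs = log_interactions(100_000, random.Random(7))
estimate = ips_estimate(logs, target_policy)
true_value = 0.5 * TRUE_CTR[0][1] + 0.5 * TRUE_CTR[1][2]
print(f"IPS estimate={estimate:.3f} true value={true_value:.3f}")
```

Note that the explored actions carry a weight of 10 here (propensity 0.1), which is why IPS variance grows quickly as the exploration rate shrinks; that is the motivation for weight clipping or doubly robust estimators.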
Common pitfalls
Pitfall: Assuming offline metric improvement (e.g., lower training loss) directly transfers to online business metrics.
Many candidates optimize surrogate losses without addressing offline–online mismatch; explicitly discuss proxy-metric limitations and plan for online validation (shadow, canary).
Pitfall: Ignoring training–serving skew from missing/late features.
A tempting short answer is "use all signals"; stronger answers specify fallback features, imputation strategies, and tests that simulate production missingness during offline training.
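One way to test for production missingness offline is to inject it into the training data. The sketch below drops features at configurable rates, imputes a fallback, and adds a missing-indicator column; the feature names, rates, and fallback values are hypothetical and would come from observed serving logs in practice.

```python
import random

def simulate_missingness(rows, missing_rates, fallbacks, rng):
    """Randomly drop features at the given rates, then impute a fallback.

    Each output row keeps the feature value (or its fallback) and gains a
    `<name>_missing` indicator so the model can learn from absence.
    """
    out = []
    for row in rows:
        new_row = {}
        for name, value in row.items():
            dropped = rng.random() < missing_rates.get(name, 0.0)
            new_row[name] = fallbacks[name] if dropped else value
            new_row[name + "_missing"] = 1 if dropped else 0
        out.append(new_row)
    return out

# Hypothetical features: a real-time counter that often arrives late,
# and a batch feature that is always present.
rng = random.Random(42)
rows = [{"clicks_last_hour": rng.randint(1, 20), "item_age_days": rng.randint(0, 365)}
        for _ in range(10_000)]
missing_rates = {"clicks_last_hour": 0.2}   # assumed production missing rate
fallbacks = {"clicks_last_hour": 0}         # assumed fallback value

train_rows = simulate_missingness(rows, missing_rates, fallbacks, rng)
observed_rate = sum(r["clicks_last_hour_missing"] for r in train_rows) / len(train_rows)
print(f"simulated missing rate: {observed_rate:.3f}")
```

Training and evaluating on `train_rows` instead of the clean rows gives an offline estimate of how much quality degrades when the real-time feature is late.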
Pitfall: Over-explaining infra details (e.g., Kafka partitions) instead of ML decisions.
Focus on model behavior, feature freshness, exploration policy, and monitoring; mention infrastructure only to justify feasibility and latencies.
Connections
Interviewers may pivot to causal inference (long-term value estimation and off-policy evaluation), CTR calibration and uplift modeling, or MLE topics like feature-store architecture and continuous deployment patterns (shadow traffic, canaries).
Further reading
- Recommender Systems Handbook (chapter on scalable training/serving): comprehensive systems and algorithms overview.
- Agarwal et al., “Counterfactual Evaluation for Recommender Systems”: practical methods for IPS/doubly-robust policy evaluation.
Related concepts
- Recommender And Ranking Systems (Machine Learning)
- Ranking, Recommendation, And Feedback Systems (ML System Design)
- Recommendation System Design (ML System Design)
- Candidate Generation, Ranking, And Feature Stores (ML System Design)
- Recommender And Ranking System Design
- Recommender Systems, Feed Ranking, And Marketplace Metrics (Machine Learning)