Recommendation System Design
Asked of: Machine Learning Engineer
What's being tested
Interviewers are probing your ability to design a production-grade recommendation pipeline end-to-end from an ML engineering perspective: defining offline metrics and datasets, building reproducible training pipelines, choosing candidate-generation and scoring patterns, ensuring online/offline parity, and instrumenting serving and drift monitoring. They want to see concrete tradeoffs (latency vs model complexity, freshness vs accuracy), logging/observability choices that enable reliable A/B testing, and how you protect user privacy while preserving evaluation fidelity.
Core knowledge
- Two-stage architecture (candidate generation + scoring): generate O(10^2–10^4) candidates via fast retrieval (popularity, embedding nearest neighbors, ANN), then apply a heavier scoring model to rank them; this reduces latency and compute cost per impression.
- Candidate retrieval options: use popularity/recency as a baseline, collaborative filtering (ALS, matrix factorization) for coarse personalization, and embedding-based retrieval (Faiss, HNSW) for semantic matches; embedding retrieval scales to millions of items with ANN.
- Scoring models: pointwise logistic or pairwise/ranking losses (BPR, LambdaRank), with gradient-boosted trees (XGBoost, LightGBM) or shallow DNNs as the model family; balance expressiveness against prediction latency (target p99 < feed slot SLA).
- Feature freshness & online features: separate static features (content metadata) from online features (recent watches, session signals); set freshness targets per feature class (e.g., 1–10s for session signals, minutes for aggregates) and measure offline/online parity.
- Offline evaluation metrics: use CTR, watch time (weighted), NDCG@k, and MRR; when optimizing watch time, report any adverse effects on session length and retention. Quantify uncertainty: NDCG differences require many impressions to be significant.
- Counterfactual evaluation: apply inverse propensity scoring (IPS) to estimate the effect of policy changes offline; this requires logged propensities and good coverage of the actions the new policy would take.
- Experimentation & metrics: instrument exposures and downstream conversions; design guardrail metrics (DAU, retention, content-safety flags). Correct for multiple comparisons across metrics (e.g., Bonferroni), and use sequential tests (e.g., the sequential probability ratio test) if results are checked before the experiment ends.
- Production pipelines & reproducibility: orchestrate training (Airflow, Kubeflow), store features in a feature store (Feast-style) with lineage and schema versioning, and persist model artifacts with deterministic hashes.
- Serving & latency: two serving patterns are online scoring (model invoked per impression, lower candidate count) and pre-scoring/caching (precompute top-k per user or per cohort). Set a p99 latency SLA and quantify the CPU/GPU budget per request.
- Monitoring & drift detection: monitor prediction distributions, calibration, feature drift, and label distribution; set automated alerts for p99 latency spikes and sudden metric regressions.
- Privacy & compliance: reference user signals only as discrete feature sources; apply differential privacy, aggregation, or hashing to PII, and ensure logging complies with retention policies.
- Exploration-exploitation: implement exploration via epsilon-greedy, Thompson Sampling, or controlled randomized buckets; track exploration's offline impact via IPS and its online impact via uplift tests.
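The ranking metrics named in the list above (NDCG@k and MRR) are easy to state but easy to get subtly wrong in an interview. The following is a minimal sketch, assuming graded relevance labels are given in ranked order, a linear gain, and a log2 position discount; the function names are illustrative, not from any particular library.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k positions (linear gain, log2 discount)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG@k divided by the DCG@k of the ideal (descending) ordering of the same labels."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

def mrr(ranked_lists):
    """Mean reciprocal rank of the first relevant item in each ranked list."""
    total = 0.0
    for relevances in ranked_lists:
        for rank, rel in enumerate(relevances, start=1):
            if rel > 0:
                total += 1.0 / rank
                break  # only the first relevant item counts
    return total / len(ranked_lists) if ranked_lists else 0.0

# Labels are listed in the order the model ranked the items.
perfect = ndcg_at_k([3, 2, 1], k=3)      # 1.0: already in ideal order
swapped = ndcg_at_k([0, 1], k=2)         # about 0.63: relevant item pushed to rank 2
avg_rr = mrr([[0, 1, 0], [1, 0, 0]])     # 0.75: (1/2 + 1/1) / 2
```

Some teams use an exponential gain (2^rel - 1) instead of the linear gain shown here; it is worth stating which variant you mean when quoting NDCG numbers.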
Worked example: Design a video recommendation system
First 30 seconds: clarify goals (maximize immediate watch_time or long-term retention?), constraints (per-impression p99 latency budget, QPS, offline storage limits), privacy/regulatory requirements, and whether we're personalizing for logged-out users.
Skeleton answer pillars:
(1) Data & metrics: define events to log (impressions, clicks, watch duration), guardrail metrics, and offline label construction.
(2) Architecture: two-stage pipeline (retrieval via embeddings/popularity, then a scoring model with re-ranking for diversity/safety).
(3) Training & infra: ETL to a feature store, offline batch training with periodic incremental updates, CI for model artifacts.
(4) Serving & monitoring: low-latency scorer, logging exposures and propensities, drift/SLI dashboards.
Explicit tradeoff: a very deep DNN may improve NDCG by a small percentage but push p99 latency beyond the SLA, whereas a cascade of smaller models with a learned re-ranker stays within budget; I’d favor the cascade to preserve UX.
Close: “If I had more time I’d design detailed logging for propensity scores, run IPS offline experiments to tune exploration, and prototype a cold-start flow using creator/content embeddings.”
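The two-stage pipeline from pillar (2) can be sketched in a few lines. This is a toy illustration under stated assumptions: brute-force dot-product search stands in for an ANN index, the scorer is a hand-weighted logistic function rather than a trained model, and all names, vectors, and weights (ITEM_VECS, FRESHNESS, WEIGHTS) are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def retrieve(user_vec, item_vecs, n_candidates):
    """Stage 1: top-n items by embedding dot product (brute force, stand-in for ANN)."""
    by_affinity = sorted(item_vecs, key=lambda i: dot(user_vec, item_vecs[i]), reverse=True)
    return by_affinity[:n_candidates]

def score(user_vec, item_vec, freshness, weights):
    """Stage 2: toy logistic scorer over two features (hypothetical weights)."""
    z = (weights["affinity"] * dot(user_vec, item_vec)
         + weights["freshness"] * freshness
         + weights["bias"])
    return 1.0 / (1.0 + math.exp(-z))

def recommend(user_vec, item_vecs, freshness, weights, n_candidates=3, k=2):
    """Retrieve a small candidate set cheaply, then rank only those candidates."""
    candidates = retrieve(user_vec, item_vecs, n_candidates)
    ranked = sorted(
        candidates,
        key=lambda i: score(user_vec, item_vecs[i], freshness[i], weights),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical 2-d embeddings and freshness features for four videos.
ITEM_VECS = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.5, 0.5]}
FRESHNESS = {"a": 0.0, "b": 1.0, "c": 1.0, "d": 0.2}
WEIGHTS = {"affinity": 1.0, "freshness": 0.5, "bias": 0.0}

top = recommend([1.0, 0.0], ITEM_VECS, FRESHNESS, WEIGHTS)
# top == ["b", "a"]: "c" is dropped at retrieval, and freshness lifts "b" over "a" at scoring.
```

The point of the split is that the expensive scorer only ever sees n_candidates items, so its per-request cost is bounded regardless of catalog size.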
A second angle
Reframe toward a short-form mobile feed with severe memory and latency constraints: emphasis shifts to extreme candidate pruning, on-device caches, and session-aware features. Here retrieval must favor freshness and short-term engagement signals; the scoring model might be a compact on-device model (quantized ONNX), with heavier personalization run server-side to update cache periodically. Exploration becomes critical to prevent stale loops; use lightweight randomized contextual buckets and evaluate with short-horizon retention metrics rather than only CTR.
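A lightweight randomized bucket of the kind described above might look like the following sketch. It assumes a single feed slot, unique item IDs, and an epsilon-greedy policy; the function name and the choice to return the propensity alongside the item are illustrative assumptions, made so the selection probability can be logged for later offline evaluation.

```python
import random

def choose_with_propensity(ranked_items, epsilon, rng):
    """Epsilon-greedy choice for one slot; returns (item, propensity).

    With probability 1 - epsilon the top-ranked item is served; otherwise an
    item is drawn uniformly from all candidates (which may also be the top item).
    Assumes item IDs in ranked_items are unique.
    """
    n = len(ranked_items)
    if rng.random() < epsilon:
        chosen = rng.choice(ranked_items)   # exploration branch
    else:
        chosen = ranked_items[0]            # exploitation branch
    uniform_share = epsilon / n
    if chosen == ranked_items[0]:
        propensity = (1.0 - epsilon) + uniform_share
    else:
        propensity = uniform_share
    return chosen, propensity

rng = random.Random(7)  # seeded only to make the sketch reproducible
item, propensity = choose_with_propensity(["v1", "v2", "v3", "v4"], epsilon=0.2, rng=rng)
# Log (item, propensity) with the exposure event.
```

With epsilon = 0.2 and four candidates, the top item has propensity 0.85 and every other item 0.05, so the probabilities sum to 1 and no logged propensity is zero.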
Common pitfalls
Pitfall: Optimizing only for immediate CTR and ignoring long-term metrics like retention or session_depth. This often increases short-term clicks but harms product health. Always propose guardrail metrics and consider a multi-objective loss or post-hoc business rules.
Pitfall: Designing offline evaluations without logging propensities — you cannot reliably estimate counterfactual performance; insist on instrumenting the serving layer to record selection probabilities.
Pitfall: Presenting a single monolithic model without addressing latency, caching, and feature freshness. Interviewers expect a cascade/caching plan and concrete p99 SLO tradeoffs.
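To make the propensity pitfall concrete, here is a minimal sketch of a clipped IPS estimator. It assumes each log row carries the probability with which the logging policy chose the shown action; the row layout, the target_prob callback, and the clip value are illustrative assumptions rather than a standard API.

```python
def ips_estimate(logs, target_prob, clip=10.0):
    """Clipped inverse-propensity estimate of a target policy's average reward.

    logs: iterable of (context, action, reward, logged_propensity) rows.
    target_prob(context, action): probability the new policy would pick `action`.
    clip: cap on the importance weight, trading a little bias for lower variance.
    """
    total, n = 0.0, 0
    for context, action, reward, propensity in logs:
        if propensity <= 0.0:
            raise ValueError("logged propensity must be positive")
        weight = min(target_prob(context, action) / propensity, clip)
        total += weight * reward
        n += 1
    return total / n if n else 0.0

# Toy logs from a uniform-random logging policy over two videos ("a", "b").
LOGS = [
    ("u1", "a", 1.0, 0.5),
    ("u1", "b", 0.0, 0.5),
    ("u2", "a", 1.0, 0.5),
    ("u2", "b", 0.0, 0.5),
]

def always_a(context, action):
    """Hypothetical target policy that always recommends video "a"."""
    return 1.0 if action == "a" else 0.0

estimate = ips_estimate(LOGS, always_a)  # 1.0: matches the reward "a" earns in these logs
```

Without the fourth column the weights cannot be computed at all, which is why the propensity has to be recorded at serving time rather than reconstructed later.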
Connections
Interviewers commonly pivot to adjacent areas like online experimentation & metric design (statistical power, multiple comparisons) and feature-store/serving infra (serving parity, feature delivery latency). They may also ask about ranking fairness or content-moderation pipelines that interact with the recommender.
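If the conversation pivots to statistical power and multiple comparisons, a back-of-the-envelope sample-size calculation helps. The sketch below assumes a two-sided two-proportion z-test on a rate metric such as CTR, the usual normal approximation, and a Bonferroni split of alpha across metrics; the function name and defaults are illustrative.

```python
import math
from statistics import NormalDist

def samples_per_arm(p_control, p_treatment, alpha=0.05, power=0.8, n_metrics=1):
    """Approximate users per arm needed to detect p_control -> p_treatment.

    Uses the normal-approximation formula for a two-sided two-proportion test.
    alpha is divided by n_metrics (Bonferroni) when several metrics are tested.
    """
    effect = p_treatment - p_control
    if effect == 0:
        raise ValueError("control and treatment rates must differ")
    adjusted_alpha = alpha / n_metrics
    z_alpha = NormalDist().inv_cdf(1.0 - adjusted_alpha / 2.0)
    z_power = NormalDist().inv_cdf(power)
    variance = p_control * (1.0 - p_control) + p_treatment * (1.0 - p_treatment)
    return math.ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)

one_metric = samples_per_arm(0.05, 0.055)                 # roughly 31,000 per arm
five_metrics = samples_per_arm(0.05, 0.055, n_metrics=5)  # larger, due to the stricter alpha
```

A 10% relative lift on a 5% CTR already needs tens of thousands of users per arm, which is why small offline gains often cannot be confirmed online without long-running experiments.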
Further reading
- Deep Neural Networks for YouTube Recommendations (Covington et al., 2016): practical two-stage retrieval + deep ranking design from a large-scale video service.
- Counterfactual Risk Minimization: Learning from Logged Bandit Feedback (Swaminathan and Joachims, 2015): explains IPS and counterfactual evaluation for logged-bandit feedback.