Recommendation System Design

What's being tested

Interviewers are probing your ability to design a production-grade recommendation pipeline end-to-end from an ML engineering perspective: defining offline metrics and datasets, building reproducible training pipelines, choosing candidate-generation and scoring patterns, ensuring online/offline parity, and instrumenting serving and drift monitoring. They want to see concrete tradeoffs (latency vs model complexity, freshness vs accuracy), logging/observability choices that enable reliable A/B testing, and how you protect user privacy while preserving evaluation fidelity.

Core knowledge

Two-stage architecture (candidate generation + scoring) — generate O(10^2–10^4) candidates via fast retrieval (popularity, embedding nearest-neighbors, ANN) then apply a heavy scoring model to rank; reduces latency and compute cost per impression.
Candidate retrieval options: use popularity/recency as a baseline, collaborative filtering (ALS, matrix factorization) for coarse personalization, and embedding-based retrieval (e.g., the Faiss library or an HNSW index) for semantic matches; ANN search lets embedding retrieval scale to millions of items.
Scoring models: train with a pointwise logistic loss or pairwise/ranking losses (BPR, LambdaRank), using gradient-boosted trees (XGBoost, LightGBM) or shallow DNNs; balance expressiveness against prediction latency (p99 must fit within the feed slot SLA).
Feature freshness & online features — separate static (content metadata) and online (recent watch, session signals) features; aim for feature freshness windows (e.g., 1–10s for session signals, minutes for aggregates) and measure offline/online parity.
Offline evaluation metrics: use CTR, watch time (weighted), NDCG@k, MRR; when optimizing watch time, report any adverse effects on session length and retention. Quantify significance: small NDCG differences need many impressions before they are statistically significant.
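Counterfactual evaluation: apply inverse propensity scoring (IPS) to estimate the value of a new policy from logs collected under the old one. The self-normalized form (SNIPS) is $\hat{R}_{SNIPS}=\frac{\sum_i \frac{\pi_{new}(a_i|x_i)}{\pi_{old}(a_i|x_i)} r_i}{\sum_i \frac{\pi_{new}(a_i|x_i)}{\pi_{old}(a_i|x_i)}}$, where $x_i$ is the context, $a_i$ the logged action, and $r_i$ the observed reward; plain IPS divides by the number of logged events instead. Requires logged propensities $\pi_{old}(a_i|x_i)$ and good coverage: the old policy must give nonzero probability to every action the new policy can take.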
Experimentation & metrics: instrument exposures and downstream conversions; design guardrail metrics (DAU, retention, content-safety flags). Correct for multiple metrics (e.g., Bonferroni), and if results are checked before the planned end of the test, use a sequential testing method (e.g., a sequential probability ratio test).
Production pipelines & reproducibility — train via orchestration (Airflow, Kubeflow), store features in a feature store (Feast style) with lineage and schema versioning; persist model artifacts with deterministic hashes.
Serving & latency: there are two serving patterns, online scoring (run the model per impression on a smaller candidate set) and pre-scoring/caching (precompute top-k per user or per cohort). Target a p99 SLA; quantify the CPU/GPU budget per request.
Monitoring & drift detection — monitor prediction distributions, calibration, feature drift, and label distribution; set automated alerts for p99 latency spikes and sudden metric regressions.
Privacy & compliance: refer to user signals only as discrete feature sources, not as raw personal data; apply differential privacy or aggregation, hash (pseudonymize) PII, and ensure logging complies with retention policies.
Exploration-exploitation — implement exploration via epsilon-greedy, Thompson Sampling, or controlled randomized buckets; track exploration’s offline impact via IPS and online via uplift tests.
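
As a concrete reference for the offline ranking metrics listed above, here is a minimal Python sketch of NDCG@k and reciprocal rank for a single ranked list (MRR is the mean of the reciprocal rank over many lists). It assumes linear gain with a log2 position discount, and it computes the ideal ordering from the labels of the same list, so relevant items that were never shown are ignored. The function names are illustrative, not from any particular library.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k positions (log2 discount)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int) -> float:
    """DCG divided by the DCG of the ideal (descending-relevance) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def reciprocal_rank(relevances: Sequence[float]) -> float:
    """1 / rank of the first relevant item, or 0.0 if nothing is relevant."""
    for rank, rel in enumerate(relevances, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0
```

Here `relevances` holds the label of each item in the order the model ranked it, for example watch-time buckets or a 0/1 click flag.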

Worked example: Design a video recommendation system

First 30 seconds: clarify goals (maximize immediate watch_time or long-term retention?), constraints (per-impression p99 latency budget, QPS, offline storage limits), privacy/regulatory requirements, and whether we're personalizing for logged-out users.

Skeleton answer pillars:
(1) Data & metrics: define events to log (impressions, clicks, watch duration), guardrail metrics, and offline label construction.
(2) Architecture: two-stage pipeline (retrieval via embeddings/popularity, then a scoring model with re-ranking for diversity/safety).
(3) Training & infra: ETL to a feature store, offline batch training with periodic incremental updates, CI for model artifacts.
(4) Serving & monitoring: low-latency scorer, logging exposures and propensities, drift/SLI dashboards.

Explicit tradeoff: choose between a very deep DNN that improves NDCG by a small percentage but pushes p99 latency beyond the SLA and a cascade of smaller models with a learned re-ranker; I’d favor the cascade to preserve UX.

Close: “If I had more time I’d design detailed logging for propensity scores, run IPS offline experiments to tune exploration, and prototype a cold-start flow using creator/content embeddings.”
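
To make the two-stage architecture in this answer concrete, the sketch below retrieves candidates with a cheap similarity search and then runs a more expensive scorer on those candidates only. It is an illustration under stated assumptions: exact dot-product search stands in for a real ANN index, `score_fn` is a hypothetical placeholder for the scoring model, and all names are invented for this example.

```python
import heapq
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def retrieve_candidates(user_vec: Vector, item_vecs: Dict[str, Vector], n: int) -> List[str]:
    """Stage 1: cheap retrieval. Exact search here; an ANN index would replace it."""
    return heapq.nlargest(n, item_vecs, key=lambda item_id: dot(user_vec, item_vecs[item_id]))


def rank_candidates(candidates: List[str], score_fn: Callable[[str], float], k: int) -> List[str]:
    """Stage 2: run the expensive scorer on the candidates only and keep the top k."""
    return heapq.nlargest(k, candidates, key=score_fn)


def recommend(
    user_vec: Vector,
    item_vecs: Dict[str, Vector],
    score_fn: Callable[[str], float],
    n_candidates: int = 100,
    k: int = 10,
) -> List[str]:
    candidates = retrieve_candidates(user_vec, item_vecs, n_candidates)
    return rank_candidates(candidates, score_fn, k)
```

The cost structure is the point: the scorer runs `n_candidates` times per request instead of once per catalog item, which is what keeps a heavier model inside the latency budget. An item that retrieval misses can never be ranked, so retrieval recall bounds the quality of the whole pipeline.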

A second angle

Reframe toward a short-form mobile feed with severe memory and latency constraints: emphasis shifts to extreme candidate pruning, on-device caches, and session-aware features. Here retrieval must favor freshness and short-term engagement signals; the scoring model might be a compact on-device model (quantized ONNX), with heavier personalization run server-side to refresh the cache periodically. Exploration becomes critical to prevent stale feedback loops; use lightweight randomized contextual buckets and evaluate with short-horizon retention metrics rather than only CTR.
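
One lightweight way to implement randomized buckets is deterministic hashing, sketched below. This is an assumption about how the bucketing could work, not a prescribed design: the salt name, the use of SHA-256, and bucketing on a user ID alone (rather than on richer context) are all illustrative choices.

```python
import hashlib


def bucket_fraction(user_id: str, salt: str) -> float:
    """Map (salt, user_id) to a stable pseudo-random number in [0, 1)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def in_exploration_bucket(user_id: str, salt: str, exploration_rate: float) -> bool:
    """True if this user falls in the exploration slice for the given salt."""
    return bucket_fraction(user_id, salt) < exploration_rate
```

Because the assignment is a pure function of the ID and the salt, the device and the server can compute it independently and agree without a lookup. Changing the salt reshuffles users, which avoids always exploring on the same people.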

Common pitfalls

Pitfall: Optimizing only for immediate CTR and ignoring long-term metrics like retention or session_depth — this often increases short-term clicks but harms product health. Always propose guardrail metrics and consider multi-objective loss or post-hoc business rules.

Pitfall: Designing offline evaluations without logging propensities — you cannot reliably estimate counterfactual performance; insist on instrumenting the serving layer to record selection probabilities.
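
A minimal sketch of what recording selection probabilities can look like, assuming an epsilon-greedy serving policy that picks one item per request. The serving function returns the propensity alongside the chosen item so it can be written to the exposure log, and the estimator then applies the self-normalized ratio from the counterfactual evaluation item above. The log tuple layout and all function names are hypothetical.

```python
import random
from typing import Callable, Dict, List, Tuple

# One logged exposure: (context, action, propensity under the old policy, reward).
LogRow = Tuple[str, str, float, float]


def epsilon_greedy_with_propensity(
    scores: Dict[str, float], epsilon: float, rng: random.Random
) -> Tuple[str, float]:
    """Pick one item and return (item, probability that this policy picks it)."""
    best = max(scores, key=scores.get)
    if rng.random() < epsilon:
        chosen = rng.choice(list(scores))  # uniform exploration, may also pick `best`
    else:
        chosen = best
    # Every item gets epsilon / n from exploration; the best item adds (1 - epsilon).
    propensity = epsilon / len(scores) + ((1.0 - epsilon) if chosen == best else 0.0)
    return chosen, propensity


def snips_estimate(logs: List[LogRow], new_policy_prob: Callable[[str, str], float]) -> float:
    """Self-normalized IPS estimate of the new policy's average reward."""
    weights = [new_policy_prob(x, a) / p for x, a, p, _ in logs]
    total = sum(weights)
    if total == 0:
        return 0.0  # the new policy never agrees with any logged action
    return sum(w * row[3] for w, row in zip(weights, logs)) / total
```

With `epsilon = 0` every propensity is 1 for the greedy item and the logs contain no information about any other action, which is the coverage failure this pitfall describes.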

Pitfall: Presenting a single monolithic model without addressing latency, caching, and feature freshness — interviewers expect a cascade/caching plan and concrete p99 SLO tradeoffs.

Connections

Interviewers commonly pivot to adjacent areas like online experimentation & metric design (statistical power, multiple comparisons) and feature-store/serving infra (serving parity, feature delivery latency). They may also ask about ranking fairness or content-moderation pipelines that interact with the recommender.
