Recommender Systems and Ranking

What's being tested

Interviewers are checking that you can design, build, and operate a production-grade recommendation / ranking pipeline: selecting scalable candidate sources, training effective ranking models, and maintaining online/offline parity. They'll probe your knowledge of ranking losses, evaluation metrics, feature engineering for user/item temporal signals, and operational concerns like latency, drift detection, and model rollout. Netflix cares because small ranking changes materially affect engagement and streaming costs, so expect questions that combine algorithmic tradeoffs with deployment and monitoring responsibilities.

Core knowledge

Two-stage architecture: candidate generation (recall) then re-ranking (precision). Candidate stage reduces millions to ~100–2,000 items; final ranker optimizes engagement metrics under latency constraints.
Candidate techniques: collaborative filtering (matrix factorization), content-based retrieval, session-based sequence models, and embedding nearest-neighbor lookups; use `FAISS` / `Annoy` for approximate nearest neighbor (ANN) at scale.
Embedding models: learn user/item embeddings with dot-product or concatenation + MLP; store embeddings in a feature store for both training and low-latency serving. Embedding dims typically 32–512 depending on model complexity.
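Ranking model families: pointwise (cross-entropy/MSE), pairwise (BPR: $\mathcal{L}=-\sum_{(u,i,j)} \log \sigma(s_{ui}-s_{uj})$, where $i$ is an observed item and $j$ a sampled negative for user $u$), and listwise (e.g., LambdaRank/LambdaMART); choose based on how well the training objective aligns with the ranking metric you care about.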
Ranking metrics: NDCG@k = DCG@k / IDCG@k, where DCG@k = $\sum_{i=1}^k \frac{2^{rel_i}-1}{\log_2(i+1)}$ and IDCG@k is the DCG@k of the ideal (relevance-sorted) ordering; also Recall@k, MRR, and session-level metrics. Check that offline metric gains actually track online engagement before using them for model selection.
Counterfactual & bias correction: logged data carries position bias and exposure selection; correct with inverse propensity scoring (IPS) or doubly robust estimators to reduce offline evaluation bias, especially when the policy being evaluated differs from the one that produced the logs.
Negative sampling & class imbalance: for implicit feedback, sample negatives carefully (uniform vs popularity-aware); use adaptive sampling to avoid biasing training toward easy negatives.
Sequence modeling: for temporal personalization, use Transformer / SASRec or a lightweight LSTM/CNN. The tradeoff is better accuracy against higher inference cost; consider distilling the sequence model, or running it only at the candidate stage and passing its outputs to the ranker as features.
Latency & throughput constraints: final ranker must meet tail latency (p95/p99) SLOs; prefer shallow ensembles or trees (`XGBoost`/`LightGBM`) or optimized TorchScript models with batching for inference.
Offline/online parity: ensure feature computation and encoding are identical between training and serving; reproduce production feature staleness in offline training data to avoid skew.
Monitoring & drift: monitor input feature distributions, embedding similarity drift, online metric deltas, and model health (AUC/NDCG decay); trigger alerts or automated retraining when thresholds are breached.
Scale knobs: candidate pool size, embedding dimension, ANN index type (IVF, HNSW), quantization levels; measure latency vs recall tradeoffs and iterate.
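
The NDCG@k and BPR definitions above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function names (`dcg_at_k`, `ndcg_at_k`, `bpr_loss`) are made up for this example, relevance labels are assumed to be non-negative numbers listed in ranked order, and BPR is computed over pre-scored (positive, negative) pairs rather than inside a training loop.

```python
import math


def dcg_at_k(relevances, k):
    """DCG@k with exponential gain: sum of (2^rel_i - 1) / log2(i + 1), i = 1..k."""
    return sum(
        (2 ** rel - 1) / math.log2(i + 1)
        for i, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k):
    """NDCG@k: DCG@k of the given order divided by DCG@k of the ideal order."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def bpr_loss(score_pairs):
    """BPR loss: -sum log sigmoid(s_ui - s_uj) over (positive, negative) score pairs."""
    total = 0.0
    for s_pos, s_neg in score_pairs:
        x = s_pos - s_neg
        # -log(sigmoid(x)) = log(1 + exp(-x)), written in a numerically stable form
        total += max(-x, 0.0) + math.log1p(math.exp(-abs(x)))
    return total
```

A list that is already in ideal order scores `ndcg_at_k(...) == 1.0`, and `bpr_loss` shrinks as positives are scored further above their sampled negatives, which is the behavior the pairwise objective is meant to reward.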

Tip: log raw exposure and impression events with enough context to compute propensities later; you can't build debiasing without them.
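
As a rough illustration of "enough context", one possible shape for an impression log record is sketched below. Every field name here is hypothetical and chosen for this example only; the point is that the shown position, the full candidate list, the serving policy, and whether the slot was randomized are all captured at impression time.

```python
import json
from dataclasses import dataclass, asdict
from typing import List, Optional


@dataclass
class ImpressionEvent:
    """Hypothetical log record; field names are illustrative assumptions."""
    request_id: str
    user_id: str
    policy_id: str             # which ranker/version produced the list
    candidate_ids: List[str]   # full ranked list considered for this request
    item_id: str               # the item this event describes
    position: int              # 1-based slot where the item was shown
    randomized: bool           # True if the order came from a randomization bucket
    clicked: bool
    logged_propensity: Optional[float] = None  # set when the serving policy exposes it


def to_log_line(event):
    """Serialize one event as a single JSON line with stable key order."""
    return json.dumps(asdict(event), sort_keys=True)
```

Keeping `randomized` and `policy_id` on every record is what later makes it possible to separate randomized traffic for propensity estimation from regular traffic.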

Worked example — design a two-stage personalized video recommender

Frame it: ask clarifying questions about the latency SLO, traffic volume, freshness (real-time vs nightly), and available signals (watch history, device, position). Outline three pillars: (1) candidate generation: hybrid signals, combining a session-based sequence model for short-term interest with embedding ANN lookups on the long-term profile to produce ~1,000 diverse candidates; (2) feature enrichment: compute recency, watch-duration proxies, and device/context features from a feature store that exposes both online and offline representations; (3) final ranker: a latency-optimized model (shallow `XGBoost` or small TF model) trained with a pairwise loss to prioritize engagement and tuned for NDCG@10. Flag tradeoffs: a deep Transformer improves recall but raises p99 latency; mitigate by distilling it into a smaller model or by running the heavy model in the background to refresh embeddings. Close with a rollout plan: run IPS-based offline evaluation first, then canary on a small slice of traffic, monitor NDCG/CTR, and define rollback criteria up front. With more time, you'd prototype ANN index types with `FAISS` and add propensity-weighted offline evaluation.
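
The two-stage flow described above can be sketched as follows. This is a toy illustration under stated assumptions: exact dot-product search stands in for an ANN index such as `FAISS`, `score_fn` stands in for the trained ranker, and all function and variable names are hypothetical.

```python
import heapq


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def generate_candidates(user_vec, item_vecs, n_candidates):
    """Stage 1 (recall): top-n item ids by embedding dot product.

    Exact search is used here for clarity; at scale an ANN index would replace it.
    """
    return heapq.nlargest(
        n_candidates, item_vecs, key=lambda item_id: dot(user_vec, item_vecs[item_id])
    )


def rerank(candidates, score_fn, k):
    """Stage 2 (precision): score only the candidates with a richer function."""
    return sorted(candidates, key=score_fn, reverse=True)[:k]


# Toy data: 2-d embeddings plus one enrichment feature (recency) per item.
item_vecs = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [-1.0, 0.0]}
recency = {"a": 0.1, "b": 0.9, "c": 0.5, "d": 0.2}
user_vec = [1.0, 0.0]

candidates = generate_candidates(user_vec, item_vecs, n_candidates=2)
top = rerank(
    candidates,
    lambda item_id: 0.5 * dot(user_vec, item_vecs[item_id]) + 0.5 * recency[item_id],
    k=2,
)
```

The design point is that the expensive scoring function only ever sees the small candidate set, so the candidate pool size is the main knob trading recall against ranker latency.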

A second angle — optimizing for unbiased offline evaluation after a UI change

Now assume a recent UI shuffle changed exposure propensities per position. The same ranking concepts apply, but the emphasis shifts to counterfactual estimation: compute new position propensities from randomization buckets, or run small randomized interventions to estimate exposure. Train with an IPS-weighted loss or a doubly robust estimator to get a less-biased training signal and less-biased offline estimates. Operationally, instrument all impressions and keep a compact log of candidate lists and shown positions; this enables accurate propensity calculation and safer offline model selection before full rollout. Here the MLE responsibilities are ensuring logging completeness, implementing IPS weighting in the training pipeline, and validating offline-to-online correlation.
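
A minimal sketch of the propensity and IPS steps follows, assuming a simple position-based examination model (a click requires the slot to be examined, and examination depends only on position). The function names and the clipping threshold are illustrative assumptions, and events are reduced to `(position, clicked)` tuples for brevity.

```python
from collections import defaultdict


def estimate_position_propensities(randomized_events):
    """Estimate examination propensity per position from randomized traffic.

    With item order randomized, average relevance is the same at every position,
    so click-through rate relative to the top slot approximates examination.
    """
    shown = defaultdict(int)
    clicks = defaultdict(int)
    for position, clicked in randomized_events:
        shown[position] += 1
        clicks[position] += int(clicked)
    ctr = {p: clicks[p] / shown[p] for p in shown}
    top_ctr = ctr[min(ctr)]
    if top_ctr == 0:
        raise ValueError("no clicks at the top position; cannot normalize")
    return {p: c / top_ctr for p, c in ctr.items()}


def ips_weighted_click_rate(logged_events, propensities, clip=0.05):
    """Mean of click / propensity(position); clipping bounds the variance."""
    total, n = 0.0, 0
    for position, clicked in logged_events:
        weight = 1.0 / max(propensities[position], clip)
        total += weight * int(clicked)
        n += 1
    return total / n if n else 0.0
```

The same per-event weight can multiply each example's loss term during training. Clipping trades a small amount of bias for lower variance, so the threshold is worth tuning against held-out randomized traffic.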

Common pitfalls

Pitfall: Over-optimizing offline NDCG while ignoring exposure bias. Interview candidates often propose training on observed clicks without correcting for position/exposure; the resulting model chases UI artifacts and fails in production.

Pitfall: Ignoring inference tail latency. Proposing a large Transformer for final ranking without a deployment plan (quantization, batching, distillation) leads to missed SLOs and deployment rejection.

Pitfall: Training-serving skew from feature mismatch. Using enriched features in training that aren't computed identically at serving (staleness, different encoders) will produce optimistic offline metrics and poor online outcomes.
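
One way to catch this kind of skew early is to join a sample of serving-time feature logs against the offline features for the same requests and report a per-feature mismatch rate. The sketch below assumes both sides are available as dictionaries keyed by a shared request id; the function name and tolerance are hypothetical.

```python
import math


def _same(a, b, rel_tol):
    """Numeric values compare with a tolerance; everything else compares exactly."""
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return math.isclose(a, b, rel_tol=rel_tol, abs_tol=1e-12)
    return a == b


def feature_parity_report(offline_rows, online_rows, rel_tol=1e-6):
    """Return {feature_name: mismatch_rate} over request ids present on both sides.

    A feature missing from the online row counts as a mismatch.
    """
    compared, mismatched = {}, {}
    for key, offline in offline_rows.items():
        online = online_rows.get(key)
        if online is None:
            continue  # request was not sampled on the serving side
        for name, off_val in offline.items():
            compared[name] = compared.get(name, 0) + 1
            if not _same(off_val, online.get(name), rel_tol):
                mismatched[name] = mismatched.get(name, 0) + 1
    return {name: mismatched.get(name, 0) / n for name, n in compared.items()}
```

A feature whose mismatch rate is well above zero (for example, a recency value that is a day stale at serving time) is a likely source of optimistic offline metrics.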

Connections

Interviewers may pivot to online experimentation (canarying, instrumentation) or causal inference (bandits, IPS), and to infrastructure topics like feature stores and ANN serving (`FAISS`). Be ready to discuss how model changes map to business metrics and the ops work to keep models reliable.