Recommendation Systems And Ranking

What's being tested

Candidates must demonstrate practical design and engineering judgment for building low-latency, scalable recommendation and ranking pipelines: structuring candidate generation and reranking, ensuring online/offline feature parity, and handling feedback-driven learning (exploration/exploitation). Interviewers probe your ability to balance latency, freshness, and model quality while describing training/serving pipelines, evaluation metrics, and deployment/monitoring for iterative improvement. Expect to justify tradeoffs (batch vs. online learning, ANN memory vs. recall, exploration budget) from an MLE operational viewpoint.

Core knowledge

Two-stage architecture: candidate generation narrows millions or billions of items to a few hundred or thousand using heuristics and embedding nearest-neighbor search; the reranker then scores those candidates with rich features for final ordering and personalization.
Feature freshness & consistency: use a feature store with both online (low-latency) and offline (batch) views; ensure training uses the same transformed features as serving to avoid training–serving skew.
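Embeddings & ANN: learn item/user embeddings (e.g., matrix factorization or a two-tower model trained with SGD); use an ANN library such as Faiss to serve nearest neighbors at scale. Product quantization compresses the vectors so that indexes of hundreds of millions to billions of vectors fit in memory, at some cost in recall.
Latency budgets: set a p99 SLO for the whole request (e.g., <100ms for web, <20ms for mobile SDKs) and divide it across stages; push heavy computation offline or to the candidate stage. The reranker's per-candidate cost must be small enough (typically microseconds when scoring thousands of candidates in a batch) that the whole candidate set fits in its share of the budget.
Losses & objectives: choose an objective aligned with the business metric: binary cross-entropy for CTR prediction, pairwise losses or LambdaRank for ranking (targeting NDCG). For multiple objectives, scalarize or use constrained optimization (Lagrangian) to trade off watch time against CTR.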
Cold-start: combine content-based features (metadata, category, textual embeddings) and popularity/recency heuristics; use meta-learning or warm-start embeddings via side features.
Feedback-driven learning: apply contextual bandits for exploration-exploitation; evaluate policies offline with Inverse Propensity Scoring (IPS), where each logged reward is weighted by w = π(a|x)/π0(a|x), with π the target policy and π0 the logging policy. Prefer doubly-robust estimators to reduce variance.
Offline evaluation vs. online metrics: use NDCG@k, AUC, and calibration checks offline; validate with online metrics like CTR, session duration, and retention. Be explicit about any mismatch between the loss you train on and the metric you actually want to move.
Negative sampling & label delay: for implicit feedback, carefully design negative sampling and account for delayed labels (conversions) to avoid label bias; consider censoring or survival analysis if delays are long.
Online learning & deployment: support incremental updates, periodic full retrains, and online updates for embeddings or shallow layers; use shadow serving and canary rollouts to validate model behavior before ramp.
Drift detection & monitoring: monitor feature distributions (KL divergence), model output distribution, and online metric shifts; automate alerts for upstream feed changes or feature holidays.
Privacy & fairness constraints: incorporate constraints (e.g., exposure caps) into ranking via post-processing (re-ranking) or constrained optimization, and log provenance for audits.
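
The drift-monitoring item above mentions KL divergence over feature distributions. A minimal sketch of that check is shown here, assuming a single numeric feature, fixed histogram bin edges, and an alert threshold of 0.1 nats. The function names, the smoothing constant, and the threshold are all illustrative assumptions, not recommendations.

```python
import bisect
import math

def histogram(values, edges):
    """Bucket values by sorted bin edges and return smoothed proportions."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect.bisect_right(edges, v)] += 1
    eps = 1e-6  # assumed smoothing so empty bins never yield log(0)
    total = len(values) + eps * len(counts)
    return [(c + eps) / total for c in counts]

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def drift_alert(reference, live, edges, threshold=0.1):
    """Compare live traffic against a reference window; threshold is hypothetical."""
    score = kl_divergence(histogram(live, edges), histogram(reference, edges))
    return score, score > threshold

reference = [i / 100 for i in range(100)]   # roughly uniform training-time feature
live = [0.9] * 100                          # live traffic collapsed into one bin
score, alert = drift_alert(reference, live, edges=[0.25, 0.5, 0.75])
print(round(score, 3), alert)
```

In practice the threshold would be tuned per feature against historical windows, since KL divergence is sensitive to bin choice and sample size.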

Worked example — "Design a real-time recommendation system"

First 30s: ask about traffic, latency, and memory SLOs, scale (DAU/items), acceptable exploration, and the business objective (CTR, watch time, retention). State assumptions: 100M items, 50M DAU, p99 latency 100ms.

Skeleton answer pillars: (1) candidate generation (ANN on learned embeddings + time-decayed popularity and content filters), (2) feature-enriched reranker (gradient-boosted trees or small transformer using user/item/context features, cross-features), (3) training & feature pipeline (offline feature store, periodic retrain, online features via fast key-value store), (4) serving & rollout (low-latency microservice, shadow testing, canary).
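
The first two pillars can be sketched in a few lines. This is a toy illustration under stated assumptions: exact inner-product search stands in for a real ANN index, a logistic model over two made-up features stands in for the reranker, and every identifier and number below is hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def generate_candidates(user_vec, item_vecs, k):
    """Stage 1: top-k items by inner product (exact search standing in for ANN)."""
    ranked = sorted(item_vecs, key=lambda i: dot(user_vec, item_vecs[i]), reverse=True)
    return ranked[:k]

def rerank(candidates, features, weights, n):
    """Stage 2: score only the surviving candidates with richer features."""
    def score(item_id):
        return 1.0 / (1.0 + math.exp(-dot(weights, features[item_id])))
    return sorted(candidates, key=score, reverse=True)[:n]

# Hypothetical embeddings and reranker features (e.g., freshness, followed-creator flag).
item_vecs = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [-1.0, 0.0]}
features = {"a": [0.1, 0.0], "b": [0.9, 1.0], "c": [0.5, 0.0], "d": [1.0, 1.0]}

candidates = generate_candidates([1.0, 0.2], item_vecs, k=3)
final = rerank(candidates, features, weights=[1.0, 0.5], n=2)
print(candidates, final)
```

Note that item "d" has the strongest reranker features but never reaches stage 2, which is the recall risk of the candidate stage that the tradeoff discussion is about.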

Flag a tradeoff: ANN recall vs. latency. Higher recall (a larger probe count) improves candidate coverage and diversity but increases p99, so prefer a hybrid of static popularity lists plus ANN top-K. Close with next steps: given more time, you would detail the feature data schemas, show an offline simulation of the policy with IPS, and design the experiment allocation for safe exploration.
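
The offline IPS simulation mentioned above can be sketched as follows, using the weight w = π(a|x)/π0(a|x) from the core knowledge list. The log format and function names are assumptions for illustration, and the optional weight clipping is a common variance-control addition that the text above does not prescribe.

```python
def ips_estimate(logs, target_policy, clip=None):
    """Estimate the target policy's average reward from logged interactions.

    logs: iterable of (context, action, reward, logging_propensity) tuples,
          where logging_propensity is pi0(action | context) and must be > 0.
    target_policy: function (context, action) -> pi(action | context).
    clip: optional cap on the importance weight (adds bias, lowers variance).
    """
    total, n = 0.0, 0
    for context, action, reward, p_log in logs:
        w = target_policy(context, action) / p_log
        if clip is not None:
            w = min(w, clip)
        total += w * reward
        n += 1
    return total / n

# Hypothetical logs from a uniform logging policy over actions "A" and "B".
logs = [
    ("u1", "A", 1.0, 0.5),
    ("u1", "B", 0.0, 0.5),
    ("u2", "A", 1.0, 0.5),
    ("u2", "B", 0.0, 0.5),
]

def always_a(context, action):
    return 1.0 if action == "A" else 0.0

print(ips_estimate(logs, always_a))            # unclipped estimate
print(ips_estimate(logs, always_a, clip=1.5))  # clipped estimate, biased downward here
```

The estimate is only unbiased when the logged propensities are correct and the logging policy gave nonzero probability to every action the target policy might take.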

A second angle — "Design feedback-driven recommender"

This framing emphasizes online learning and exploration. Start by specifying the feedback-loop latency and what counts as reward (a click, normalized watch time). Propose a contextual bandit layer on top of the baseline recommender to allocate the exploration budget, instrument propensity logging for IPS/DR evaluation, and use Thompson Sampling or ε-greedy with a decaying exploration rate for initial exploration. Operational concerns: log full contexts and chosen-action propensities to enable unbiased offline evaluation, and avoid catastrophic policy updates by constraining how much the policy can change per rollout. The core concepts (candidate/reranker separation in serving, feature parity, monitoring) are the same, but the priority shifts to safe online experimentation and reliable propensity bookkeeping.
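
A minimal sketch of ε-greedy selection with propensity bookkeeping is shown below, assuming a small discrete action set and model scores supplied by the baseline recommender. The exponential decay schedule and its constants (start, floor, half-life) are hypothetical choices made for illustration.

```python
import random

def epsilon_greedy(scores, epsilon, rng):
    """Choose an action and return (action, propensity of that action)."""
    actions = list(scores)
    greedy = max(actions, key=lambda a: scores[a])
    chosen = rng.choice(actions) if rng.random() < epsilon else greedy
    # Every action gets epsilon / K from uniform exploration;
    # the greedy action additionally gets the (1 - epsilon) exploitation mass.
    propensity = epsilon / len(actions)
    if chosen == greedy:
        propensity += 1.0 - epsilon
    return chosen, propensity

def decayed_epsilon(step, start=0.2, floor=0.01, half_life=10_000):
    """Halve the exploration rate every half_life steps, never below floor."""
    return max(floor, start * 0.5 ** (step / half_life))

rng = random.Random(0)
scores = {"item_1": 0.30, "item_2": 0.55, "item_3": 0.10, "item_4": 0.05}
action, propensity = epsilon_greedy(scores, decayed_epsilon(step=0), rng)
# Log the context, action, and propensity together so IPS/DR can be computed later.
log_record = {"context": "user_42", "action": action, "propensity": propensity}
print(log_record)
```

Logging the propensity at decision time matters because recomputing it later requires the exact scores and ε that were live for that request.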

Common pitfalls

Pitfall: Assuming an offline improvement (e.g., lower validation loss or higher NDCG@k) transfers directly to online business metrics.
Many candidates optimize surrogate losses without addressing offline–online mismatch; explicitly discuss proxy-metric limitations and plan for online validation (shadow, canary).

Pitfall: Ignoring training–serving skew from missing/late features.
A tempting short answer is "use all signals"; stronger answers specify fallback features, imputation strategies, and tests that simulate production missingness during offline training.
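
One way to sketch the "simulate production missingness" idea: null out features in the training rows at the rates seen in serving, then apply the same fallback imputation used online. The feature names, missing rates, and default values below are hypothetical, and the missing-indicator columns are an assumed design choice rather than something the text above requires.

```python
import random

def simulate_missingness(rows, missing_rates, rng):
    """Return copies of rows with features set to None at the given rates."""
    masked = []
    for row in rows:
        new_row = dict(row)
        for name, rate in missing_rates.items():
            if rng.random() < rate:
                new_row[name] = None
        masked.append(new_row)
    return masked

def impute(row, defaults):
    """Fill missing features with fallback defaults and add a missing flag."""
    out = {}
    for name, default in defaults.items():
        value = row.get(name)
        out[name] = default if value is None else value
        out[name + "_missing"] = 1.0 if value is None else 0.0
    return out

rng = random.Random(7)
train_rows = [{"ctr_7d": 0.30, "price": 10.0}, {"ctr_7d": 0.12, "price": 4.5}]
# Assumed production behavior: the 7-day CTR feature arrives late 20% of the time.
masked = simulate_missingness(train_rows, {"ctr_7d": 0.2}, rng)
prepared = [impute(r, defaults={"ctr_7d": 0.05, "price": 0.0}) for r in masked]
print(prepared)
```

Running the same `impute` function in both the training pipeline and the serving path is what keeps the fallback behavior from becoming a new source of skew.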

Pitfall: Over-explaining infra details (e.g., Kafka partitions) instead of ML decisions.
Focus on model behavior, feature freshness, exploration policy, and monitoring; mention infrastructure only to justify feasibility and latencies.

Connections

Interviewers may pivot to causal inference (long-term value estimation and off-policy evaluation), CTR calibration and uplift modeling, or MLE topics like feature-store architecture and continuous deployment patterns (shadow traffic, canaries).
