Production ML Serving, Feature Stores, And Monitoring

What's being tested

Candidates must show operational ownership of production ML serving, feature stores, and monitoring: designing low‑latency, consistent online inference; ensuring online/offline feature parity; and detecting shifts in model quality or input distributions so that retraining can be triggered safely. Interviewers probe tradeoffs (latency vs. model complexity, freshness vs. compute), reproducibility (versioning, lineage), and actionable monitoring (what to track and how to trigger remediation).

Core knowledge

Feature store architecture: separate online store (low‑latency key-value lookups) and offline store (batch historical joins). Point-in-time joins are mandatory to avoid label leakage during training.
Training-serving skew causes: different feature computation pipelines, stale features, mismatched aggregation windows, or training-time access to information that is unavailable at serving time (e.g., future labels); fix by sharing identical featurization code between training and serving, or by using a single feature store such as Feast or Tecton.
Latency budgets and QoS: optimize for p99 tail latency; techniques include feature caching, model quantization/compilation (e.g., ONNX, TensorRT), asynchronous batching, and light-weight fallback models for timeouts.
Online feature freshness and TTL: specify a freshness SLA per feature (e.g., user_last_activity no more than 1s stale for realtime personalization). The materialize vs. compute-on-read tradeoff depends on QPS and compute cost.
Model deployment patterns: shadowing (mirror traffic), canary rollouts, and blue/green; ensure per-request deterministic model versioning, and log the model version, features, and prediction for offline debugging and replay.
Drift & alerting signals: monitor input feature distributions (PSI, KL/JS divergence, KS test), label/target metrics (AUC, calibration), and business metrics (CTR, revenue). Use seasonality-aware thresholds and rolling baselines (7/28‑day windows).
Automated retrain triggers: combine statistical drift (e.g., PSI > 0.2) with performance degradation beyond tolerance (e.g., AUC drop > 0.02 absolute), plus minimum sample thresholds and human review gating.
OOD & confidence detection: monitor prediction entropy or maximum softmax probability, or use density/OOD detectors (Mahalanobis distance, ODIN). Low confidence should map to fallback logic or human review.
Logging & observability: log feature vector, entity key, timestamp, model version, prediction, and downstream signal arrival time; ensure logs are immutable and indexed for point-in-time replay and root-cause analysis.
Label delay and online evaluation: for delayed labels, use proxy metrics (engagement signals), offline simulated online evaluation using historical playback with point-in-time joins, and shadow experiments to estimate real-time impact.
Resource and cost controls: choose batching window and accelerator sizing to optimize latency vs. cost; micro-batching (e.g., 10–100ms) often improves GPU utilization but increases tail latency.
Governance & reproducibility: record model artifacts, feature definitions, data snapshots, and training code versions; produce lightweight model cards and maintain lineage to support audits and rollback.
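
The point-in-time join mentioned under feature store architecture can be sketched in plain Python. This is a minimal illustration, not how any particular feature store implements it: the function name point_in_time_join, the tuple layouts, and the integer timestamps are all assumptions made for the example. For each label event it picks the most recent feature value recorded at or before the event time, so no future information reaches the training row.

```python
from bisect import bisect_right
from collections import defaultdict


def point_in_time_join(label_events, feature_rows):
    """Attach the latest feature value known at or before each label event.

    label_events: list of (entity_key, event_ts, label)
    feature_rows: list of (entity_key, feature_ts, value)
    Returns a list of (entity_key, event_ts, label, value_or_None).
    """
    # Group the feature history per entity, sorted by timestamp.
    history = defaultdict(list)
    for key, ts, value in sorted(feature_rows, key=lambda r: (r[0], r[1])):
        history[key].append((ts, value))

    joined = []
    for key, event_ts, label in label_events:
        rows = history.get(key, [])
        timestamps = [ts for ts, _ in rows]
        # Index of the last feature row with feature_ts <= event_ts.
        idx = bisect_right(timestamps, event_ts) - 1
        value = rows[idx][1] if idx >= 0 else None  # None: no feature yet
        joined.append((key, event_ts, label, value))
    return joined


# Hypothetical data: timestamps are plain integers for readability.
features = [("u1", 10, 3), ("u1", 20, 5), ("u2", 15, 1)]
labels = [("u1", 12, 1), ("u1", 25, 0), ("u2", 5, 1)]
training_rows = point_in_time_join(labels, features)
```

The u2 event at time 5 gets no feature value because its only feature row was written at time 15; a naive "latest value" join would have leaked it.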

Worked example — Design a real-time recommendation system

First 30 seconds: clarify QPS, latency SLO (e.g., 50ms p95), cold-start constraints, label delay, and whether recommendations must be personalized per request or session-level. Also ask which business metric (CTR, watch-time, revenue) matters.

Skeleton answer pillars: (1) Candidate generation (approximate nearest neighbors, popularity filters, or learned recall using embeddings), (2) Feature computation & store (define entity keys, online vs offline features, point-in-time joins), (3) Ranking model & serving (low‑latency model optimized via quantization/ONNX or distilled models; use batching and caching), (4) Feedback loop & training (log impressions, clicks, delayed labels; offline replay and periodic retrain or online updates), (5) Monitoring & experiment infra (real-time feature distribution monitors, quality alerts, canaries).

Explicit tradeoff: a deeper neural ranker improves CTR but increases tail latency; prefer a two‑stage architecture (a lightweight ranker on the synchronous path that fits the latency SLO, plus a heavyweight reranker applied asynchronously to the top-K) to balance latency and quality. Closing: with more time, detail the specific similarity index (e.g., HNSW) for candidate retrieval, exact feature schemas with point-in-time join examples, and a rollout plan with success metrics and rollback thresholds.
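
A toy version of the retrieve-then-rank flow may help make the pillars concrete. This is only a sketch under stated assumptions: brute-force dot-product search stands in for an approximate nearest-neighbor index, and the names retrieve, rank, item_vecs, and ranker_scores are hypothetical. The ranker here is a lookup table standing in for a real model.

```python
def retrieve(user_vec, item_vecs, n):
    """Stage 1: brute-force dot-product retrieval (stand-in for an ANN index)."""
    def dot(item_vec):
        return sum(u * v for u, v in zip(user_vec, item_vec))

    ordered = sorted(item_vecs, key=lambda item_id: dot(item_vecs[item_id]),
                     reverse=True)
    return ordered[:n]


def rank(candidates, score_fn, k):
    """Stage 2: score only the retrieved candidates and keep the top k."""
    return sorted(candidates, key=score_fn, reverse=True)[:k]


# Hypothetical embeddings and ranker scores for illustration only.
item_vecs = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.5, 0.5]}
ranker_scores = {"a": 0.1, "b": 0.9, "c": 0.7, "d": 0.5}

candidates = retrieve([1.0, 0.0], item_vecs, n=3)
top_items = rank(candidates, lambda item_id: ranker_scores[item_id], k=2)
```

Item c never reaches the ranker even though its ranker score is high, which illustrates why recall in the first stage bounds the quality of the final list.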

A second angle — Design feedback-driven recommender

Here the emphasis shifts to online learning and exploration–exploitation. Frame assumptions: is an online update acceptable per interaction, or do we need batched updates? Main pillars: (1) contextual bandit policy (Thompson Sampling / UCB) for controlled exploration, (2) real-time feature ingestion into the online store for immediate personalization, (3) safe logging and reward attribution (credit assignment when rewards are delayed), (4) guardrails for user experience (cap the exploration rate, protect the top-N slots). Tradeoffs include exploration cost (short-term metric loss) versus long-term learning. Also call out off-policy evaluation (inverse propensity scoring, IPS; doubly robust, DR) and importance-sampling variance issues when measuring new policies from logged data.
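
The off-policy evaluation point can be illustrated with a small IPS estimator. This is a sketch, not a production evaluator: it assumes each log row carries the logging policy's propensity for the action taken, and the names ips_estimate, target_prob, and clip are introduced here for illustration. Weight clipping is one common way to limit variance at the cost of some bias; the self-normalized variant (SNIPS) is shown alongside it. The doubly robust estimator is not covered.

```python
def ips_estimate(logs, target_prob, clip=10.0):
    """Estimate a target policy's mean reward from logged bandit data.

    logs: list of (context, action, reward, logging_prob), where logging_prob
          is the probability the logging policy assigned to the logged action.
    target_prob: function (context, action) -> probability under the new policy.
    clip: upper bound on importance weights (assumed value, trades bias for variance).
    Returns (ips, snips).
    """
    weights, weighted_rewards = [], []
    for context, action, reward, logging_prob in logs:
        w = min(target_prob(context, action) / logging_prob, clip)
        weights.append(w)
        weighted_rewards.append(w * reward)

    ips = sum(weighted_rewards) / len(logs)
    total_weight = sum(weights)
    snips = sum(weighted_rewards) / total_weight if total_weight > 0 else 0.0
    return ips, snips


# Hypothetical logs from a uniform-random logging policy over actions "a" and "b".
logs = [("ctx", "a", 1, 0.5), ("ctx", "b", 0, 0.5),
        ("ctx", "a", 1, 0.5), ("ctx", "b", 0, 0.5)]


def always_a(context, action):
    """Target policy that deterministically chooses action "a"."""
    return 1.0 if action == "a" else 0.0


ips, snips = ips_estimate(logs, always_a)
```

When the logging propensity for an action is very small, its weight becomes very large, which is the variance issue the paragraph above refers to.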

Common pitfalls

Pitfall: Training-serving skew — teams often keep separate featurization code; this produces subtle biases that only appear in production. Always run the same featurization (library or hosted feature store) for train and serve.

Pitfall: Over-reacting to noisy alerts — too-sensitive thresholds trigger unnecessary retrains. Combine statistical significance, minimum sample size, and business-impact filters before invoking pipelines.
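
One way to encode that combination is a gate that requires drift, degradation, and enough labeled data at the same time. The sketch below is illustrative: the PSI > 0.2 and 0.02 AUC tolerances mirror the examples given under Core knowledge, while min_samples=1000, the bin edges, and the function names psi and should_retrain are assumptions. Human review would sit after this gate and is not modeled.

```python
import math


def psi(expected, actual, bin_edges, eps=1e-6):
    """Population Stability Index between a baseline and a recent sample."""
    def fractions(values):
        counts = [0] * (len(bin_edges) + 1)
        for v in values:
            # Bin index = number of edges that v is greater than or equal to.
            counts[sum(v >= edge for edge in bin_edges)] += 1
        # Floor empty bins at eps so the log ratio stays finite.
        return [max(c / len(values), eps) for c in counts]

    base, recent = fractions(expected), fractions(actual)
    return sum((r - b) * math.log(r / b) for b, r in zip(base, recent))


def should_retrain(psi_value, baseline_auc, current_auc, n_labeled,
                   psi_threshold=0.2, auc_tolerance=0.02, min_samples=1000):
    """Fire only when drift AND degradation AND sample size all hold."""
    drifted = psi_value > psi_threshold
    degraded = (baseline_auc - current_auc) > auc_tolerance
    enough_data = n_labeled >= min_samples
    return drifted and degraded and enough_data


# Hypothetical feature samples: the recent window has shifted upward.
baseline = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
recent = [0.8, 0.9, 0.85, 0.95, 0.9, 0.8, 0.7, 0.9]
edges = [0.25, 0.5, 0.75]

drift = psi(baseline, recent, edges)
decision = should_retrain(drift, baseline_auc=0.80, current_auc=0.77, n_labeled=5000)
```

With only 200 labeled samples, or with an AUC drop inside tolerance, the same drift value would not trigger a retrain.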

Pitfall: Ignoring tail latency — optimizing average latency while p99 remains high leads to user-visible slowdowns; design around tail behaviors (pre-warming, request hedging, fallback models).
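
A fallback model behind a latency budget can be sketched with the standard library. This is a simplified illustration, not a serving framework: heavy_model, fallback_model, predict_with_budget, the 0.2s simulated delay, and the pool size are all hypothetical. Note that a Python thread that has already started cannot be interrupted, so the slow call still consumes a worker until it finishes.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Shared worker pool for primary-model calls (size is an assumed value).
_executor = ThreadPoolExecutor(max_workers=4)


def heavy_model(features):
    """Stand-in for a slow primary model."""
    time.sleep(0.2)  # simulated inference time
    return sum(features) / len(features)


def fallback_model(features):
    """Stand-in for a cheap fallback, e.g. a constant prior score."""
    return 0.5


def predict_with_budget(features, budget_s,
                        primary=heavy_model, fallback=fallback_model):
    """Return (score, source); use the fallback if the primary misses the budget."""
    future = _executor.submit(primary, features)
    try:
        return future.result(timeout=budget_s), "primary"
    except FutureTimeout:
        future.cancel()  # best effort; has no effect once the call is running
        return fallback(features), "fallback"


slow_path = predict_with_budget([1.0, 3.0], budget_s=0.01)
fast_path = predict_with_budget([1.0, 3.0], budget_s=1.0)
```

Logging the returned source alongside the prediction makes the fallback rate itself a metric worth alerting on.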

Connections

Interviewers may pivot to model evaluation & experimentation (A/B testing design, off-policy evaluation), data labeling and label pipelines (label quality impacts retraining cadence), or infrastructure tradeoffs (edge vs. cloud deployment for serving).
