ML Observability And Production Monitoring

What's being tested

Candidates must demonstrate practical mastery of designing ML observability for a production recommendation system: what to instrument, how to detect and triage data/model drift, and how monitoring feeds deployment and retraining decisions. Interviewers probe your ability to choose concrete metrics, detection algorithms, alerting rules, and lightweight remediation patterns that an MLE owns (not deep product A/B design or upstream ETL plumbing).

Core knowledge

Monitoring layers: separate infrastructure, data, model, and business monitoring; each has different owners and SLOs — MLE owns model + data signals and their mapping to business metrics.
Key business metrics: CTR, session watch-time, DAU, and retention; pair each model-level signal with the business KPI it is expected to move on a shared dashboard, and add automated alerts for large deviations.
Model-level signals: track prediction distribution, confidence/calibration, top-k score distributions, top-N coverage, and offline metrics like AUC or NDCG for ranking pipelines.
Data drift vs concept drift: feature (covariate) drift = change in the input distribution p(x); label drift = change in p(y); concept drift = change in p(y|x). Detect feature drift with PSI and KS tests; detecting concept drift requires label feedback and online evaluation.
Statistical detectors: use the Population Stability Index, $\text{PSI}=\sum_i (P_i-Q_i)\ln\frac{P_i}{Q_i}$, where $P_i$ and $Q_i$ are the fractions of the two samples (for example, baseline and current) that fall in bin $i$. A common heuristic flags PSI > 0.25 as a significant shift and 0.1 to 0.25 as a moderate one. Use EWMA, CUSUM, and Page–Hinkley for change-point detection on rates; tune window size to trade sensitivity against noise.
Latency and correctness SLOs: track p95/p99 latency, error rates, and tail failures; set SLOs (e.g., p99 < 200 ms) and alert on error-budget burn rate rather than on individual threshold breaches to avoid noisy pages.
Feature parity and freshness: monitor online/offline parity by shadowing model inputs and comparing feature histograms and transformation logic; log feature hashes to detect schema or transformation divergence.
Ground-truth labeling and delay: measure label feedback delay and its impact on retraining frequency; compute effective sample size given label latency before trusting offline validation.
Sampling and logging strategy: use deterministic (seeded or hash-based) sampling and selective enrichment to log full context for a small fraction of requests (e.g., 0.5–2%), enabling root-cause analysis without the cost of full logging.
Shadowing and canarying: run candidate models in shadow mode and compare their outputs with production, then use canary traffic to validate at small scale. Measure business metric deltas and model agreement metrics before ramping up.
Instrumentation tools: use Prometheus + Grafana for time series, Sentry for errors, MLflow or Weights & Biases for model metadata, and Feast or your feature store for lineage and freshness tracking.
Alerting design: prefer aggregated, deterministic alerts (e.g., sustained >X sigma for Y minutes) and multi-signal alerts (data + model + business) to reduce false positives and pager fatigue.
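
The PSI formula listed under statistical detectors can be sketched in a few lines. This is a minimal illustration, not a production detector: the function name `psi`, the fixed `bin_edges` argument, and the `eps` floor for empty bins are all assumptions made for the example.

```python
import math

def psi(expected, actual, bin_edges, eps=1e-6):
    """Population Stability Index between two samples of one numeric feature.

    bin_edges are interior cut points (hypothetical, chosen by the caller);
    they define len(bin_edges) + 1 bins. eps floors empty bins so the log
    is always defined (an assumption, not part of the PSI definition).
    """
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")

    def bin_fractions(values):
        counts = [0] * (len(bin_edges) + 1)
        for v in values:
            # Bin index = number of edges at or below v.
            idx = sum(1 for edge in bin_edges if v >= edge)
            counts[idx] += 1
        return [max(c / len(values), eps) for c in counts]

    p = bin_fractions(expected)
    q = bin_fractions(actual)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))
```

In practice the bin edges are usually fixed from the baseline window (for example, its deciles) so that successive PSI values stay comparable over time.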

Worked example — Design a video recommendation system

First 30 seconds: ask clarifying questions about latency SLOs, expected traffic, label availability (are watch events reliable?), and privacy constraints. Organize the answer into pillars: (1) data & feature pipelines (signals, freshness, feature registry); (2) model training and evaluation (offline metrics, validation windows); (3) serving & rollout (latency, caching, shadowing); (4) observability and retraining loop (what to monitor, thresholds, automated retrain triggers). For observability, propose concrete artifacts: sampled raw logs, per-feature PSI, prediction-score histogram, top-k selection coverage, online calibration checks, and business KPI dashboards. Flag a tradeoff: tighter freshness increases compute and cost — choose incremental feature materialization for high-value features and hourly batch for others. Close by proposing phased rollout: shadow → canary → gradual ramp with automated rollback on multi-signal alerts; if more time, add offline synthetic tests, adversarial perturbation tests, and automated root-cause playbooks.
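
The phased rollout at the end of this answer can be made concrete with a small decision function. The sketch below is hypothetical: the stage names, the alert names, and the rule "two or more firing alerts trigger rollback" are illustrative assumptions, not a prescribed policy.

```python
# Hypothetical rollout stages, in order; the names are illustrative only.
STAGES = ["shadow", "canary", "ramp_25", "ramp_100"]

def next_stage(current, alerts_firing, min_alerts_for_rollback=2):
    """Decide the next rollout step from the set of currently firing alerts.

    - min_alerts_for_rollback or more alerts: roll back (multi-signal rule).
    - Some alerts, but fewer than that: hold at the current stage.
    - No alerts: advance one stage, staying at the final stage once reached.
    """
    if current not in STAGES:
        raise ValueError(f"unknown stage: {current}")
    if len(alerts_firing) >= min_alerts_for_rollback:
        return "rollback"
    if alerts_firing:
        return current
    idx = STAGES.index(current)
    return STAGES[min(idx + 1, len(STAGES) - 1)]
```

For example, `next_stage("canary", {"feature_psi", "ctr_drop"})` returns `"rollback"`, while a single firing alert holds the canary in place for investigation.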

A second angle

If the constraint is extreme latency (mobile clients with a 100 ms budget), emphasize a different monitoring focus: lightweight edge models and mobile-side telemetry. Instrument model size, per-device inference time, and on-device feature versions. Because labels arrive asynchronously, rely more on proxy signals (video start rate, immediate short-term engagement) and strict online/offline parity checks (checksums of preprocessing code). The same observability concepts apply, but you trade full-context logging for compact diagnostics and heavier pre-deployment shadow testing.
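
One way to implement the checksum-based parity check is to fingerprint a canonical serialization of the preprocessing configuration and ship that fingerprint with both the training artifact and the on-device bundle. The sketch below assumes the preprocessing can be described as a JSON-serializable dict; the function name and the config fields are hypothetical.

```python
import hashlib
import json

def preprocessing_fingerprint(config):
    """Return a SHA-256 hex digest of a preprocessing config.

    Keys are sorted and whitespace is removed before hashing, so two
    configs with the same content always produce the same fingerprint
    regardless of key order.
    """
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Hypothetical configs for the training pipeline and the mobile client.
offline_config = {"watch_time": {"clip_max": 3600, "log1p": True}, "version": 7}
device_config = {"version": 7, "watch_time": {"log1p": True, "clip_max": 3600}}

parity_ok = preprocessing_fingerprint(offline_config) == preprocessing_fingerprint(device_config)
```

A fingerprint mismatch reported in device telemetry is a compact diagnostic: it costs one short string per request or session instead of full feature logging.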

Common pitfalls

Pitfall: equating small metric drift with model failure.
Alerting on tiny but statistically significant shifts without business relevance creates noise; require multi-signal corroboration (feature PSI + prediction shift + KPI delta) before escalation.
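
A corroboration rule of this kind might look like the sketch below. All thresholds and parameter names are illustrative assumptions; in particular, the default of requiring two of the three signals is a choice made for the example, and setting `min_signals=3` requires all three.

```python
def should_escalate(feature_psi, prediction_shift_sigma, kpi_delta_pct,
                    psi_threshold=0.25, shift_threshold=3.0,
                    kpi_drop_threshold_pct=-2.0, min_signals=2):
    """Escalate only when enough independent signals agree.

    Returns (escalate, fired), where fired lists the signals that crossed
    their (hypothetical) thresholds.
    """
    signals = {
        "feature_psi": feature_psi > psi_threshold,
        "prediction_shift": abs(prediction_shift_sigma) >= shift_threshold,
        "kpi_delta": kpi_delta_pct <= kpi_drop_threshold_pct,
    }
    fired = [name for name, hit in signals.items() if hit]
    return len(fired) >= min_signals, fired
```

With these defaults, a PSI of 0.30 on its own is recorded but does not escalate; the same PSI together with a 3% KPI drop does.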

Pitfall: only monitoring aggregate metrics.
Aggregate CTR can hide distributional failures (e.g., certain content types or user cohorts degrading). Always slice by important dimensions and expose top-k cohort deltas.
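
As a small illustration of slicing, the sketch below ranks cohorts by their CTR change between a baseline and a current window. The input shape (a dict of cohort name to `(clicks, impressions)`) and the cohort names are assumptions made for the example.

```python
def top_cohort_deltas(baseline, current, k=3):
    """Return the k cohorts with the largest CTR drop, as (cohort, delta) pairs.

    baseline and current map cohort name -> (clicks, impressions).
    Cohorts missing from either window or with zero impressions are skipped.
    """
    deltas = []
    for cohort, (base_clicks, base_impr) in baseline.items():
        if cohort not in current:
            continue
        cur_clicks, cur_impr = current[cohort]
        if base_impr == 0 or cur_impr == 0:
            continue
        delta = cur_clicks / cur_impr - base_clicks / base_impr
        deltas.append((cohort, delta))
    deltas.sort(key=lambda item: item[1])  # most negative delta first
    return deltas[:k]

# Hypothetical data: aggregate CTR barely moves (180 vs 175 clicks per 3000
# impressions), yet the new_users cohort loses more than half of its CTR.
baseline = {"new_users": (50, 1000), "returning": (100, 1000), "kids": (30, 1000)}
current = {"new_users": (20, 1000), "returning": (125, 1000), "kids": (30, 1000)}
worst = top_cohort_deltas(baseline, current, k=1)
```

Here `worst` contains only the `new_users` cohort, which the aggregate number would have hidden.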

Pitfall: neglecting upstream parity and transformation drift.
The common wrong answer is "retrain more frequently." A better answer is to detect transformation or schema mismatches (feature missingness, silent defaulting), since retraining won't fix corrupted features.
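
Missingness and silent defaulting can both be tracked with simple per-feature rates over sampled request logs. The sketch below assumes logged rows are dicts and that the pipeline's fallback value for the feature is known and is not `None`; the function and field names are hypothetical.

```python
def feature_health(rows, feature, default_value):
    """Compute missing and default-value rates for one feature.

    rows is a list of dicts (sampled request logs). A value counts as
    missing if the key is absent or None, and as defaulted if it equals
    default_value (assumed to be the pipeline's known fallback).
    """
    if not rows:
        raise ValueError("rows must be non-empty")
    missing = sum(1 for r in rows if r.get(feature) is None)
    defaulted = sum(1 for r in rows if r.get(feature) == default_value)
    return {
        "missing_rate": missing / len(rows),
        "default_rate": defaulted / len(rows),
    }
```

A sudden jump in `default_rate` while `missing_rate` stays flat is the signature of silent defaulting: the feature still looks populated, but an upstream join or transformation has started returning the fallback value.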

Connections

Model observability often leads to adjacent pivots: automated retraining pipelines (CI/CD for models) and feature store governance (feature lineage and access controls). Interviewers may also pivot to evaluation design or experiment analysis to validate monitoring-driven rollbacks.
