Production ML Infrastructure and Monitoring

What's being tested

Interviewers are probing whether you can operationalize machine learning so models deliver sustained business value: stakeholder engagement, production deployment patterns, monitoring for data and concept drift, and an iteration cadence that balances velocity with safety. They want to see practical decisions an ML Engineer makes—what to instrument, what SLOs to set, how to triage incidents, and how to enable data scientists to iterate without breaking production.

Core knowledge

Stakeholder mapping: identify consumer teams, owners, and SLAs (e.g., prediction latency SLO, accuracy SLO, data freshness SLA). Capture who owns mitigation decisions and rollback authority.
Feature store fundamentals: use a feature store such as Feast to keep online and offline feature values consistent, make training joins point-in-time correct and reproducible, and give every feature a single source of truth; store both materialized online features and batch feature views.
Serving patterns: know canary/blue-green/rollout-by-percentage strategies and when to use synchronous serving (TF Serving, Seldon) versus asynchronous batch scoring; trade latency vs throughput and statefulness.
Instrumentation signals: short-term system signals include p95/p99 latency, error rates, and throughput; model signals include business KPI delta, model performance (AUC, precision@k), input-feature distributions, and the prediction distribution.
Drift detection math: monitor data drift with metrics like Population Stability Index (PSI) and KL divergence. For a continuous feature, bin the values, then compute PSI for each monitoring window against a reference window: $PSI=\sum_i (p_i-q_i) \ln\frac{p_i}{q_i}$, where $p_i$ and $q_i$ are the fractions of the current and reference samples that fall in bin $i$. Set thresholds empirically (e.g., alert at PSI>0.2).
Concept vs data drift: concept drift is a change in the relationship between inputs and labels; detect it by tracking model performance against fresh labels, using holdout or retrospective labeling pipelines. Data drift is a change in the input distribution and can exist without performance loss.
Label latency and feedback loops: quantify label availability delay and design delayed-eval pipelines; use shadow traffic and offline-simulated labels when labels are slow or missing.
Alerting and SLOs: set alerts on symptom and cause—symptom: business KPI drop; cause: feature distribution shift or increased p99 latency. Define actionable thresholds to avoid alert fatigue (use multi-signal gates).
Reproducible training pipelines: use CI for model training (MLflow for experiments, artifact versioning), capture data and code hashes, and register models with metadata (training data snapshot, hyperparameters).
Rollback and mitigation: automate safe fallbacks such as returning to the last-known-good model, falling back to a rule-based baseline, or applying admission control to throttle traffic.
Privacy & governance: ensure feature access follows data governance; redact PII before serving and document model cards and data lineage for audits.
Cost vs fidelity tradeoffs: decide the sampling rate for monitoring (e.g., 1–5% for high-cost signals) versus full-traffic capture; approximate storage as queries-per-second × sampling rate × bytes per logged record × retention window in seconds.
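
The PSI formula from the drift-detection item above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a reference implementation: the function name `psi`, the choice of fixed-width bins over the reference range, and the `eps` floor for empty bins are all assumptions made here (quantile bins are an equally common choice).

```python
import math
from typing import List, Sequence


def psi(reference: Sequence[float], current: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index of `current` against `reference`.

    Assumption: fixed-width bins spanning the reference range; values
    outside that range are clamped into the first or last bin.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant features

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        total = len(values)
        # Floor at eps so empty bins do not cause log(0) or division by zero.
        return [max(c / total, eps) for c in counts]

    p = proportions(current)    # p_i: current-window bin fractions
    q = proportions(reference)  # q_i: reference-window bin fractions
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))


reference_window = [float(x) for x in range(100)]
shifted_window = [x + 50.0 for x in reference_window]
drifted = psi(reference_window, shifted_window) > 0.2  # threshold from the text
```

The `eps` floor strongly inflates PSI when bins are empty, so in practice the bin count and floor would need tuning per feature.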

Tip: instrument the minimal set of signals that make incidents actionable—capture root-cause evidence with every alert (feature snapshot, prediction, request id).
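
One way to make the tip concrete is to attach the evidence to the alert object itself and only escalate when a symptom and a cause fire together. The sketch below is illustrative: `AlertEvidence`, `Alert`, and `evaluate_gate` are hypothetical names, and the default thresholds (5% KPI drop, PSI 0.2) are simply the example values used elsewhere in this guide.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class AlertEvidence:
    """Root-cause evidence captured with every alert."""
    request_id: str
    prediction: float
    feature_snapshot: Dict[str, Any]


@dataclass
class Alert:
    severity: str  # "warning" or "critical"
    reason: str
    evidence: AlertEvidence


def evaluate_gate(kpi_drop_pct: float, max_feature_psi: float,
                  evidence: AlertEvidence,
                  kpi_threshold_pct: float = 5.0,
                  psi_threshold: float = 0.2) -> Optional[Alert]:
    """Multi-signal gate: critical only when symptom and cause agree."""
    symptom = kpi_drop_pct > kpi_threshold_pct  # business KPI drop
    cause = max_feature_psi > psi_threshold     # feature distribution shift
    if symptom and cause:
        return Alert("critical", "KPI drop with feature drift", evidence)
    if symptom:
        return Alert("warning", "KPI drop without detected drift", evidence)
    if cause:
        return Alert("warning", "feature drift without KPI impact", evidence)
    return None


sample = AlertEvidence("req-123", 0.87, {"age": 31, "ctr_7d": 0.12})
alert = evaluate_gate(kpi_drop_pct=6.0, max_feature_psi=0.3, evidence=sample)
```

Paging only on the combined signal is one design choice for limiting alert fatigue; a single signal still produces a warning so that nothing is silently dropped.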

Worked example — "How would you support ML stakeholders?"

First 30 seconds: ask who the stakeholders are (data scientists, product, ops), what SLAs and KPIs they care about (e.g., CTR lift, serving latency), and how labels are produced (real-time vs delayed). Organize the response into three pillars: (1) Onboarding & observability, (2) Safe deployment & iteration, (3) Governance & feedback loops.

For onboarding, propose standard docs, an MLflow-backed model registry, and a model card that sets expectations for each model. For observability, propose dashboards in Grafana/Datadog showing the business KPI, model metrics, and per-feature PSI; set tiered alerts (warning → critical) and attach a runbook to each. For deployment, propose a canary rollout with shadow traffic and automated rollback on predefined triggers (e.g., >5% KPI degradation or PSI>0.2).

A concrete tradeoff to flag: sampling more data for deep forensics increases storage cost and can add serving latency, so calibrate sampling to risk (e.g., a critical recommender path gets higher fidelity). Close by saying that with more time you'd implement automated retrain pipelines, documented SLAs with stakeholders, and monthly post-mortems that feed prioritized engineering tickets.
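
The canary-with-automated-rollback idea in this answer can be sketched as a staged loop. This is a hedged illustration only: `run_canary`, the stage percentages, and the `read_metrics` callback (which stands in for whatever monitoring system reports KPI degradation and per-feature PSI) are hypothetical; the trigger values are the ones quoted in the answer.

```python
from typing import Callable, List, Sequence, Tuple

# read_metrics(traffic_pct) -> (kpi_degradation as a fraction, max feature PSI)
MetricsReader = Callable[[int], Tuple[float, float]]


def run_canary(stages: Sequence[int], read_metrics: MetricsReader,
               max_kpi_degradation: float = 0.05,
               psi_threshold: float = 0.2) -> Tuple[int, str]:
    """Step traffic through `stages`; return (final traffic %, outcome)."""
    plan: List[int] = list(stages)
    if not plan:
        raise ValueError("at least one rollout stage is required")
    for pct in plan:
        kpi_degradation, max_psi = read_metrics(pct)
        if kpi_degradation > max_kpi_degradation:
            return 0, f"rollback at {pct}%: KPI degraded {kpi_degradation:.1%}"
        if max_psi > psi_threshold:
            return 0, f"rollback at {pct}%: PSI {max_psi:.2f}"
    return plan[-1], "promoted"


def healthy(_pct: int) -> Tuple[float, float]:
    # Stand-in metrics source for the example: always within thresholds.
    return 0.01, 0.05


final_pct, outcome = run_canary([1, 5, 25, 100], healthy)
```

A real rollout would also wait for a soak period at each stage before reading metrics; that is omitted here to keep the sketch short.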

A second angle — monitoring model degradation and responding

Reframe to monitoring: categorize detection into two buckets: data-signal monitoring (feature distributions, missingness, schema changes) and performance monitoring (prediction quality vs labels, business KPI trend). For real-time systems without immediate labels, use proxy signals: input distribution shifts, prediction-confidence histograms, and short-timescale user-engagement metrics. When drift is detected, run a staged response: (1) triage using per-feature PSI and partial dependence checks, (2) block automated model promotions if severity exceeds thresholds, (3) run a shadow retrain on recent data to estimate how much performance a retrain would recover. Communicate to stakeholders with a concise incident message: impact on KPI, root-cause hypothesis, and immediate mitigation (rollback or throttle). Emphasize measuring the false-positive alert rate and tuning thresholds to avoid unnecessary developer churn.
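
Steps (1) and (2) of that response can be sketched as a small triage function over per-feature PSI values. The function name `triage`, the feature names, and the 0.1 warning level are assumptions made for illustration; only the 0.2 critical level comes from this guide.

```python
from typing import Dict, List, TypedDict


class TriageResult(TypedDict):
    severity: str
    block_promotion: bool
    suspects: List[str]


def triage(feature_psi: Dict[str, float],
           warn: float = 0.1, critical: float = 0.2) -> TriageResult:
    """Rank features by PSI and decide whether to block model promotion."""
    ranked = sorted(feature_psi.items(), key=lambda kv: kv[1], reverse=True)
    worst = ranked[0][1] if ranked else 0.0
    if worst > critical:
        severity = "critical"
    elif worst > warn:
        severity = "warning"
    else:
        severity = "ok"
    return {
        "severity": severity,
        # Step (2): automated promotions stop while severity is critical.
        "block_promotion": severity == "critical",
        # Step (1): features above the warning level, worst first.
        "suspects": [name for name, value in ranked if value > warn],
    }


result = triage({"age": 0.05, "ctr_7d": 0.31, "country": 0.12})
# result["suspects"] lists "ctr_7d" before "country"; "age" is below warn.
```

The ordered suspect list gives the incident message its root-cause hypothesis without anyone having to open a dashboard first.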

Common pitfalls

Pitfall: Confusing symptom with cause — alerting only on business KPI drops without instrumenting feature distributions or request logs makes root-cause discovery slow. Always pair business alerts with causal signals.

Pitfall: Over-instrumenting without ownership — shipping dozens of metrics without documented owners leads to alert fatigue and stale dashboards. Assign metric ownership and a retirement process.

Pitfall: Treating offline metrics as sufficient — relying solely on offline validation (AUC on historical holdout) ignores serving-time issues like stale features, label shift, and latency-induced timeouts; prove parity with shadow testing.
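
A simple way to check parity during shadow testing is to log the feature values the online path actually served and compare them, request by request, with what the offline pipeline computes for the same requests. The sketch below assumes both sides are available as dictionaries keyed by request id; `parity_report` and the feature names are hypothetical.

```python
import math
from typing import Any, Dict, Tuple

Row = Dict[str, Any]  # feature name -> value


def _same(a: Any, b: Any, rel_tol: float) -> bool:
    """Compare numbers with a tolerance and everything else exactly."""
    def is_number(x: Any) -> bool:
        return isinstance(x, (int, float)) and not isinstance(x, bool)
    if is_number(a) and is_number(b):
        return math.isclose(a, b, rel_tol=rel_tol)
    return a == b


def parity_report(offline: Dict[str, Row], online: Dict[str, Row],
                  rel_tol: float = 1e-6) -> Tuple[float, Dict[str, int]]:
    """Return (mismatch rate, mismatch count per feature) over shared requests."""
    compared = 0
    mismatches: Dict[str, int] = {}
    for request_id, offline_row in offline.items():
        online_row = online.get(request_id)
        if online_row is None:
            continue  # request was not captured by the shadow path
        for name, offline_value in offline_row.items():
            compared += 1
            missing = name not in online_row
            if missing or not _same(offline_value, online_row[name], rel_tol):
                mismatches[name] = mismatches.get(name, 0) + 1
    rate = sum(mismatches.values()) / compared if compared else 0.0
    return rate, mismatches


offline_rows = {"r1": {"age": 31, "ctr_7d": 0.12}, "r2": {"age": 40, "ctr_7d": 0.30}}
online_rows = {"r1": {"age": 31, "ctr_7d": 0.12}, "r2": {"age": 40, "ctr_7d": 0.25}}
rate, by_feature = parity_report(offline_rows, online_rows)
```

A per-feature breakdown matters more than the overall rate, because a stale feature usually shows up as one name with many mismatches.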

Connections

This topic often pivots to Experimentation (A/B testing model variants and sequential testing), Data Engineering (data contracts and ingestion reliability), and Product (translating model performance into business KPIs and launch decisions).
