Production ML Validation And Monitoring
Asked of: Data Scientist
What's being tested
Ability to design and operate reliable post-deployment checks: choose sensible metrics and statistical tests, detect real issues (drift, label problems, regression), and specify remediation and deployment workflows that balance sensitivity and noise.
Core knowledge
- Key production metrics: ROC AUC, log loss, calibration (reliability curve, Brier score), latency, error rate by slice.
- Data-drift tests: PSI, KS-test for univariate; classifier-based two-sample tests for multivariate.
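- Label and concept shift: estimate label shift (changed class priors) from confusion-matrix shifts or EM and correct it with importance weighting; concept shift (a changed input-to-label relationship) shows up only in performance on fresh labels.
- Drift detectors: ADWIN and Page-Hinkley for streaming data; PSI for batch comparisons.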
- Instrumentation: Prometheus/Grafana for metrics, TFDV/Great Expectations for data validation, Feast for feature stores.
- Deployment patterns: shadow/canary, champion–challenger, automatic rollback thresholds.
- Sampling and ground truth: stratified label sampling, labeling latency impact, statistical power and multiple-testing control.
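The PSI check listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `psi`, the quantile binning on the baseline, the `eps` floor for empty bins, and the synthetic Gaussian samples are all choices made for this example. For the KS test, `scipy.stats.ks_2samp` is the usual choice in practice.

```python
import math
import random
from bisect import bisect_right


def psi(baseline, current, n_bins=10, eps=1e-6):
    """Population stability index between two 1-D numeric samples.

    Bin edges are baseline quantiles, so each bin holds roughly the same
    share of the baseline. eps (an assumption) guards against log(0).
    """
    ordered = sorted(baseline)
    # Interior cut points at the 1/n_bins, 2/n_bins, ... baseline quantiles.
    edges = [ordered[len(ordered) * i // n_bins] for i in range(1, n_bins)]

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            counts[bisect_right(edges, v)] += 1
        return [max(c / len(values), eps) for c in counts]

    expected = bin_shares(baseline)
    actual = bin_shares(current)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))


# Synthetic data for illustration only.
rng = random.Random(0)
train = [rng.gauss(0.0, 1.0) for _ in range(5000)]
same = [rng.gauss(0.0, 1.0) for _ in range(5000)]
shifted = [rng.gauss(0.5, 1.0) for _ in range(5000)]

psi_same = psi(train, same)
psi_shifted = psi(train, shifted)
print(f"no shift: {psi_same:.3f}  mean shift of 0.5: {psi_shifted:.3f}")
```

A commonly quoted rule of thumb reads PSI below 0.1 as stable and above 0.25 as a significant shift, but those cut-offs are conventions and are worth calibrating against the window sizes and alert volume of the system at hand.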
Worked example — "How would you detect and respond to data drift in production?"
Start by defining which distributions to monitor: raw features, model inputs (after featurization), and prediction scores. Establish baselines from training and validation data, then run univariate PSI and KS tests plus a classifier-based two-sample test for multivariate shifts. In parallel, track model performance on sampled labeled data, accounting for labeling latency. If drift is detected, triage whether it is covariate shift, label shift, or an upstream data-quality problem. Specify thresholds and actions in advance: increase label sampling, deploy a canary model or roll back, retrain on recent data, and add root-cause alerts for feature schema changes and missingness.
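One way to make the "thresholds and actions" part of this answer concrete is a small triage function that maps monitoring signals to responses. The sketch below is hypothetical throughout: the `DriftSignals` fields, the action names, and every threshold value are assumptions chosen for illustration, and a real policy would be tuned to the system's labeling latency and alert budget.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DriftSignals:
    """Hypothetical summary of one monitoring window."""
    max_feature_psi: float             # worst univariate PSI across features
    score_psi: float                   # PSI of the prediction-score distribution
    missing_rate_delta: float          # absolute rise in null rate vs. baseline
    schema_changed: bool               # upstream schema differs from baseline
    labeled_auc_drop: Optional[float]  # baseline AUC minus current AUC; None until labels arrive


def triage(s: DriftSignals, psi_warn=0.10, psi_alert=0.25,
           missing_tol=0.05, auc_tol=0.03) -> List[str]:
    """Return an ordered list of actions (all thresholds are illustrative)."""
    # 1. Rule out upstream breakage first: retraining cannot fix a broken feed.
    if s.schema_changed or s.missing_rate_delta > missing_tol:
        return ["alert_data_owner", "pause_retraining"]
    # 2. A regression confirmed on labeled data justifies changing what is served.
    if s.labeled_auc_drop is not None and s.labeled_auc_drop > auc_tol:
        return ["rollback_or_canary", "retrain_on_recent_data"]
    # 3. Inputs moved but impact is unconfirmed: gather evidence before acting.
    worst_psi = max(s.max_feature_psi, s.score_psi)
    if worst_psi >= psi_alert:
        return ["increase_label_sampling", "open_investigation"]
    if worst_psi >= psi_warn:
        return ["recheck_next_window"]
    return []


window = DriftSignals(max_feature_psi=0.31, score_psi=0.12,
                      missing_rate_delta=0.01, schema_changed=False,
                      labeled_auc_drop=None)
print(triage(window))
```

The ordering encodes the priority in the answer: data-quality causes are checked before any model change, and a model change is triggered by labeled evidence rather than by input drift alone.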
A common pitfall
Focusing solely on drops in an overall model metric (e.g., accuracy) and retraining immediately. Many problems are upstream (schema changes, missingness, label delay) or transient noise. Retraining without diagnosing label shift or sampling bias can perpetuate failure modes and create feedback loops.
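A cheap guard against this pitfall is to compare schema and missingness between a baseline batch and the current batch before any retraining decision. The following is a minimal stdlib sketch; the function names `profile` and `upstream_issues`, the dict-per-row layout, the `missing_tol` value, and the sample rows are assumptions for illustration. Tools such as TFDV or Great Expectations cover the same ground in production.

```python
def profile(rows):
    """Per-field null rate and set of observed value types for dict rows."""
    fields = {}
    for row in rows:
        for name, value in row.items():
            stats = fields.setdefault(name, {"present": 0, "types": set()})
            if value is not None:
                stats["present"] += 1
                stats["types"].add(type(value).__name__)
    n = len(rows)
    # A field absent from a row counts as null for that row.
    return {name: {"null_rate": 1 - st["present"] / n, "types": st["types"]}
            for name, st in fields.items()}


def upstream_issues(baseline_rows, current_rows, missing_tol=0.05):
    """List schema and missingness differences between two batches."""
    base, cur = profile(baseline_rows), profile(current_rows)
    issues = []
    for name in sorted(base.keys() - cur.keys()):
        issues.append(f"field removed: {name}")
    for name in sorted(cur.keys() - base.keys()):
        issues.append(f"field added: {name}")
    for name in sorted(base.keys() & cur.keys()):
        new_types = cur[name]["types"] - base[name]["types"]
        if new_types:
            issues.append(f"type changed: {name} now has {sorted(new_types)}")
        delta = cur[name]["null_rate"] - base[name]["null_rate"]
        if delta > missing_tol:
            issues.append(f"missingness up: {name} +{delta:.2f}")
    return issues


# Illustrative rows only.
baseline = [{"age": 34, "country": "DE"}, {"age": 51, "country": "FR"},
            {"age": 29, "country": "DE"}, {"age": 45, "country": "US"}]
current = [{"age": "34", "country": "DE"}, {"age": None, "country": "FR"},
           {"age": None, "country": "DE"},
           {"age": 45, "country": "US", "region": "west"}]

issues = upstream_issues(baseline, current)
for line in issues:
    print(line)
```

If this report is non-empty, the metric drop is more likely a data-feed problem than a modeling problem, and retraining on the affected window would bake the defect into the new model.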
Further reading
- Sculley et al., "Hidden Technical Debt in Machine Learning Systems" (2015): operational failure modes.
- Breck et al., "The ML Test Score" (2017): a practical rubric for production readiness.
Related concepts
- Production ML Pipelines And System Design (ML System Design)
- ML Model Evaluation, Metrics, And Experimentation (ML System Design)
- ML Model Evaluation And Calibration
- Supervised ML Workflows, Interpretability And Deployment (Machine Learning)
- Machine Learning Model Evaluation And Calibration (Machine Learning)
- LLM Evaluation, Offline Metrics, Online Monitoring, and Regression Testing