Production ML Validation And Monitoring — Tech Interview Concept

What's being tested
Ability to design and operate reliable post-deployment checks: choose sensible metrics and statistical tests, detect real issues (drift, label problems, regression), and specify remediation and deployment workflows that balance sensitivity and noise.

Core knowledge

Key production metrics: ROC AUC, log-loss, calibration (reliability curve, Brier score), latency, and error rate by slice.
Data-drift tests: PSI, KS-test for univariate; classifier-based two-sample tests for multivariate.
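Concept/label shift: detect label shift from changes in the predicted-class distribution, and correct for it with confusion-matrix, EM, or importance-weighting methods; concept shift (a change in P(y|x)) generally needs fresh labels to confirm.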
Drift detectors: ADWIN and Page-Hinkley for streaming data; population stability index (PSI) for batch comparisons.
Instrumentation: Prometheus/Grafana for metrics, TFDV/Great Expectations for data validation, Feast for feature stores.
Deployment patterns: shadow/canary, champion–challenger, automatic rollback thresholds.
Sampling and ground truth: stratified label sampling, labeling latency impact, statistical power and multiple-testing control.
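
The univariate drift tests listed above can be sketched in a few lines. This is a minimal, standard-library-only illustration, not a production implementation: the function names `psi` and `ks_statistic`, the ten quantile bins, the `eps` floor, and the synthetic Gaussian data are all assumptions made for the example. In practice `scipy.stats.ks_2samp` would usually replace the hand-rolled KS statistic and also return a p-value.

```python
import bisect
import math
import random


def psi(baseline, current, n_bins=10, eps=1e-6):
    """Population stability index of `current` measured against `baseline`.

    Bin edges are baseline quantiles, so each bin holds roughly 1/n_bins
    of the baseline sample.
    """
    ordered = sorted(baseline)
    # Interior cut points at the baseline quantiles (n_bins - 1 of them).
    edges = [ordered[len(ordered) * i // n_bins] for i in range(1, n_bins)]

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            counts[bisect.bisect_right(edges, v)] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    p = bin_fractions(baseline)
    q = bin_fractions(current)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    gap = 0.0
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Synthetic data: one sample from the baseline distribution, one shifted.
rng = random.Random(0)
train = [rng.gauss(0.0, 1.0) for _ in range(5000)]
same = [rng.gauss(0.0, 1.0) for _ in range(5000)]
shifted = [rng.gauss(0.5, 1.0) for _ in range(5000)]

psi_same, psi_shifted = psi(train, same), psi(train, shifted)
ks_same, ks_shifted = ks_statistic(train, same), ks_statistic(train, shifted)
```

A commonly cited rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, but those cutoffs are conventions rather than statistical guarantees, and they should be tuned against the alert noise a team can tolerate.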

Worked example — "How would you detect and respond to data drift in production?"
Start by defining which distributions to monitor: raw features, model inputs (after featurization), and prediction scores. Establish baselines from the training and validation data, then run univariate PSI/KS tests on each feature plus a classifier-based two-sample test for multivariate shifts. Concurrently track model performance on sampled labeled data, accounting for label latency. If drift is detected, triage whether it is covariate shift, label shift, or an upstream data-quality issue. Specify thresholds and the action each one triggers: increase label sampling, deploy a canary model or roll back, retrain on recent data, and add root-cause alerts for feature schema changes or missingness.
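
One way to make the "thresholds and actions" part of this answer concrete is a small triage function. The sketch below is hypothetical throughout: the `DriftSignals` fields, the threshold constants, and the action names are invented for illustration and would be replaced by whatever a real monitoring stack exposes.

```python
from dataclasses import dataclass
from typing import List, Optional

# All thresholds are illustrative placeholders, not recommended values.
PSI_ALERT = 0.25
DOMAIN_AUC_ALERT = 0.70
MISSING_RATE_ALERT = 0.05
METRIC_DROP_ALERT = 0.03


@dataclass
class DriftSignals:
    max_feature_psi: float        # worst univariate PSI across features
    domain_classifier_auc: float  # baseline-vs-current classifier AUC (0.5 = indistinguishable)
    missing_rate_delta: float     # increase in missing-value rate vs. baseline
    schema_changed: bool
    labeled_metric_drop: Optional[float]  # None while labels are still pending


def triage(s: DriftSignals) -> List[str]:
    """Map monitoring signals to an ordered list of follow-up actions."""
    # Upstream data-quality problems come first: retraining cannot fix them.
    if s.schema_changed or s.missing_rate_delta > MISSING_RATE_ALERT:
        return ["alert_upstream_data_owner", "hold_retraining"]

    drifted = (s.max_feature_psi > PSI_ALERT
               or s.domain_classifier_auc > DOMAIN_AUC_ALERT)
    metric_dropped = (s.labeled_metric_drop is not None
                      and s.labeled_metric_drop > METRIC_DROP_ALERT)

    if drifted and metric_dropped:
        return ["rollback_or_canary_previous_model", "retrain_on_recent_data"]
    if drifted:
        # Inputs moved but no confirmed performance hit: gather evidence.
        return ["increase_label_sampling", "keep_monitoring"]
    if metric_dropped:
        # Performance fell with stable inputs: suspect label or concept shift.
        return ["investigate_label_or_concept_shift"]
    return ["no_action"]


example = triage(DriftSignals(0.40, 0.55, 0.0, False, None))
```

Checking data quality before anything else is a deliberate design choice here: it keeps a schema change or a spike in missing values from being misread as genuine distribution shift.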

A common pitfall
Focusing solely on overall model metric drops (e.g., accuracy) and immediately retraining. Many problems are upstream (schema changes, missingness, label delay) or transient noise. Retraining without diagnosing label shift or sampling bias can perpetuate failure modes and create feedback loops.
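
Diagnosing label shift before retraining can be cheap. The sketch below uses the adjusted-count idea for a binary classifier: if only the class prior has moved (so P(x | y) is unchanged), the true and false positive rates measured on held-out labeled data still apply, and the new prior can be backed out from the rate of positive predictions in unlabeled traffic. The function name and the example numbers are assumptions for illustration, and the estimate is only valid when the label-shift assumption actually holds.

```python
def estimate_positive_rate(tpr, fpr, predicted_positive_rate):
    """Adjusted-count estimate of the positive-class prior in unlabeled traffic.

    Under label shift, P(x | y) is unchanged, so
        predicted_positive_rate = tpr * pi + fpr * (1 - pi).
    Solving for pi gives the estimate returned here.
    """
    if abs(tpr - fpr) < 1e-9:
        raise ValueError("classifier is uninformative; prior is not identifiable")
    pi = (predicted_positive_rate - fpr) / (tpr - fpr)
    # Sampling noise can push the raw estimate outside [0, 1]; clip it.
    return min(1.0, max(0.0, pi))


# Held-out validation data gave TPR = 0.8 and FPR = 0.1 (made-up numbers).
# With a training prior of 0.2 the model flags 0.8*0.2 + 0.1*0.8 = 24% of
# traffic; production now shows 45% flagged.
new_prior = estimate_positive_rate(tpr=0.8, fpr=0.1, predicted_positive_rate=0.45)
```

If the estimated prior has moved a long way from the training prior while per-class behaviour looks stable, reweighting or recalibrating the existing model may be a better first response than a full retrain.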

Related concepts