Debugging a Sudden Accuracy Drop in a Deployed ML Pipeline
Context
You are on call for a production machine learning service. Monitoring alerts show that model accuracy, which had been stable, dropped suddenly after a deployment. Labels may arrive with a delay, and traffic patterns can shift over time. You need to diagnose and fix the issue systematically.
Task
Describe a step-by-step process to debug this accuracy drop, including:
- How you would triage and prioritize (e.g., rollback, canary, guardrails).
- The tools and logs you would inspect.
- The metrics and statistical tests you would compute, for both data and model performance; an illustrative drift-check sketch follows this list.
- How you would isolate the root cause across data, model, code/config, infra, and labels.
- How you would validate the fix and prevent regressions.
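For illustration, an answer to the statistical-tests point might include a drift check roughly like the sketch below. This is a minimal sketch, not a prescribed standard: the PSI alert level (0.2), the KS significance level (0.01), and the reference/current window framing are all assumptions.

```python
# Sketch: compare post-deploy feature distributions against a pre-deploy
# reference window using PSI and a two-sample KS test. Thresholds are
# illustrative defaults, not prescriptions.
import numpy as np
from scipy.stats import ks_2samp

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index of `actual` relative to `expected`."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    e_frac = np.histogram(expected, edges)[0] / len(expected)
    a_frac = np.histogram(np.clip(actual, edges[0], edges[-1]), edges)[0] / len(actual)
    e_frac = np.clip(e_frac, 1e-6, None)  # avoid log(0) / divide-by-zero
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

def drift_report(reference, current, features, psi_alert=0.2, alpha=0.01):
    """Flag features whose distribution shifted between two DataFrames:
    `reference` (e.g., the week before the deploy) and `current` (post-deploy)."""
    flagged = []
    for col in features:
        ref = reference[col].dropna().to_numpy()
        cur = current[col].dropna().to_numpy()
        ks_stat, p_value = ks_2samp(ref, cur)  # two-sample Kolmogorov-Smirnov
        score = psi(ref, cur)
        if score > psi_alert or p_value < alpha:
            flagged.append({"feature": col, "psi": score, "ks_p": float(p_value)})
    return sorted(flagged, key=lambda r: r["psi"], reverse=True)
```

Ranking flagged features by PSI is a convenient way to point the investigation at the inputs most likely to explain the drop.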
Be specific about:
- Data quality, drift, and schema checks.
- Training vs. inference preprocessing parity.
- Model registry/versioning and environment differences.
- Label delays and evaluation correctness; an illustrative label-maturity evaluation sketch follows this list.
- Offline reproduction and A/B/shadow testing strategies.
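For the label-delay point, an answer might sketch an evaluation that scores only predictions old enough for labels to have arrived, and that reports label coverage alongside accuracy. The column names, the three-day maturity window, and the join key below are hypothetical, chosen only to make the idea concrete.

```python
# Sketch: score only "matured" predictions and report label coverage so a
# labeling/join problem is not mistaken for a model regression.
import pandas as pd

def matured_accuracy(preds: pd.DataFrame, labels: pd.DataFrame,
                     label_delay: pd.Timedelta = pd.Timedelta(days=3),
                     as_of=None) -> dict:
    """preds: request_id, predicted_label, scored_at; labels: request_id, true_label.
    `as_of` defaults to the newest prediction timestamp."""
    as_of = as_of if as_of is not None else preds["scored_at"].max()
    matured = preds[preds["scored_at"] <= as_of - label_delay]
    joined = matured.merge(labels, on="request_id", how="left")
    labeled = joined.dropna(subset=["true_label"])
    return {
        "matured_rows": len(matured),
        # A post-deploy drop in coverage points at the label pipeline or the
        # join key, not necessarily at the model itself.
        "label_coverage": len(labeled) / max(len(matured), 1),
        "accuracy": float((labeled["predicted_label"] == labeled["true_label"]).mean()),
    }
```

Re-scoring the same matured cohort offline with the previous model version then helps separate a genuine model regression from an evaluation artifact.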