Model Evaluation, Calibration, And Validation
Asked of: Machine Learning Engineer
What's being tested
These questions test a Machine Learning Engineer's ability to produce, validate, and operate probabilistic models (e.g., click-through-rate predictors) end-to-end: choosing the right evaluation metrics, preventing data leakage with proper validation splits, diagnosing miscalibration or drift, and specifying production-ready calibration and monitoring. Interviewers probe whether you can justify metric choices for downstream systems (ranking, bidding, personalization), and design offline and online validation that reflects production behavior.
Core knowledge
- Probability calibration: post-hoc methods such as Platt scaling (a logistic fit on the raw scores) and isotonic regression map raw scores to calibrated probabilities; isotonic regression can overfit on small validation sets.
- Log loss / cross-entropy: the primary loss for probabilistic forecasts, $\text{LogLoss} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right]$, where $p_i$ is the predicted probability and $y_i \in \{0, 1\}$ is the label. It is sensitive to extreme mispredictions: a confident wrong prediction incurs a very large penalty.
- Brier score & ECE: the Brier score is the mean squared error of the predicted probabilities, $\text{Brier} = \frac{1}{N}\sum_{i=1}^{N}(p_i - y_i)^2$. Expected Calibration Error (ECE) estimates calibration across probability bins; compute it with sufficiently many samples per bin.
- Ranking vs probabilistic metrics: AUC-ROC measures discriminative rank ordering; PR-AUC matters for rare positives; good AUC does not imply good calibration.
- Validation splits: use a time-based holdout for temporal data to avoid leakage; use a user-level split (all of a user's events kept in the same fold) to prevent leakage across sessions. Avoid random CV for sequential click logs.
- Stratified sampling & class imbalance: for rare clicks, use stratified sampling or resampling carefully; preserve the production base rate when calibrating, since calibration depends on class prevalence.
- Offline → online parity: validate by shadowing in production and compare offline proxies (e.g., log-loss) with online metrics (CTR, revenue). Track uplift or degradation per cohort.
- Feature leakage & leakage tests: check that no forward-looking features (e.g., next-impression timestamps) are present; run permutation importance and time-shifted feature tests.
- Segmented evaluation: compute metrics by user segment, device, geography, or advertiser; calibration can vary widely across segments and needs per-segment monitoring.
- Hyperparameter tuning & early stopping: tune on a validation holdout; regularize properly for calibration-sensitive models; tree models like XGBoost/LightGBM often provide strong AUC but may need calibration.
- Operational metrics & monitoring: monitor population calibration, per-segment ECE, prediction distribution shifts, and input feature drift; alert when drift exceeds thresholds and trigger retraining.
- Uncertainty & decision thresholds: when probabilities are used for downstream thresholds (bidding, content selection), simulate expected utility under probability errors; calibrate to optimize expected downstream value.
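The three probabilistic metrics named in the list above can be sketched in a few lines. This is a minimal, standard-library-only illustration, assuming binary labels and an equal-width binning scheme for ECE (one common choice among several); the toy labels and predictions are made up for the example, and production code would more likely use NumPy or scikit-learn.

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip so log(0) never occurs
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def brier_score(y_true, p_pred):
    """Mean squared error between predicted probabilities and 0/1 labels."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)

def expected_calibration_error(y_true, p_pred, n_bins=10):
    """Equal-width ECE: size-weighted mean of |observed rate - mean prediction| per bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, p_pred):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((y, p))
    n = len(y_true)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        observed = sum(y for y, _ in members) / len(members)
        predicted = sum(p for _, p in members) / len(members)
        ece += (len(members) / n) * abs(observed - predicted)
    return ece

# Toy data (illustrative only): 20% base rate.
y = [0, 0, 0, 1, 0, 0, 0, 1, 0, 0]
p_calibrated = [0.2] * 10      # matches the base rate
p_overconfident = [0.6] * 10   # same (tied) ordering, inflated probabilities

for name, p in [("calibrated", p_calibrated), ("overconfident", p_overconfident)]:
    print(name, log_loss(y, p), brier_score(y, p), expected_calibration_error(y, p))
```

Both prediction vectors rank the examples identically, yet every probabilistic metric is worse for the inflated one, which is the gap between ranking quality and calibration that the list describes.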
Worked example — Build and evaluate click prediction models
Start by clarifying: "Is the model used as a probability for downstream systems (bidding/ranking) or only for ranking? What are the latency and storage constraints? What’s the acceptable calibration drift?" A strong answer structures itself around (1) data and validation design, (2) modeling and calibration, (3) evaluation and online validation, (4) monitoring and rollout. For data, propose a time-based training/validation/test split combined with a per-user split to avoid session leakage; state how you’ll handle rare positives and cold-start users. For modeling, propose a baseline XGBoost or light neural net trained with log-loss, with early stopping on validation; then apply Platt scaling or isotonic regression using a held-out calibration set. For evaluation, pick a primary metric: log-loss if probabilities matter, PR-AUC if the focus is rare-click ranking; always report AUC and ECE and show reliability diagrams. Flag a tradeoff explicitly: optimizing for AUC may worsen calibration, so choose an objective aligned with downstream cost. For production, recommend a shadow deployment and compare offline predictions to live CTR by cohort; close with "if I had more time, I'd run per-advertiser and per-device calibration, build automated drift detection, and A/B test decision thresholds in production."
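The calibration step in this answer can be illustrated with a small Platt-scaling sketch. It fits `p = sigmoid(a * score + b)` on a held-out calibration set by plain batch gradient descent, which is a simplification chosen for readability (library implementations typically use a Newton-style solver). The `simulate` function and its data-generating process are hypothetical: they stand in for a model whose raw scores are overconfident and shifted.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def log_loss(labels, probs, eps=1e-15):
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)

def fit_platt(scores, labels, lr=0.1, n_iter=1500):
    """Fit p = sigmoid(a * score + b) by batch gradient descent on log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(n_iter):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # d(log loss)/d(logit)
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def simulate(n, rng):
    """Hypothetical data: true logit z, label ~ Bernoulli(sigmoid(z)), raw score 3z + 2."""
    scores, labels = [], []
    for _ in range(n):
        z = rng.gauss(-2.0, 1.0)
        labels.append(1 if rng.random() < sigmoid(z) else 0)
        scores.append(3.0 * z + 2.0)  # overconfident and shifted on purpose
    return scores, labels

rng = random.Random(0)
cal_scores, cal_labels = simulate(1000, rng)    # held-out calibration set
test_scores, test_labels = simulate(3000, rng)  # separate evaluation set

a, b = fit_platt(cal_scores, cal_labels)
loss_raw = log_loss(test_labels, [sigmoid(s) for s in test_scores])
loss_cal = log_loss(test_labels, [sigmoid(a * s + b) for s in test_scores])
print(f"a={a:.3f} b={b:.3f} raw log-loss={loss_raw:.4f} calibrated={loss_cal:.4f}")
```

Because the fitted map is monotone in the score (for positive `a`), it leaves the ranking and therefore AUC unchanged while correcting the probabilities. Isotonic regression would replace the two-parameter sigmoid with a step function and needs more calibration data to avoid overfitting.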
A second angle — Model y from x and interpret distributions
When asked to pick regression vs classification and to interpret distributions, focus on framing: decide if the target is inherently a probability or a real-valued outcome (e.g., dwell time vs click). Use distributional diagnostics: KS test, Wasserstein distance, and P(Predicted|Actual) histograms to compare predicted and empirical distributions; quantify shift with KL or JS divergence. If class imbalance or heavy tails exist, prefer proper scoring rules (log-loss for probabilistic, quantile loss for tails) and consider sample weighting. Cold-start constraints change evaluation: use backoff models, evaluate cold-start cohorts separately, and report per-cohort metrics. The same calibration and time-based validation principles apply, but you’ll emphasize distributional alignment rather than just ranking performance.
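Two of the distributional diagnostics mentioned here can be sketched with the standard library alone: the two-sample KS statistic and the Jensen-Shannon divergence over histograms. The function names, bin count, and sample values are illustrative assumptions, and the KS function returns only the statistic, not a p-value.

```python
import math

def histogram(values, n_bins, lo, hi):
    """Normalized equal-width histogram; out-of-range values are clamped to the edge bins."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        idx = min(max(int((v - lo) / width), 0), n_bins - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits between two discrete distributions (range 0 to 1)."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]  # mixture; positive wherever p or q is
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

# Illustrative samples: predicted probabilities at training time vs. at serving time.
train_preds = [0.05, 0.10, 0.12, 0.20, 0.25, 0.30]
serve_preds = [0.15, 0.30, 0.35, 0.45, 0.50, 0.60]

print("KS:", ks_statistic(train_preds, serve_preds))
print("JS:", js_divergence(histogram(train_preds, 5, 0.0, 1.0),
                           histogram(serve_preds, 5, 0.0, 1.0)))
```

JS divergence is bounded and symmetric, which makes it easier to threshold than raw KL; its value does depend on the binning, so the bin edges should be fixed once and reused across comparisons.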
Common pitfalls
Pitfall: Optimizing the wrong metric. Selecting a model on AUC alone when downstream systems consume its probabilities yields a model that ranks well but misprices bids; instead, align the training loss and selection metric (log-loss) with downstream utility whenever probabilities feed decisions.
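A small sketch makes this pitfall concrete: applying any strictly increasing distortion to the predictions leaves AUC unchanged but can badly damage log-loss. The labels, predictions, and the square-root distortion below are made-up illustrations, and the pairwise AUC is quadratic in sample size, so it is only suitable for toy data.

```python
import math

def auc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def log_loss(y_true, p_pred, eps=1e-15):
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Illustrative labels and predictions (30% base rate, mean prediction about 0.31).
y = [0, 0, 0, 0, 1, 0, 0, 1, 0, 1]
p = [0.05, 0.10, 0.10, 0.20, 0.30, 0.30, 0.40, 0.60, 0.20, 0.80]

# A strictly increasing distortion: rank order is preserved, probabilities are inflated.
p_distorted = [math.sqrt(x) for x in p]

print("AUC     :", auc(y, p), auc(y, p_distorted))
print("log-loss:", log_loss(y, p), log_loss(y, p_distorted))
```

A bidder that multiplies these probabilities by a value per click would overbid on nearly every impression under the distorted model, even though an AUC dashboard would show no difference between the two.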
Pitfall: Improper validation splits. Using random CV on time-series click logs masks leakage; always simulate production ordering with time- or user-based splits and hold out a calibration set separate from validation used for hyperparameter tuning.
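One way to sketch the split described in this pitfall is a chronological four-way partition (train, tuning, calibration, test), plus an optional stricter filter that keeps only users never seen in training. The `Event` type, the split fractions, and the helper names are hypothetical choices for illustration, not a prescribed recipe.

```python
from typing import NamedTuple

class Event(NamedTuple):
    user_id: str
    ts: int       # event time, e.g. epoch seconds
    clicked: int  # 0 or 1

def chronological_split(events, fractions=(0.6, 0.15, 0.15, 0.10)):
    """Return [train, tune, calibrate, test], each strictly later in time than the last."""
    ordered = sorted(events, key=lambda e: e.ts)
    n = len(ordered)
    cuts, acc = [0], 0.0
    for f in fractions[:-1]:
        acc += f
        cuts.append(int(round(acc * n)))
    cuts.append(n)
    return [ordered[cuts[i]:cuts[i + 1]] for i in range(len(fractions))]

def drop_seen_users(train, later):
    """Stricter user-level holdout: keep only later events from users absent in training."""
    seen = {e.user_id for e in train}
    return [e for e in later if e.user_id not in seen]

# Illustrative log: returning users on even timestamps, one-off users on odd ones.
events = [
    Event(user_id=f"u{i % 5}" if i % 2 == 0 else f"v{i}", ts=i, clicked=i % 3 == 0)
    for i in range(20)
]

train, tune, calib, test = chronological_split(events)
cold_test = drop_seen_users(train, test)

print([len(part) for part in (train, tune, calib, test)])
print("cold-start test users:", [e.user_id for e in cold_test])
```

Hyperparameters and early stopping would use `tune`, the calibrator would be fit on `calib`, and `test` stays untouched until the final report. Evaluating on `cold_test` separately gives the cold-start cohort its own metrics.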
Pitfall: Over-calibrating on small slices. Applying isotonic regression on small per-segment samples gives perfect-looking calibration but overfits; prefer Platt scaling or hierarchical calibration with shrinkage, and validate calibration on held-out segments.
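Shrinkage for per-segment calibration can be sketched with a simple multiplicative correction. Each segment's observed-to-expected click ratio is pulled toward the global ratio, with the pull controlled by a pseudo-count. This is one illustrative form of shrinkage, not the only hierarchical approach; `prior_strength` and the toy segments are assumptions made for the example.

```python
from collections import defaultdict

def shrunk_correction_factors(rows, prior_strength=50.0):
    """rows: iterable of (segment, predicted_prob, clicked).

    Returns a per-segment factor to multiply predictions by. Segments with little
    data stay close to the global observed/expected ratio instead of their noisy own.
    """
    clicks = defaultdict(float)
    expected = defaultdict(float)
    for seg, p, y in rows:
        clicks[seg] += y
        expected[seg] += p  # expected clicks under the model
    global_factor = sum(clicks.values()) / sum(expected.values())
    return {
        seg: (clicks[seg] + prior_strength * global_factor)
             / (expected[seg] + prior_strength)
        for seg in expected
    }

# Illustrative data: the model predicts 0.1 everywhere.
# "big" has 10,000 impressions with a true rate of 0.2; "tiny" has 10 impressions, 0 clicks.
rows = [("big", 0.1, 1 if i % 5 == 0 else 0) for i in range(10000)]
rows += [("tiny", 0.1, 0)] * 10

factors = shrunk_correction_factors(rows)
print(factors)
```

Without shrinkage the tiny segment's raw ratio would be 0, driving every prediction in that segment to zero on the evidence of ten impressions. With it, the tiny segment inherits roughly the global correction while the large segment keeps its own. Corrected probabilities should still be clipped to at most 1 and validated on held-out data.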
Connections
Interviewers may pivot to experiment design (A/B test metrics when model replaces ranking), feature-store design (online/offline feature parity, freshness), or model serving concerns (latency, versioning, canary rollouts). Be prepared to discuss how evaluation choices affect deployment and monitoring.
Further reading
- Predicting Good Probabilities with Supervised Learning, Niculescu-Mizil & Caruana (2005): empirical comparison of calibration methods.
- On Calibration of Modern Neural Networks, Guo et al. (2017): diagnosis and remedies for deep-net miscalibration.
Related concepts
- Machine Learning Model Evaluation And Calibration (Machine Learning)
- ML Model Evaluation, Metrics, And Experimentation (ML System Design)
- Model Evaluation and Calibration
- ML Model Evaluation And Calibration
- LLM Evaluation, Offline Metrics, Online Monitoring, and Regression Testing
- Classifier Evaluation, Calibration, And Thresholding