Model Evaluation, Calibration, And Validation

What's being tested

These questions test a Machine Learning Engineer's ability to produce, validate, and operate probabilistic models (e.g., click-through-rate predictors) end-to-end: choosing the right evaluation metrics, preventing data leakage with proper validation splits, diagnosing miscalibration or drift, and specifying production-ready calibration and monitoring. Interviewers probe whether you can justify metric choices for downstream systems (ranking, bidding, personalization), and design offline and online validation that reflects production behavior.

Core knowledge

Probability calibration — post-hoc methods like Platt scaling (logistic) and isotonic regression map raw scores to calibrated probabilities; isotonic can overfit on small validation sets.
Log loss / cross-entropy — primary loss for probabilistic forecasts: $\text{LogLoss}=-\frac{1}{N}\sum_i[y_i\log p_i+(1-y_i)\log(1-p_i)].$ Sensitive to extreme mispredictions.
Brier score & ECE — Brier score is the mean squared error of predicted probabilities: $\text{Brier}=\frac{1}{N}\sum_i(p_i-y_i)^2.$ Expected Calibration Error (ECE) bins predictions by probability and averages the gap between mean predicted probability and observed positive rate, weighted by bin size; it is only stable with enough samples per bin.
Ranking vs probabilistic metrics — AUC-ROC measures discriminative rank ordering; PR-AUC matters for rare positives; good AUC does not imply good calibration.
Validation splits — use a time-based holdout for temporal data to avoid leakage; use a user-level split (all of a user's events in a single partition) to prevent leakage across sessions. Avoid random CV for sequential click logs.
Stratified sampling & class imbalance — for rare clicks, use stratified sampling or resampling carefully; fit the calibrator on data at the production base rate (or correct for the sampling rate), since calibrated probabilities depend on class prevalence.
Offline → online parity — validate by shadowing in production and compare offline proxies (e.g., log-loss) with online metrics (CTR, revenue). Track uplift or degradation per cohort.
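Feature leakage & leakage tests — check that no look-ahead features (e.g., next-impression timestamps) are present; run permutation importance to flag features with implausibly high importance, and run time-shifted feature tests to confirm each feature uses only data available at prediction time.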
Segmented evaluation — compute metrics by user segment, device, geography, or advertiser; calibration can vary widely across segments and needs per-segment monitoring.
Hyperparameter tuning & early stopping — tune on validation holdout; regularize properly for calibration-sensitive models; tree models like XGBoost/LightGBM often provide strong AUC but may need calibration.
Operational metrics & monitoring — monitor population calibration, per-segment ECE, prediction distribution shifts, and input feature drift; alert when drift exceeds thresholds and trigger retraining.
Uncertainty & decision thresholds — when probabilities are used for downstream thresholds (bidding, content selection), simulate expected utility under probability errors; calibrate to optimize expected downstream value.
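The three probabilistic metrics defined above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the function names, the `eps` clipping constant, and the equal-width binning scheme for ECE are assumptions chosen for clarity, and the four-row dataset is made up.

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood; eps clips probabilities away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

def brier_score(y_true, p_pred):
    """Mean squared error between predicted probability and the 0/1 label."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)

def expected_calibration_error(y_true, p_pred, n_bins=10):
    """Equal-width-bin ECE: sum of (bin weight) * |mean prediction - observed rate|."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, p_pred):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((y, p))
    n = len(y_true)
    ece = 0.0
    for b in bins:
        if not b:
            continue  # empty bins contribute nothing
        mean_p = sum(p for _, p in b) / len(b)
        rate = sum(y for y, _ in b) / len(b)
        ece += (len(b) / n) * abs(mean_p - rate)
    return ece

# Made-up labels and predictions, for illustration only.
y = [0, 0, 1, 1]
p = [0.1, 0.4, 0.35, 0.8]
print(log_loss(y, p), brier_score(y, p), expected_calibration_error(y, p, n_bins=2))
```

With so few samples the ECE value is dominated by binning noise, which is why the bin count should be chosen so that each bin holds enough examples.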

Worked example — Build and evaluate click prediction models

Start by clarifying: "Is the model's output used as a probability by downstream systems (bidding/ranking) or only for ranking? What are the latency and storage constraints? What’s the acceptable calibration drift?" A strong answer structures itself around (1) data and validation design, (2) modeling and calibration, (3) evaluation and online validation, (4) monitoring and rollout. For data, propose a time-based training/validation/test split and a per-user split to avoid session leakage; state how you’ll handle rare positives and cold-start users. For modeling, propose a baseline XGBoost or light neural net trained with log-loss, with early stopping on validation; then apply Platt scaling or isotonic regression using a held-out calibration set. For evaluation, pick a primary metric: log-loss if probabilities matter, PR-AUC if the focus is rare-click ranking; always report AUC and ECE and show reliability diagrams. Flag a tradeoff explicitly: optimizing for AUC may worsen calibration, so choose the objective aligned with downstream cost. For production, recommend a shadow deployment and compare offline predictions to live CTR by cohort; close with "if I had more time, I'd run per-advertiser and per-device calibration, build automated drift detection, and A/B test decision thresholds in production."
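The calibration step in the answer above can be sketched as follows. This is a minimal pure-Python version of Platt scaling under stated assumptions: the data is synthetic, the "model" is simulated as deliberately overconfident (its logit is three times the true logit), the calibrator is fitted on the logit of the model's probability, and plain gradient descent stands in for whatever solver a real library would use. All function names and hyperparameters here are hypothetical.

```python
import math
import random

def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def logit(p, eps=1e-6):
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

def mean_log_loss(labels, probs, eps=1e-15):
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)

def fit_platt(scores, labels, lr=0.1, n_iter=500):
    """Fit p = sigmoid(a * score + b) by gradient descent on mean log-loss."""
    a, b = 1.0, 0.0  # start from the identity mapping on the logit scale
    n = len(scores)
    for _ in range(n_iter):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def make_data(n, rng):
    """Synthetic (raw_probability, label) pairs from an overconfident model."""
    rows = []
    for _ in range(n):
        z = rng.uniform(-2.0, 2.0)                  # true logit
        y = 1 if rng.random() < sigmoid(z) else 0   # label drawn from the true probability
        rows.append((sigmoid(3.0 * z), y))          # model reports a logit 3x too large
    return rows

rng = random.Random(0)
calib = make_data(1000, rng)  # held-out calibration set
test = make_data(1000, rng)   # separate set for reporting

a, b = fit_platt([logit(p) for p, _ in calib], [y for _, y in calib])

test_labels = [y for _, y in test]
before = mean_log_loss(test_labels, [p for p, _ in test])
after = mean_log_loss(test_labels, [sigmoid(a * logit(p) + b) for p, _ in test])
print(f"a={a:.3f} b={b:.3f} log-loss before={before:.4f} after={after:.4f}")
```

Because the simulated model inflates its logit by a factor of three, the fitted slope `a` should land near one third, and log-loss on the separate test set should drop after calibration. Fitting on one set and reporting on another mirrors the held-out calibration set the answer calls for.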

A second angle — Model y from x and interpret distributions

When asked to pick regression vs classification and to interpret distributions, focus on framing: decide whether the target is inherently a probability or a real-valued outcome (e.g., click vs dwell time). Use distributional diagnostics to compare predicted and empirical distributions: the KS test, Wasserstein distance, and histograms of predictions conditioned on the actual outcome; quantify shift with KL or JS divergence. If class imbalance or heavy tails exist, prefer proper scoring rules (log-loss for probabilistic outputs, quantile loss for tail quantiles) and consider sample weighting. Cold-start constraints change evaluation: use backoff models, evaluate cold-start cohorts separately, and report per-cohort metrics. The same calibration and time-based validation principles apply, but you’ll emphasize distributional alignment rather than just ranking performance.
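Two of the diagnostics named above, the two-sample KS statistic and the JS divergence, can be sketched in plain Python. This is an illustrative version only: the function names, the fixed-range histogram binning, and the dwell-time numbers are assumptions, and a production pipeline would more likely use a statistics library.

```python
import math

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

def histogram(values, n_bins, lo, hi):
    """Normalized equal-width histogram over [lo, hi]; out-of-range values are clamped."""
    counts = [0] * n_bins
    for v in values:
        idx = int((v - lo) / (hi - lo) * n_bins)
        counts[min(max(idx, 0), n_bins - 1)] += 1
    return [c / len(values) for c in counts]

def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]

    def kl(u, v):
        return sum(ui * math.log(ui / vi) for ui, vi in zip(u, v) if ui > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Made-up predicted vs. observed dwell times (seconds), for illustration only.
predicted = [3.0, 4.5, 5.0, 6.2, 7.1, 8.0]
observed = [1.0, 2.5, 5.5, 9.0, 14.0, 30.0]

print("KS:", ks_statistic(predicted, observed))
print("JS:", js_divergence(histogram(predicted, 5, 0.0, 30.0),
                           histogram(observed, 5, 0.0, 30.0)))
```

The KS statistic works directly on samples, while the JS divergence depends on the binning, so the bin edges should be fixed once and reused when comparing values over time.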

Common pitfalls

Pitfall: Optimizing the wrong metric. Choosing AUC because it looks high while downstream systems consume probabilities leads to a model that ranks well but misprices bids; instead, align the loss (log-loss) with downstream utility when probabilities feed decisions.

Pitfall: Improper validation splits. Using random CV on time-series click logs masks leakage; always simulate production ordering with time- or user-based splits, and hold out a calibration set separate from the validation set used for hyperparameter tuning.
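One way to sketch these splits in plain Python is shown below. The row schema (`user_id`, `ts`), the partition names, the boundary values, and the use of an MD5 hash to assign users to folds are all assumptions made for illustration, not a prescribed design.

```python
import hashlib
from bisect import bisect_right

def time_split(rows, boundaries, names=("train", "valid", "calib", "test")):
    """Partition rows by timestamp.

    `boundaries` is sorted ascending with len(names) - 1 entries; a row whose ts
    equals a boundary goes to the later partition.
    """
    parts = {name: [] for name in names}
    for row in rows:
        parts[names[bisect_right(boundaries, row["ts"])]].append(row)
    return parts

def user_fold(user_id, n_folds=5):
    """Stable fold assignment so all of a user's events land in the same fold."""
    digest = hashlib.md5(str(user_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_folds

# Made-up click log: ten events from three users at timestamps 0..9.
rows = [{"user_id": f"u{i % 3}", "ts": i} for i in range(10)]

parts = time_split(rows, boundaries=[6, 8, 9])
print({name: [r["ts"] for r in part] for name, part in parts.items()})

# Sanity check against temporal leakage: training data strictly precedes validation data.
assert max(r["ts"] for r in parts["train"]) < min(r["ts"] for r in parts["valid"])

print({uid: user_fold(uid) for uid in ("u0", "u1", "u2")})
```

Keeping `valid` (for tuning) and `calib` (for fitting the calibrator) as separate time windows follows the advice above. Hashing the user id rather than sampling randomly keeps fold assignment reproducible across runs and machines.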

Pitfall: Over-calibrating on small slices. Applying isotonic regression to small per-segment samples gives perfect-looking calibration on the fitting data but overfits; prefer Platt scaling or hierarchical calibration with shrinkage, and validate calibration on held-out data for each segment.

Connections

Interviewers may pivot to experiment design (A/B test metrics when model replaces ranking), feature-store design (online/offline feature parity, freshness), or model serving concerns (latency, versioning, canary rollouts). Be prepared to discuss how evaluation choices affect deployment and monitoring.
