ML Evaluation, Uncertainty, And Safety Guardrails

What's being tested

Candidates must demonstrate practical mastery of model evaluation, uncertainty estimation, and operational safety guardrails for deployed ML systems. Interviewers look for an ability to measure and monitor correctness and confidence (calibration, error types, drift), design abstention/fallback policies, and choose evaluation protocols that surface real-world failure modes. The focus is on ML Engineer responsibilities: training/evaluation methods, deployment-time decision rules, and monitoring signals—not product strategy or low-level infra plumbing.

Core knowledge

Calibration: how closely predicted confidence matches observed accuracy; metrics include Expected Calibration Error (ECE) and Brier score. With predictions grouped into $K$ confidence bins $B_k$ over $n$ examples, ECE = $\sum_{k=1}^K \frac{|B_k|}{n} |\text{acc}(B_k)-\text{conf}(B_k)|$; lower is better for both metrics.
Uncertainty types: aleatoric (data noise) vs epistemic (model uncertainty); aleatoric uncertainty is irreducible for a fixed set of inputs, while epistemic uncertainty shrinks with more data and can be estimated with ensembles.
Out-of-distribution (OOD) detection: methods include density estimation, Mahalanobis distance, max softmax score, and a separate OOD classifier; monitor the OOD rate as a primary safety metric.
Selective prediction / abstention: define a confidence threshold or use a learned rejector; the tradeoff is between coverage and risk; optimize a utility such as maximizing accuracy on the answered examples subject to coverage ≥ c.
Calibration fixes: temperature scaling, isotonic regression, and Platt scaling recalibrate probabilistic outputs post hoc; temperature scaling divides the logits by a single learned scalar before the softmax (cheap, and it preserves accuracy because the argmax is unchanged).
Bayesian & approximate methods: ensembles, MC dropout, and deep Gaussian processes approximate epistemic uncertainty; ensembles often work best in practice for calibration and robustness.
Conformal prediction: distribution-free way to produce valid prediction sets with marginal coverage guarantees, assuming calibration and test data are exchangeable; useful for calibrated abstention in production.
Multi-annotator label models: Dawid–Skene EM or hierarchical Bayesian confusion matrices to infer true labels and annotator reliabilities; use per-example posterior uncertainty in training/validation.
Evaluation design: use time-based splits for temporal drift, stratified splits for class imbalance, and separate challenge / red-team datasets (adversarial, OOD, edge-cases) to estimate worst-case behavior.
RAG-specific evaluation: measure retrieval metrics (recall@k, MRR) and downstream faithfulness (answer grounding rate, hallucination rate via human eval or automated entailment checks); monitor latency-accuracy tradeoffs for reranker depth.
Operational metrics & alerts: track calibration drift, OOD fraction, abstention rate, false-positive safety triggers, and label-distribution shifts (e.g., measured with KL divergence against a reference window). Set SLOs and automated alerts that fire when key metrics cross thresholds.
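The ECE formula and Brier score listed above can be sketched in NumPy as follows. This is a minimal illustration, assuming equal-width confidence bins and top-label confidence; the function names `expected_calibration_error` and `brier_score` are hypothetical and not taken from any particular library.

```python
import numpy as np

def expected_calibration_error(confidence, correct, n_bins=10):
    """ECE = sum_k |B_k|/n * |acc(B_k) - conf(B_k)| over equal-width bins."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Right-closed bins (lo, hi]; interior edges give indices 0..n_bins-1.
    bins = np.digitize(confidence, edges[1:-1], right=True)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidence[mask].mean())
            ece += mask.mean() * gap  # mask.mean() is |B_k| / n
    return float(ece)

def brier_score(probs, labels):
    """Multiclass Brier score: mean squared distance to the one-hot label."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    onehot = np.eye(probs.shape[1])[labels]
    return float(np.mean(np.sum((probs - onehot) ** 2, axis=1)))

# Two bins are populated here: (0.7, 0.8] and (0.9, 1.0].
ece = expected_calibration_error([0.75, 0.75, 0.95, 0.95], [1, 0, 1, 1])
print(f"ECE = {ece:.3f}")  # roughly 0.150
```

Equal-width bins are only one choice; equal-mass bins are a common alternative when most confidences cluster near 1.0.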

Tip: always report both point-estimate metrics (accuracy, ROC AUC) and uncertainty-aware metrics (ECE, Brier, coverage-vs-risk curves) when proposing a deployment policy.
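A coverage-vs-risk curve can be computed from per-example confidence and correctness, as in the sketch below. It is a minimal illustration, assuming confidences are distinct (ties would make the coverage at a given threshold larger than reported); `risk_coverage_curve` and `threshold_for_coverage` are hypothetical helper names.

```python
import numpy as np

def risk_coverage_curve(confidence, correct):
    """Answer the most confident examples first; report coverage and risk."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    order = np.argsort(-confidence, kind="stable")
    errors = 1.0 - correct[order]
    kept = np.arange(1, len(errors) + 1)
    coverage = kept / len(errors)        # fraction of traffic answered
    risk = np.cumsum(errors) / kept      # error rate on answered examples
    return coverage, risk, confidence[order]

def threshold_for_coverage(confidence, correct, min_coverage):
    """Lowest-risk threshold whose coverage is at least min_coverage."""
    coverage, risk, thresholds = risk_coverage_curve(confidence, correct)
    feasible = coverage >= min_coverage
    best = int(np.argmin(np.where(feasible, risk, np.inf)))
    return float(thresholds[best]), float(coverage[best]), float(risk[best])

# Answer when confidence >= threshold, abstain otherwise.
threshold, cov, risk = threshold_for_coverage(
    [0.9, 0.8, 0.7, 0.6], [1, 1, 0, 1], min_coverage=0.5
)
print(threshold, cov, risk)
```

Thresholds chosen this way should be picked on a validation set and then checked on separate held-out data, since the selected operating point is otherwise optimistically biased.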

Worked example — Design a chatbot fallback for unknown questions

First 30 seconds: clarify the intended user experience (graceful degradation vs strict refusal), the latency SLO, and whether a human in the loop exists.

Frame the solution around three pillars: (1) Detection: compute model confidence and an OOD score (e.g., max-softmax plus a dedicated OOD model); (2) Policy: define thresholds for auto-reply, rerouting to retrieval or tools, or escalating to a human; (3) Monitoring & feedback: log unknowns for retraining and measure coverage vs error.

Walk through a concrete policy with illustrative thresholds: if confidence < 0.6 or OOD probability > 0.5, run a fallback pipeline that first tries RAG retrieval to find grounding and, if confidence is still low, shows a refusal template or routes to a human. A key tradeoff to flag is false-positive refusals (user-visible friction) vs false-negative hallucinations (safety risk); tuning the thresholds requires a loss that reflects business costs.

Close by saying: if I had more time, I'd prototype on historical logs to set thresholds, add conformal prediction for calibrated prediction sets, and instrument a human-feedback loop to collect labeled examples quickly.
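The policy described in this walkthrough can be written as a small decision function. The sketch below is illustrative only: the `Action` names, the `Thresholds` container, and the `grounded_confidence` argument (the model's confidence after a retrieval attempt) are assumptions introduced here, and the 0.6 and 0.5 defaults are the example values from the text, not recommended settings.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Action(Enum):
    ANSWER = "answer"
    ANSWER_WITH_RETRIEVAL = "answer_with_retrieval"
    REFUSE_OR_ESCALATE = "refuse_or_escalate"

@dataclass(frozen=True)
class Thresholds:
    min_confidence: float = 0.6  # example value; tune on historical logs
    max_ood_prob: float = 0.5    # example value; tune on historical logs

def needs_fallback(confidence: float, ood_prob: float,
                   t: Thresholds = Thresholds()) -> bool:
    """True when the direct answer is not trusted."""
    return confidence < t.min_confidence or ood_prob > t.max_ood_prob

def decide(confidence: float, ood_prob: float,
           grounded_confidence: Optional[float] = None,
           t: Thresholds = Thresholds()) -> Action:
    """Pick an action; grounded_confidence is None if retrieval found nothing."""
    if not needs_fallback(confidence, ood_prob, t):
        return Action.ANSWER
    if grounded_confidence is not None and grounded_confidence >= t.min_confidence:
        return Action.ANSWER_WITH_RETRIEVAL
    return Action.REFUSE_OR_ESCALATE

print(decide(0.9, 0.1))        # Action.ANSWER
print(decide(0.4, 0.1, 0.8))   # Action.ANSWER_WITH_RETRIEVAL
print(decide(0.4, 0.1))        # Action.REFUSE_OR_ESCALATE
```

Keeping the thresholds in one immutable object makes it straightforward to log which policy version produced each decision, which helps when comparing refusal and hallucination rates across threshold changes.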

A second angle — Improve classifier with noisy multi-annotator labels

The same evaluation and uncertainty principles apply, but the focus shifts to label uncertainty and dataset curation. Start by modeling annotator behavior with Dawid–Skene or hierarchical Bayesian confusion matrices to produce posterior label probabilities rather than hard labels. Train models on soft labels (label posteriors) and propagate annotator uncertainty into calibration objectives or loss weighting. Validate the estimated annotator reliabilities on held-out items, and measure model calibration conditional on annotator certainty. For deployment, abstain on low-confidence predictions that coincide with high annotator disagreement, and prioritize collecting high-value labels from reliable annotators to reduce epistemic uncertainty.
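A compact version of Dawid–Skene EM is sketched below to show how soft labels and per-annotator confusion matrices come out of the same loop. It is a simplified illustration, assuming a dense `(items, annotators)` label matrix with `-1` for missing votes, majority-vote initialization, additive smoothing, and at least one iteration; the function name and parameters are hypothetical.

```python
import numpy as np

def dawid_skene(labels, n_classes, n_iter=50, smoothing=0.01):
    """Return (posteriors[item, class], confusion[annotator, true, observed])."""
    labels = np.asarray(labels, dtype=int)
    n_items, n_annot = labels.shape
    # One-hot votes; rows for missing votes (-1) stay all zero.
    votes = np.zeros((n_items, n_annot, n_classes))
    i_idx, a_idx = np.nonzero(labels >= 0)
    votes[i_idx, a_idx, labels[i_idx, a_idx]] = 1.0

    # Initialise posteriors with a smoothed majority vote.
    post = votes.sum(axis=1) + smoothing
    post /= post.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # M-step: class prior and annotator confusion matrices.
        prior = post.mean(axis=0)
        confusion = np.einsum("it,iao->ato", post, votes) + smoothing
        confusion /= confusion.sum(axis=2, keepdims=True)
        # E-step: log prior plus log-likelihood of each observed vote.
        log_post = np.log(prior + 1e-12) + np.einsum(
            "iao,ato->it", votes, np.log(confusion)
        )
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)
    return post, confusion

# Toy data: annotators 0 and 1 agree with each other, annotator 2 is noisy.
toy = [[0, 0, 1], [0, 0, 0], [1, 1, 0], [1, 1, 1],
       [0, 0, 1], [1, 1, 0], [0, -1, 0], [1, 1, -1]]
soft_labels, confusion = dawid_skene(toy, n_classes=2)
print(soft_labels.round(3))
print(confusion[2].round(2))  # confusion matrix of the noisy annotator
```

The rows of `soft_labels` can be used directly as training targets for a cross-entropy loss, and the row entropy gives a per-example measure of annotator disagreement for the abstention rule described above.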

Common pitfalls

Pitfall: Conflating calibration with accuracy. A model can be high-accuracy but poorly calibrated — reporting only accuracy or AUC hides overconfident errors that cause user-visible failures. Always surface calibration metrics and reliability diagrams.
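The gap between accuracy and calibration can be demonstrated with temperature scaling on synthetic logits. The sketch below is a toy setup, not a production recipe: it assumes labels are sampled from a known softmax, inflates the logits by a factor of 3 to simulate overconfidence, and fits the temperature by grid search on negative log-likelihood (gradient-based fitting is the more common choice). All function names are hypothetical.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)  # stabilise exp
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(logits, labels, temperature=1.0):
    """Mean negative log-likelihood of the true labels."""
    probs = softmax(logits / temperature)
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked + 1e-12)))

def fit_temperature(logits, labels, grid=None):
    """Grid-search the temperature that minimises held-out NLL."""
    if grid is None:
        grid = np.linspace(0.25, 5.0, 96)  # steps of 0.05
    losses = [nll(logits, labels, t) for t in grid]
    return float(grid[int(np.argmin(losses))])

# Synthetic held-out set: labels follow softmax(true_logits).
rng = np.random.default_rng(0)
true_logits = rng.normal(size=(2000, 3))
labels = np.array([rng.choice(3, p=p) for p in softmax(true_logits)])
overconfident = 3.0 * true_logits  # same argmax, inflated confidence

T = fit_temperature(overconfident, labels)
acc_before = np.mean(overconfident.argmax(axis=1) == labels)
acc_after = np.mean((overconfident / T).argmax(axis=1) == labels)
print(f"T={T:.2f}  accuracy before={acc_before:.3f}  after={acc_after:.3f}")
print(f"NLL before={nll(overconfident, labels):.3f}  "
      f"after={nll(overconfident, labels, T):.3f}")
```

Accuracy is identical before and after because dividing by a positive temperature leaves the argmax unchanged, while the likelihood improves; this is the sense in which an accurate model can still be badly calibrated.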

Pitfall: Designing a "fallback" without measurable SLOs. Saying "send to human" or "refuse" is insufficient; quantify acceptable coverage, latency, and human cost, and show how thresholds map to those SLOs.

Pitfall: Ignoring label noise or distributional shift during evaluation. Validating on IID holdouts when real traffic drifts will understate failure rates — include OOD/challenge sets and simulate temporal splits.
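A time-based split and a simple label-shift check might look like the sketch below. It assumes integer class labels and comparable timestamps, and uses additive smoothing so that classes missing from one window do not produce infinite divergence; `temporal_split` and `label_kl_divergence` are hypothetical names.

```python
import numpy as np

def temporal_split(timestamps, cutoff):
    """Train on everything before the cutoff, evaluate on everything after."""
    timestamps = np.asarray(timestamps)
    train_idx = np.flatnonzero(timestamps < cutoff)
    test_idx = np.flatnonzero(timestamps >= cutoff)
    return train_idx, test_idx

def label_kl_divergence(reference_labels, live_labels, n_classes, eps=1e-6):
    """KL(live || reference) between smoothed label frequency vectors."""
    ref = np.bincount(np.asarray(reference_labels, dtype=int),
                      minlength=n_classes) + eps
    live = np.bincount(np.asarray(live_labels, dtype=int),
                       minlength=n_classes) + eps
    ref, live = ref / ref.sum(), live / live.sum()
    return float(np.sum(live * np.log(live / ref)))

train_idx, test_idx = temporal_split([1, 2, 3, 4, 5], cutoff=4)
print(train_idx, test_idx)  # [0 1 2] [3 4]

balanced = [0] * 50 + [1] * 50
skewed = [0] * 90 + [1] * 10
print(label_kl_divergence(balanced, skewed, n_classes=2))
```

In production the live labels are often unavailable or delayed, so the same divergence is commonly computed on predicted-label distributions as a proxy; any alert threshold on it would need to be set empirically.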

Connections

Interviewers may pivot to model interpretability (feature attribution used to triage uncertain predictions), data labeling workflows (active learning to reduce epistemic uncertainty), or adversarial robustness (attacks that artificially push inputs OOD). Being able to connect uncertainty signals to retraining and data collection plans is valuable.
