Model Evaluation and Calibration — Tech Interview Concept

What it is

Model evaluation judges how well a model makes predictions using metrics tied to the task (e.g., accuracy, AUC, log loss). Calibration checks whether predicted probabilities match observed frequencies: among items scored 0.7, roughly 70% should turn out positive. Good systems need both strong discrimination and well-calibrated confidence.

Why interviewers ask about it

At companies like Meta, models power ranking, ads, integrity, and safety; many downstream decisions (thresholds, budgets, human review) depend on trustworthy probabilities, not just AUC. Poor calibration can quietly bleed revenue or harm user experience by over- or under-reacting to risk scores, even when headline metrics look strong.

Core ideas to know

Discrimination vs calibration: AUC measures ranking; Brier/log loss and reliability diagrams assess probability quality.
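Reliability diagram: bin predictions by score, then plot each bin's mean predicted probability against its observed fraction of positives; points on the diagonal mean perfect calibration.
ECE/MCE: expected and maximum calibration error summarize the per-bin gaps as a count-weighted average and a worst case; both estimators are biased with small samples and sensitive to binning choices.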
Proper scoring rules: log loss and Brier reward honest probabilities; optimizing them encourages calibration.
Post-hoc fixes: Platt (sigmoid), isotonic regression, temperature scaling; fit on a held-out calibration set.
Segment checks: evaluate by cohort (country, device, segments with different base rates) and under distribution shift; recalibrate periodically.
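
The reliability-diagram, ECE/MCE, and Brier ideas above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming binary labels, equal-width bins, and the common count-weighted definition of ECE; the function names and the toy data are hypothetical, and other binning schemes (e.g., equal-mass bins) will give different numbers.

```python
def brier_score(probs, labels):
    """Mean squared gap between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def reliability_bins(probs, labels, n_bins=10):
    """Return (mean_predicted, observed_rate, count) per non-empty equal-width bin."""
    stats = [[0.0, 0, 0] for _ in range(n_bins)]  # [sum of p, positives, count]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        stats[idx][0] += p
        stats[idx][1] += y
        stats[idx][2] += 1
    return [(s / c, pos / c, c) for s, pos, c in stats if c > 0]


def expected_calibration_error(probs, labels, n_bins=10):
    """Count-weighted average of |observed rate - mean predicted| over bins."""
    n = len(probs)
    bins = reliability_bins(probs, labels, n_bins)
    return sum(c / n * abs(obs - pred) for pred, obs, c in bins)


def max_calibration_error(probs, labels, n_bins=10):
    """Largest per-bin gap between observed rate and mean predicted probability."""
    bins = reliability_bins(probs, labels, n_bins)
    return max(abs(obs - pred) for pred, obs, _ in bins)


# Toy data (made up): scores of 0.9 are right only 60% of the time,
# while scores of 0.2 match their 20% positive rate.
probs = [0.9] * 10 + [0.2] * 10
labels = [1] * 6 + [0] * 4 + [1] * 2 + [0] * 8

bins = reliability_bins(probs, labels)
ece = expected_calibration_error(probs, labels)
mce = max_calibration_error(probs, labels)
brier = brier_score(probs, labels)
```

Plotting the first two fields of each tuple in `bins` against each other gives the reliability diagram; the third field (the count) is worth showing too, since sparse bins produce noisy points.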

A common pitfall

Candidates often equate high AUC or F1 with trustworthy probabilities. In practice, a fraud or spam model can rank well yet be overconfident (e.g., items scored 0.9 are positive only 60% of the time), causing over-blocking, poor budget allocation, or alert fatigue. Another mistake is fitting the calibrator on the training set, where the model's scores are already overfit, or on the test set, which leaks into the reported metrics and makes calibration look better than it is; always reserve a separate calibration split or use cross-validation. Finally, candidates often overlook multiclass and multilabel nuances, where per-class calibration and top-k confidence both matter.

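
To make the pitfall concrete, the sketch below simulates a model that ranks well but is overconfident, then fits a single temperature on a held-out calibration split and evaluates on a separate test split. Everything here is an assumption for illustration: the synthetic data generator, the overconfidence factor of 3, and the grid search over temperatures (a simple stand-in for the gradient-based fit usually used for temperature scaling).

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(logits, labels, temperature=1.0):
    """Mean negative log-likelihood under sigmoid(logit / temperature)."""
    eps = 1e-12  # clip probabilities to avoid log(0)
    total = 0.0
    for z, y in zip(logits, labels):
        p = min(max(sigmoid(z / temperature), eps), 1.0 - eps)
        total -= math.log(p) if y == 1 else math.log(1.0 - p)
    return total / len(labels)


def auc(scores, labels):
    """Chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_temperature(logits, labels):
    """Grid-search the temperature in [0.5, 5.0] that minimizes log loss."""
    grid = [0.5 + 0.05 * i for i in range(91)]
    return min(grid, key=lambda t: log_loss(logits, labels, t))


def make_split(n, rng, overconfidence=3.0):
    """Synthetic data: labels follow the true logit, the model reports it inflated."""
    logits, labels = [], []
    for _ in range(n):
        true_logit = rng.gauss(0.0, 1.5)
        labels.append(1 if rng.random() < sigmoid(true_logit) else 0)
        logits.append(overconfidence * true_logit)
    return logits, labels


rng = random.Random(0)
cal_logits, cal_labels = make_split(2000, rng)    # held-out calibration split
test_logits, test_labels = make_split(1000, rng)  # untouched test split

temperature = fit_temperature(cal_logits, cal_labels)
loss_before = log_loss(test_logits, test_labels)
loss_after = log_loss(test_logits, test_labels, temperature)
auc_before = auc(test_logits, test_labels)
auc_after = auc([z / temperature for z in test_logits], test_labels)
```

Dividing logits by a positive temperature is monotone, so the ranking (and therefore AUC) is unchanged while log loss on the test split improves. That is the distinction the pitfall describes: the fix changes probability quality, not discrimination, and its benefit is measured on data the calibrator never saw.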
Further reading

scikit-learn User Guide: Probability calibration — Clear, practical overview with APIs, reliability diagrams, and notes on Platt vs. isotonic methods. (scikit-learn.org)
On Calibration of Modern Neural Networks (Guo et al., ICML 2017) — Seminal paper showing modern nets are often miscalibrated; introduces temperature scaling. (proceedings.mlr.press)
Mitigating bias in calibration error estimation (Google Research) — Discusses pitfalls of ECE estimation and improved estimators; useful for production evaluation. (research.google)

Related concepts