Model Evaluation and Calibration
Asked of: Data Scientist

-
What it is
Model evaluation measures how well a model's predictions serve the task, using metrics matched to it (e.g., accuracy, AUC, log loss). Calibration checks whether predicted probabilities match observed frequencies: among items scored 0.7, roughly 70% should turn out positive. Good systems need both strong discrimination and well-calibrated confidence.
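Stated formally for the binary case, the "0.7 means 70%" idea can be written as below. The symbols are notation assumed here for illustration: \hat{p}(X) is the model's predicted probability for input X, and Y is the true 0/1 label.

```latex
% Perfect calibration: among all inputs that receive score p,
% the fraction of true positives equals p.
\[
  \Pr\bigl(Y = 1 \,\big|\, \hat{p}(X) = p\bigr) = p
  \qquad \text{for all } p \in [0, 1].
\]
```

A model can violate this condition and still rank examples perfectly, which is why calibration is checked separately from discrimination.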
-
Why interviewers ask about it
At companies like Meta, models power ranking, ads, integrity, and safety systems. Many downstream decisions (thresholds, budgets, human review) depend on trustworthy probabilities, not just AUC. Poor calibration can quietly lose revenue or harm user experience by causing over- or under-reaction to risk scores, even when headline metrics look strong.
-
Core ideas to know
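- Discrimination vs calibration: AUC measures ranking only; Brier score and log loss reflect both ranking and probability quality, while reliability diagrams isolate calibration.
- Reliability diagram: bin predictions, then plot each bin's mean predicted probability against its observed positive rate; points on the diagonal mean perfect calibration.
- ECE/MCE: summarize the calibration gap across bins (weighted average and maximum, respectively); estimates are biased with small samples and sensitive to binning choices.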
- Proper scoring rules: log loss and Brier reward honest probabilities; optimizing them encourages calibration.
- Post-hoc fixes: Platt (sigmoid), isotonic regression, temperature scaling; fit on a held-out calibration set.
- Segment checks: evaluate by cohort (e.g., country, device), across different base rates, and under distribution shift; recalibrate periodically.
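The binning, ECE/MCE, and scoring-rule ideas above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names, the equal-width binning, and the toy data are all assumptions made for the example. In practice a library routine (such as the calibration utilities in scikit-learn) would usually be preferred.

```python
import math

def reliability_bins(probs, labels, n_bins=10):
    """Equal-width bins; returns (count, mean_pred, pos_rate) per non-empty bin."""
    sums = [[0, 0.0, 0.0] for _ in range(n_bins)]  # count, sum of probs, sum of labels
    for p, y in zip(probs, labels):
        i = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        sums[i][0] += 1
        sums[i][1] += p
        sums[i][2] += y
    return [(n, sp / n, sy / n) for n, sp, sy in sums if n > 0]

def ece_mce(probs, labels, n_bins=10):
    """ECE: count-weighted mean |mean_pred - pos_rate|; MCE: the largest such gap."""
    bins = reliability_bins(probs, labels, n_bins)
    total = len(probs)
    gaps = [(n, abs(mean_p - rate)) for n, mean_p, rate in bins]
    ece = sum(n / total * gap for n, gap in gaps)
    mce = max(gap for _, gap in gaps)
    return ece, mce

def brier(probs, labels):
    """Mean squared error between predicted probability and 0/1 label."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)

def log_loss(probs, labels, eps=1e-12):
    """Mean negative log-likelihood; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(probs)

# Toy data (made up): items scored 0.9 are right only 60% of the time,
# while items scored 0.2 are positive 20% of the time.
probs = [0.9] * 10 + [0.2] * 10
labels = [1] * 6 + [0] * 4 + [1] * 2 + [0] * 8

ece, mce = ece_mce(probs, labels)
print(f"ECE={ece:.3f} MCE={mce:.3f}")  # ECE=0.150 MCE=0.300
print(f"Brier={brier(probs, labels):.3f} log loss={log_loss(probs, labels):.3f}")
```

The 0.9 bin contributes the whole gap here, which is what a reliability diagram would show as a single point well below the diagonal. With only 20 examples the estimate is noisy, which is the small-sample bias the ECE bullet warns about.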
-
A common pitfall
Candidates often equate high AUC or F1 with trustworthy probabilities. In practice, a fraud or spam model can rank well yet be overconfident (e.g., “0.9” means only 60% correct), causing over-blocking, poor budget allocation, or alert fatigue. Another mistake is fitting the calibrator on the training set, which overfits it, or on the test set, which inflates the reported results; always reserve a separate calibration split or use cross-validation. Finally, candidates often overlook multiclass or multilabel nuances, where per-class calibration and top-k confidence both matter.
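Both points, that ranking can be fine while probabilities are overconfident and that the fix should be fit on a separate split, can be illustrated with a small simulation. Everything here is an assumption for the sake of the example: the `simulate` helper invents a model whose logits are three times too large, and the temperature is chosen by a simple grid search rather than the gradient-based optimizer typically used for temperature scaling.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(probs, labels, eps=1e-12):
    """Mean negative log-likelihood with clipping to avoid log(0)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(probs)

def auc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def fit_temperature(logits, labels):
    """Grid search for T minimizing log loss of sigmoid(logit / T)."""
    grid = [0.25 + 0.05 * k for k in range(116)]  # 0.25 .. 6.0
    return min(grid, key=lambda t: log_loss([sigmoid(z / t) for z in logits], labels))

def simulate(n, overconfidence=3.0):
    """Hypothetical model: labels follow sigmoid(z), but the model reports logit 3z."""
    true_logits = [random.gauss(0, 1.5) for _ in range(n)]
    labels = [1 if random.random() < sigmoid(z) else 0 for z in true_logits]
    model_logits = [overconfidence * z for z in true_logits]
    return model_logits, labels

random.seed(0)
cal_logits, cal_labels = simulate(1000)    # held-out calibration split
test_logits, test_labels = simulate(1000)  # separate split used only for reporting

T = fit_temperature(cal_logits, cal_labels)  # fit on the calibration split only
raw = [sigmoid(z) for z in test_logits]
calibrated = [sigmoid(z / T) for z in test_logits]

auc_raw, auc_cal = auc(raw, test_labels), auc(calibrated, test_labels)
ll_raw, ll_cal = log_loss(raw, test_labels), log_loss(calibrated, test_labels)
print(f"T={T:.2f}")
print(f"AUC      raw={auc_raw:.4f} calibrated={auc_cal:.4f}")
print(f"log loss raw={ll_raw:.4f} calibrated={ll_cal:.4f}")
```

Dividing logits by a positive temperature is a monotone transform, so the ranking (and therefore AUC) is unchanged while log loss improves. That is the pattern the pitfall describes: a headline ranking metric that looks identical before and after, hiding a large difference in probability quality.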
-
Further reading
- scikit-learn User Guide: Probability calibration — Clear, practical overview with APIs, reliability diagrams, and notes on Platt vs. isotonic methods. (scikit-learn.org)
- On Calibration of Modern Neural Networks (Guo et al., ICML 2017) — Seminal paper showing modern nets are often miscalibrated; introduces temperature scaling. (proceedings.mlr.press)
- Mitigating bias in calibration error estimation (Google Research) — Discusses pitfalls of ECE estimation and improved estimators; useful for production evaluation. (research.google)
Related concepts
- ML Model Evaluation And Calibration
- Model Evaluation, Calibration, And Thresholding
- Classifier Evaluation, Thresholding, And Calibration
- Classifier Evaluation, Calibration, And Thresholds
- Machine Learning Model Evaluation And Calibration
- Classifier Evaluation, Calibration, And Thresholding