Classifier Evaluation, Calibration, And Thresholds
Asked of: Data Scientist

What's being tested
Ability to evaluate classifier quality beyond raw accuracy: interpreting confusion matrices, selecting thresholds for business utility, diagnosing and fixing probability miscalibration, and reasoning about trade-offs under class imbalance.
Core knowledge
- Confusion matrix basics: TP, FP, TN, FN and derived metrics (precision, recall, specificity).
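- ROC AUC vs PR AUC: ROC AUC is insensitive to class prevalence, so it can look optimistic under heavy imbalance; PR AUC focuses on the positive class and is more informative when positives are rare.
- Calibration diagnostics: reliability diagrams for a visual check; Brier score and log-loss, both proper scoring rules, for a numeric one.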
- Calibration fixes: Platt scaling (sigmoid), isotonic regression, temperature scaling for neural nets.
- Threshold selection: maximize expected utility under a cost matrix, maximize Youden's J (sensitivity + specificity - 1, which weights both error types equally), or optimize a business KPI on validation data.
- Validation practices: calibrate and pick thresholds on held-out data or nested CV to avoid leakage.
- Monitor drift: post-deployment calibration/threshold decay under distribution shift.
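The metrics in the list above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a reference implementation: the function names, the toy labels, and the toy probabilities are all made up for the example, and in practice a library such as scikit-learn would normally be used instead.

```python
import math

def confusion_counts(y_true, y_prob, threshold=0.5):
    """Return (tp, fp, tn, fn), predicting positive when prob >= threshold."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def derived_metrics(tp, fp, tn, fn):
    """Precision, recall, and specificity; 0.0 when a denominator is empty."""
    return {
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }

def brier_score(y_true, y_prob):
    """Mean squared error between predicted probability and the 0/1 label."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)

def reliability_bins(y_true, y_prob, n_bins=5):
    """Rows of (mean predicted prob, observed positive rate, count) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        bins[min(int(p * n_bins), n_bins - 1)].append((y, p))
    rows = []
    for members in bins:
        if members:
            mean_p = sum(p for _, p in members) / len(members)
            pos_rate = sum(y for y, _ in members) / len(members)
            rows.append((mean_p, pos_rate, len(members)))
    return rows

# Toy data, invented purely for illustration.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.4, 0.65, 0.3, 0.2, 0.6, 0.8, 0.1]

counts = confusion_counts(y_true, y_prob, threshold=0.5)
print(counts)                    # (3, 1, 3, 1)
print(derived_metrics(*counts))  # precision, recall, specificity all 0.75
print(brier_score(y_true, y_prob), log_loss(y_true, y_prob))
print(reliability_bins(y_true, y_prob))
```

In a well-calibrated model, each row from `reliability_bins` would show a mean predicted probability close to the observed positive rate; plotting one against the other gives the reliability diagram.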
Worked example — "How do you choose a decision threshold for a classifier in production?"
First ask what the business objective is and what false positives cost relative to false negatives (a cost matrix, or a KPI such as revenue per true positive). Next, check that the predicted probabilities are calibrated; if they are not, apply Platt scaling, isotonic regression, or temperature scaling fitted on a validation set. Then sweep thresholds over the calibrated probabilities, compute expected utility (or precision and recall) at each cut, and pick the threshold that maximizes the business metric on the validation set. Finally, plan monitoring: track calibration metrics and KPI drift, and re-tune the threshold periodically or when the distribution shifts.
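The threshold sweep described in this answer might look like the sketch below. It assumes the probabilities have already been calibrated and that `y_val` and `p_val` come from a validation set; the payoff values, the data, and the function names are hypothetical and chosen only to make the example concrete.

```python
def expected_utility(y_true, y_prob, threshold, payoff):
    """Average payoff per example when predicting positive iff prob >= threshold."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        predicted_positive = p >= threshold
        if predicted_positive and y == 1:
            total += payoff["tp"]
        elif predicted_positive and y == 0:
            total += payoff["fp"]
        elif not predicted_positive and y == 1:
            total += payoff["fn"]
        else:
            total += payoff["tn"]
    return total / len(y_true)

def best_threshold(y_true, y_prob, payoff):
    """Sweep every observed score as a cut, plus 'predict nothing positive'."""
    candidates = sorted(set(y_prob)) + [float("inf")]
    best = max(candidates, key=lambda t: expected_utility(y_true, y_prob, t, payoff))
    return best, expected_utility(y_true, y_prob, best, payoff)

# Hypothetical validation data and payoffs (assumptions for illustration only):
# a caught positive is worth 10, a false alarm costs 2, a miss costs 10.
y_val = [1, 0, 1, 1, 0, 0, 1, 0]
p_val = [0.9, 0.4, 0.65, 0.3, 0.2, 0.6, 0.8, 0.1]
payoff = {"tp": 10, "fp": -2, "fn": -10, "tn": 0}

threshold, utility = best_threshold(y_val, p_val, payoff)
print(threshold, utility)  # 0.3 4.5
```

Because misses are five times as costly as false alarms in this made-up payoff table, the sweep settles on a low threshold (0.3) rather than the default 0.5. With only eight examples the choice is noisy; a real validation set would need to be much larger.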
A common pitfall
Picking the threshold that maximizes F1 or accuracy on the training set, without checking calibration or class imbalance. A threshold chosen this way tends to overfit and ignores asymmetric error costs, so it can score well offline while hurting the business metric. Similarly, tuning the threshold on the test set leaks information and invalidates the performance estimate.
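One way to avoid both mistakes is to keep three separate splits: fit the model on train, calibrate and tune the threshold on validation, and touch test only once for the final report. The sketch below is one possible stratified split in plain Python; the function name, the split fractions, and the toy label vector are assumptions made for the example.

```python
import random

def stratified_three_way_split(y, val_frac=0.2, test_frac=0.2, seed=0):
    """Return (train, val, test) index lists, preserving the class ratio in each."""
    rng = random.Random(seed)
    train, val, test = [], [], []
    for label in sorted(set(y)):
        idx = [i for i, yi in enumerate(y) if yi == label]
        rng.shuffle(idx)
        n_val = round(len(idx) * val_frac)
        n_test = round(len(idx) * test_frac)
        val += idx[:n_val]                    # calibrate and tune threshold here
        test += idx[n_val:n_val + n_test]     # report final metrics here, once
        train += idx[n_val + n_test:]         # fit the model here
    return train, val, test

# Toy imbalanced labels: 10 positives, 90 negatives (illustrative only).
y = [1] * 10 + [0] * 90
train_idx, val_idx, test_idx = stratified_three_way_split(y)
print(len(train_idx), len(val_idx), len(test_idx))  # 60 20 20
```

Stratifying matters under class imbalance: a purely random split of a small dataset can leave the validation set with too few positives to tune a threshold reliably.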
Further reading
- Niculescu-Mizil & Caruana, "Predicting Good Probabilities" (ICML 2005) — classic on calibration methods.
- Guo et al., "On Calibration of Modern Neural Networks" (ICML 2017) — temperature scaling and calibration diagnostics.