Classifier Evaluation, Thresholding, and Calibration

What's being tested

Interviewers are probing whether you can evaluate probabilistic classifiers in a product context where actions have asymmetric costs: hiding a good post, sending an irrelevant notification, missing spam, or incorrectly flagging an account. The skill is not reciting ROC-AUC or precision definitions, but choosing metrics, thresholds, and calibration checks that match the business decision and user harm. Meta cares because many ranking, integrity, ads, notification, and recommendation systems turn model scores into product actions under capacity, fairness, and user-experience constraints. A strong answer connects offline metrics to online impact, explains tradeoffs across segments, and anticipates distribution shift.

Core knowledge

Start from the confusion matrix at threshold $t$: TP, FP, TN, FN. Precision $= \frac{TP}{TP+FP}$, recall/TPR $= \frac{TP}{TP+FN}$, FPR $= \frac{FP}{FP+TN}$, specificity $= \frac{TN}{TN+FP}$. Most threshold questions reduce to choosing which error is costlier.
ROC-AUC measures the probability that a randomly chosen positive is scored above a randomly chosen negative. It is threshold-independent and insensitive to class prevalence, which is useful for comparing how well models rank but can be misleading for rare-event systems like spam, fraud, harmful content, or notification clicks, where a high AUC can coexist with low precision at the operating threshold.
Precision-recall curves are often better for imbalanced classification. When positives are rare, a small FPR can still create many false positives. Average precision or PR-AUC focuses attention on the quality of predicted positives, which often matters for moderation queues and limited intervention capacity.
Thresholds should be chosen by expected utility when costs are known. Writing $p = p(y=1 \mid x)$ and treating benefits $B$ and costs $C$ as nonnegative magnitudes, predict positive if $p \cdot B_{TP} - (1-p)\cdot C_{FP} > (1-p)\cdot B_{TN} - p\cdot C_{FN}$. In the simple case with calibrated probabilities and only FP/FN costs ($B_{TP} = B_{TN} = 0$), this reduces to acting when $p > \frac{C_{FP}}{C_{FP}+C_{FN}}$.
Calibration means predicted probabilities match empirical frequencies: among examples scored 0.8, about 80% should be positive. Common checks include reliability diagrams, Brier score $\frac{1}{n}\sum_i (p_i-y_i)^2$, log loss, expected calibration error, and calibration by segment.
Discrimination and calibration are different. A model can rank examples well but be poorly calibrated, or be calibrated but weak at ranking. ROC-AUC evaluates ordering; log loss and Brier score evaluate probabilistic quality. Thresholding based on business cost requires reasonably calibrated scores.
Common calibration methods include Platt scaling/logistic calibration, isotonic regression, beta calibration, and temperature scaling for neural nets. Isotonic is flexible but can overfit with small validation sets; Platt is lower variance. Always fit calibration on held-out data, not the training set.
Threshold optimization can be done by sorting scores and sweeping all candidate cutoffs in $O(n \log n)$ time. For very large datasets, approximate with score histograms or quantile buckets; for hundreds of millions of rows, distributed aggregation by score bins is usually sufficient.
Capacity constraints change the problem from “choose probability threshold” to “choose top $K$.” For example, if human reviewers can inspect 50k items/day, select the threshold that yields 50k predicted positives, then evaluate precision@K, recall@K, and marginal value at the cutoff.
Base-rate shift breaks fixed thresholds. If prevalence changes across country, language, seasonality, or product surface, precision at a fixed score changes even if ranking quality is stable. Monitor calibration and confusion-matrix metrics by segment, not only globally.
Beware label quality and delayed labels. Click prediction, abuse detection, and churn models often have biased or incomplete ground truth. Negative labels may mean “not observed yet,” and moderation labels may depend on reviewer policy, sampling, or previous model decisions.
Offline threshold choice should be validated online. Changing a threshold affects user behavior, creator incentives, reviewer load, and feedback loops. Use A/B tests or interleaving where possible, and guardrail metrics such as user reports, session quality, complaints, appeal overturn rate, and latency.
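
The confusion-matrix metrics, the cost-based cutoff, and the sort-then-sweep search above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the function names, the toy scores and labels, and the cost values `c_fp=1` and `c_fn=3` are all assumptions made up for the example.

```python
def confusion_at(scores, labels, t):
    """Return (TP, FP, TN, FN) when predicting positive for score >= t."""
    tp = fp = tn = fn = 0
    for s, y in zip(scores, labels):
        if s >= t:
            tp, fp = tp + y, fp + (1 - y)
        else:
            fn, tn = fn + y, tn + (1 - y)
    return tp, fp, tn, fn


def cost_threshold(c_fp, c_fn):
    """Break-even probability for calibrated scores with only FP/FN costs."""
    return c_fp / (c_fp + c_fn)


def sweep_min_cost(scores, labels, c_fp, c_fn):
    """Sort once, then sweep cutoffs from high to low: O(n log n) overall.

    Returns (threshold, total_cost) minimizing c_fp * FP + c_fn * FN,
    where items with score >= threshold are predicted positive.
    """
    pairs = sorted(zip(scores, labels), reverse=True)
    total_pos = sum(labels)
    # Baseline: threshold above every score, so every positive is missed.
    best_t, best_cost = float("inf"), c_fn * total_pos
    tp = fp = 0
    for i, (s, y) in enumerate(pairs):
        tp += y
        fp += 1 - y
        # Only evaluate a cutoff once all items tied at this score are counted.
        if i + 1 < len(pairs) and pairs[i + 1][0] == s:
            continue
        cost = c_fp * fp + c_fn * (total_pos - tp)
        if cost < best_cost:
            best_t, best_cost = s, cost
    return best_t, best_cost


# Toy data, made up for illustration only.
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]

t, cost = sweep_min_cost(scores, labels, c_fp=1, c_fn=3)
tp, fp, tn, fn = confusion_at(scores, labels, t)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(t, cost, precision, recall, cost_threshold(1, 3))
```

On this toy sample the swept cutoff lands at 0.4 while the break-even formula gives 0.25. The two should only be expected to agree when the scores are well calibrated and the sample is large, which is one reason to check calibration before relying on the closed-form threshold.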

Worked example

“How Would You Set the Threshold for Sending Notifications?”

A strong candidate would first clarify the objective: are we optimizing clicks, long-term retention, meaningful interactions, or minimizing notification fatigue? They would also ask whether the classifier output is a calibrated probability of click, a probability of downstream engagement, or a ranking score, because thresholding an uncalibrated score is different from thresholding a probability. The answer can be organized around four pillars: define the action and cost matrix, evaluate the model offline, choose a threshold under constraints, and validate online with guardrails.

For offline evaluation, they would inspect PR curves, precision/recall at candidate thresholds, calibration plots, and segment-level performance by country, device, age cohort, notification type, and historical engagement. For threshold choice, they might maximize expected utility: send if expected incremental value from engagement exceeds expected cost from annoyance, unsubscribe risk, or notification disablement. If Meta imposes a daily notification budget, the candidate should switch from a global probability threshold to top-$K$ or per-user frequency-capped ranking.
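
One way to picture the utility rule combined with a budget and a per-user cap is the sketch below. Everything in it is hypothetical: the function name, the `value_click` and `cost_ignored` parameters, and the candidate data are invented for illustration, and it uses a calibrated click probability as a stand-in for incremental value, which the causal caveat later in this example warns is only a proxy.

```python
def select_notifications(candidates, value_click, cost_ignored,
                         daily_budget, per_user_cap):
    """Pick notifications to send under a global budget and per-user cap.

    candidates: list of (user_id, notif_id, p_click), where p_click is
    assumed to be a calibrated probability.
    """
    scored = []
    for user_id, notif_id, p in candidates:
        # Expected utility of sending: gain if clicked, annoyance cost if not.
        utility = p * value_click - (1 - p) * cost_ignored
        if utility > 0:  # same as p > cost_ignored / (value_click + cost_ignored)
            scored.append((utility, user_id, notif_id))

    scored.sort(reverse=True)  # highest expected utility first
    sent, per_user = [], {}
    for utility, user_id, notif_id in scored:
        if len(sent) >= daily_budget:
            break
        if per_user.get(user_id, 0) >= per_user_cap:
            continue  # frequency cap reached for this user
        per_user[user_id] = per_user.get(user_id, 0) + 1
        sent.append((user_id, notif_id))
    return sent


# Made-up candidates: (user_id, notif_id, p_click).
candidates = [
    ("u1", "a", 0.9),
    ("u1", "b", 0.8),
    ("u2", "c", 0.5),
    ("u3", "d", 0.1),
    ("u2", "e", 0.3),
]
sent = select_notifications(candidates, value_click=1.0, cost_ignored=0.25,
                            daily_budget=3, per_user_cap=1)
print(sent)
```

With these assumed costs the break-even probability is 0.2, so the 0.1 candidate is dropped before ranking; the cap then removes each user's second-best notification even though the budget is not exhausted.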

One explicit tradeoff is short-term clicks versus long-term user trust: a lower threshold may raise CTR but increase opt-outs or reduce future engagement. A strong close would say: “I’d launch with a conservative threshold based on offline utility and capacity, run an A/B test, monitor incremental engagement and fatigue guardrails, and recalibrate or personalize thresholds if segments respond differently.” If given more time, they could add causal measurement, because observed clicks do not necessarily equal incremental value; some users would have opened the app anyway.

A second angle

“How Would You Evaluate a Spam Classifier?”

The same ideas apply, but the cost asymmetry is different: missing spam creates user harm, while falsely removing legitimate content creates creator harm and trust issues. Instead of optimizing CTR or utility from engagement, the focus is usually high recall at an acceptable false-positive rate, precision of enforcement actions, reviewer queue quality, and appeal overturn rate. Calibration still matters if the score triggers escalating actions such as downranking, interstitial warnings, review, or automatic removal. The threshold may also vary by action: a low threshold for sending to review, a higher threshold for reducing distribution, and a very high threshold for removal. Segment analysis is especially important because abuse patterns and label quality vary across languages, regions, and adversarial communities.
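
The idea of escalating actions at different cutoffs might be sketched as follows. The cutoff values and action names are placeholders chosen for illustration, not recommended settings; in practice each cutoff would be set from the cost asymmetry and reviewer capacity for that action.

```python
from collections import Counter

# Hypothetical cutoffs, ordered from most to least severe action.
ACTION_THRESHOLDS = [
    (0.95, "remove"),
    (0.70, "reduce_distribution"),
    (0.30, "send_to_review"),
]


def action_for(score):
    """Return the most severe action whose cutoff the score meets."""
    for cutoff, action in ACTION_THRESHOLDS:
        if score >= cutoff:
            return action
    return "no_action"


def action_volumes(scores):
    """Count items per action, e.g. to compare review volume with capacity."""
    return Counter(action_for(s) for s in scores)


# Made-up scores for illustration.
volumes = action_volumes([0.99, 0.96, 0.8, 0.5, 0.45, 0.31, 0.1, 0.05])
print(volumes)
```

Counting volume per action makes the capacity question concrete: if the review bucket exceeds what reviewers can handle, the lowest cutoff has to rise or the queue has to be ranked and truncated.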

Common pitfalls

Analytical mistake: optimizing ROC-AUC when the product action happens at one operating point. A tempting answer is “I’d pick the model with higher AUC,” but two models can have the same AUC and very different precision at the threshold used in production. A stronger answer identifies the relevant operating range, such as precision@review-capacity or recall at FPR below 0.1%.
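
A small constructed example makes the point concrete. The two score vectors below are invented so that both hypothetical models have identical ROC-AUC on the same eight items but different precision among the top two. The pairwise AUC here is a simple quadratic-time sketch meant for tiny inputs.

```python
def roc_auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos) * len(neg))


def precision_at_k(scores, labels, k):
    """Share of true positives among the k highest-scored items."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sum(labels[i] for i in order[:k]) / k


# Same eight items for both models: three positives, five negatives.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
# Model A puts two positives on top but ranks the third near the bottom.
scores_a = [0.9, 0.8, 0.3, 0.7, 0.6, 0.5, 0.4, 0.2]
# Model B ranks a negative first but keeps all positives in the top five.
scores_b = [0.8, 0.7, 0.5, 0.9, 0.6, 0.4, 0.3, 0.2]

print(roc_auc(scores_a, labels), roc_auc(scores_b, labels))
print(precision_at_k(scores_a, labels, 2), precision_at_k(scores_b, labels, 2))
```

Both models order 11 of the 15 positive-negative pairs correctly, yet if capacity allows acting on only two items, model A acts on two true positives and model B on one.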

Communication mistake: giving metric definitions without tying them to user or business costs. Saying “precision is TP over predicted positives and recall is TP over actual positives” is correct but shallow. Interviewers want to hear which one matters more for the product and why: false notification versus missed engagement, false takedown versus missed abuse, overpredicted ad conversions that waste budget versus missed conversions.

Depth mistake: ignoring calibration and assuming scores are probabilities. Many production models output scores that rank well but are not calibrated after class weighting, negative sampling, or model retraining. If you use a score of 0.7 as “70% likely” without checking calibration, your threshold and expected utility math may be wrong; mention held-out calibration and segment-level reliability.
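
To illustrate how far a raw score can drift from a probability, the sketch below applies the standard odds correction for negative downsampling and computes a binned expected calibration error. It assumes negatives were uniformly downsampled while all positives were kept; the `keep_rate` parameter name, the equal-width binning, and the toy data are choices made for this example only.

```python
def correct_for_negative_sampling(p, keep_rate):
    """Map a score trained on downsampled negatives back to the original scale.

    keep_rate is the fraction of negatives kept in training (0 < keep_rate <= 1).
    Downsampling inflates the odds by 1 / keep_rate, so this undoes that factor.
    """
    return p / (p + (1 - p) / keep_rate)


def reliability_bins(probs, labels, n_bins=10):
    """Return (mean predicted, observed positive rate, count) per non-empty bin."""
    bins = [[0.0, 0, 0] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[b][0] += p
        bins[b][1] += y
        bins[b][2] += 1
    return [(s / n, pos / n, n) for s, pos, n in bins if n > 0]


def expected_calibration_error(probs, labels, n_bins=10):
    """Count-weighted average gap between predicted and observed rates."""
    total = len(probs)
    return sum(n / total * abs(mean_p - rate)
               for mean_p, rate, n in reliability_bins(probs, labels, n_bins))


# A score of 0.7 from a model trained with 10% of negatives kept.
corrected = correct_for_negative_sampling(0.7, keep_rate=0.1)
print(corrected)

# Made-up held-out data: ten items scored 0.7, only two are positive.
probs = [0.7] * 10
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
print(expected_calibration_error(probs, labels))
```

Under these assumptions the score of 0.7 corresponds to roughly a 19% chance on the original distribution, so plugging 0.7 directly into the expected-utility rule would badly overstate the case for acting.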

Connections

This topic often leads into ranking and recommendation metrics such as NDCG, MAP, precision@K, and counterfactual evaluation. Interviewers may also pivot to experimentation, including A/B testing, guardrail metrics, heterogeneous treatment effects, and long-term metric tradeoffs. For integrity or ads systems, expect follow-ups on label bias, delayed feedback, fairness across groups, and distribution shift monitoring.