Model Evaluation, Calibration, And Thresholding

What's being tested

Interviewers are probing whether you can translate probabilistic model outputs into reliable product decisions under business constraints. At Meta, classifiers and ranking models drive ads delivery, feed ranking, notification sends, integrity enforcement, friend recommendations, and marketplace trust decisions, where the cost of false positives and false negatives is rarely symmetric. The key skill is not reciting AUC or precision definitions; it is choosing the right evaluation metric, diagnosing whether scores mean what they claim to mean, and setting thresholds that optimize product outcomes without harming users or violating constraints. Strong candidates connect offline metrics to online experimentation, segment-level reliability, and operational decision rules.

Core knowledge

Start with the decision context: ranking, binary action, human review queue, or probability estimate. ROC-AUC may be fine for rank ordering, but thresholded actions require precision, recall, false positive rate, expected cost, or utility. Calibrated probabilities matter when downstream systems consume scores as risks.
Confusion-matrix counts (TP, FP, TN, FN) depend on a threshold $t$. Common formulas: $\text{precision}=TP/(TP+FP)$, $\text{recall}=TP/(TP+FN)$, $\text{FPR}=FP/(FP+TN)$, $\text{F1}=2PR/(P+R)$, where $P$ and $R$ denote precision and recall. Always define the positive class; in integrity settings, “positive” may mean violation.
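
A minimal sketch of these formulas in Python, assuming the decision rule is "flag when score >= threshold" and that undefined ratios are reported as 0.0. The function name `confusion_metrics` and the toy data are illustrative, not taken from any particular library.

```python
def confusion_metrics(scores, labels, threshold):
    """Count TP/FP/TN/FN for the rule `score >= threshold` and derive rates."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        flagged = score >= threshold
        if flagged and label == 1:
            tp += 1
        elif flagged:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1

    def ratio(numerator, denominator):
        # Assumption: report 0.0 when a rate is undefined (empty denominator).
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "tp": tp, "fp": fp, "tn": tn, "fn": fn,
        "precision": precision,
        "recall": recall,
        "fpr": ratio(fp, fp + tn),
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Toy data: label 1 is the positive class (for example, a violation).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
for t in (0.25, 0.5, 0.75):
    print(t, confusion_metrics(scores, labels, t))
```

Sweeping the threshold like this makes the precision/recall tradeoff concrete: every metric in the dictionary changes with $t$ even though the model is fixed.
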
ROC-AUC measures the probability that a randomly chosen positive is scored above a randomly chosen negative. Because it is insensitive to class prevalence, it can look overly optimistic for rare events: even a small FPR produces many false positives when negatives vastly outnumber positives. For spam, fraud, abuse, or click prediction with low base rates, PR-AUC and precision at fixed recall are usually more informative.
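
The pairwise interpretation can be checked directly. The sketch below computes ROC-AUC by brute force over positive/negative pairs (ties count as half a win) and then replicates the negatives to mimic a rarer positive class. The function names and toy scores are assumptions made for illustration; the quadratic loop is for clarity, not efficiency.

```python
def roc_auc_pairwise(scores, labels):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))


def precision_at(scores, labels, threshold):
    """Precision of the rule `score >= threshold` (0.0 if nothing is flagged)."""
    flagged = [y for s, y in zip(scores, labels) if s >= threshold]
    return sum(flagged) / len(flagged) if flagged else 0.0


pos_scores = [0.9, 0.8, 0.4]
neg_scores = [0.7, 0.3, 0.1]

balanced_scores = pos_scores + neg_scores
balanced_labels = [1] * 3 + [0] * 3

# Same score distributions, but negatives are now 50x more common.
rare_scores = pos_scores + neg_scores * 50
rare_labels = [1] * 3 + [0] * 150

for name, s, y in [("balanced", balanced_scores, balanced_labels),
                   ("rare positives", rare_scores, rare_labels)]:
    print(name, "AUC:", roc_auc_pairwise(s, y), "precision@0.5:", precision_at(s, y, 0.5))
```

In this toy setup the AUC is identical in both cases, while precision at the 0.5 threshold collapses once negatives dominate, which is the gap the paragraph above warns about.
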
Log loss and Brier score evaluate probabilistic quality, not just ranking. Log loss is $-\frac{1}{N}\sum_i \left[y_i\log p_i+(1-y_i)\log(1-p_i)\right]$, which heavily penalizes confident wrong predictions. Brier score is $\frac{1}{N}\sum_i(p_i-y_i)^2$ and decomposes into calibration and refinement.
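
A small sketch of both scores, assuming binary labels in {0, 1} and probabilities clipped away from 0 and 1 by a small `eps` so the logarithm stays finite (the clipping is a common convention, added here as an assumption). The function names and example predictions are illustrative.

```python
import math


def log_loss(probs, labels, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(probs)


def brier_score(probs, labels):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


labels = [1, 1]
hedged = [0.5, 0.5]            # uninformative but never confidently wrong
overconfident = [0.99, 0.01]   # one confident hit, one confident miss

for name, probs in [("hedged", hedged), ("overconfident", overconfident)]:
    print(name, "log loss:", log_loss(probs, labels), "Brier:", brier_score(probs, labels))
```

The single confident miss dominates the log loss of the second model far more than it dominates the Brier score, because squared error is bounded by 1 per example while log loss is unbounded.
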
Calibration means predicted probabilities match empirical frequencies: among items scored near 0.8, roughly 80% should be positive. Reliability diagrams bucket predictions and plot mean prediction against observed rate; expected calibration error summarizes the gap across bins: $ECE=\sum_b \frac{n_b}{N}|\text{acc}(b)-\text{conf}(b)|$, where for binary probabilities $\text{acc}(b)$ is the observed positive rate in bin $b$ and $\text{conf}(b)$ is the mean predicted probability. Use enough samples per bin; sparse bins mislead.
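
A sketch of ECE for binary probabilities, assuming equal-width bins on [0, 1] and taking the per-bin gap as |observed positive rate - mean predicted probability|. The name `expected_calibration_error` and the bin count are illustrative choices; equal-mass (quantile) bins are a reasonable alternative when scores cluster near zero.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Return (ECE, rows) where rows are (count, mean_prediction, observed_rate) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[index].append((p, y))

    total = len(probs)
    ece = 0.0
    rows = []
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_prediction = sum(p for p, _ in members) / len(members)
        observed_rate = sum(y for _, y in members) / len(members)
        ece += len(members) / total * abs(observed_rate - mean_prediction)
        rows.append((len(members), mean_prediction, observed_rate))
    return ece, rows


# Ten items all scored 0.8: calibrated if about 8 are positive.
well_calibrated = expected_calibration_error([0.8] * 10, [1] * 8 + [0] * 2)
overconfident = expected_calibration_error([0.8] * 10, [1] * 5 + [0] * 5)
print("calibrated ECE:", well_calibrated[0])
print("overconfident ECE:", overconfident[0])
```

The `rows` output is what a reliability diagram plots; checking the counts in each row is a quick way to spot bins too sparse to trust.
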
Common calibration methods include Platt scaling (logistic calibration), isotonic regression, beta calibration, and temperature scaling for neural networks. Platt scaling fits only two parameters, so it is low-variance and works with limited validation data; isotonic regression is more flexible but can overfit when the calibration set is small (scikit-learn's documentation advises against it with far fewer than about 1,000 calibration samples) or in sparse segments.
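
To make the flexibility of isotonic regression concrete, here is a minimal pool-adjacent-violators sketch that fits a non-decreasing step function from raw scores to observed positive rates. It is an illustrative implementation under simple assumptions (unweighted examples, step-function lookup, no interpolation), not the code of any particular library; `fit_isotonic` and `apply_isotonic` are hypothetical names.

```python
from itertools import groupby


def fit_isotonic(scores, labels):
    """Pool adjacent violators; returns [(upper_score, calibrated_value), ...] sorted by score."""
    pairs = sorted(zip(scores, labels))
    blocks = []  # each block is [label_sum, count, upper_score]
    for score, group in groupby(pairs, key=lambda pair: pair[0]):
        group_labels = [label for _, label in group]  # tied scores share one block
        blocks.append([float(sum(group_labels)), len(group_labels), score])
        # Merge backwards while the previous block's mean exceeds the newest block's mean.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            label_sum, count, upper = blocks.pop()
            blocks[-1][0] += label_sum
            blocks[-1][1] += count
            blocks[-1][2] = upper
    return [(upper, label_sum / count) for label_sum, count, upper in blocks]


def apply_isotonic(blocks, score):
    """Look up the calibrated value for a raw score (clamped at the top block)."""
    for upper, value in blocks:
        if score <= upper:
            return value
    return blocks[-1][1]


raw_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
outcomes = [0, 1, 0, 0, 1, 1]
calibrator = fit_isotonic(raw_scores, outcomes)
print(calibrator)
print(apply_isotonic(calibrator, 0.25))
```

With only six calibration examples the fitted map already outputs probabilities of exactly 0 and 1 at the extremes, a small-scale picture of the overfitting risk described above.
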
Threshold choice should follow expected utility, not default to 0.5. If FP cost is $C_{FP}$ and FN cost is $C_{FN}$, classify positive when $p > \frac{C_{FP}}{C_{FP}+C_{FN}}$, assuming calibrated probabilities and that correct decisions carry no cost (or equal benefit either way). In practice, constraints like review capacity, user experience, or advertiser ROI often dominate.
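
The threshold rule follows from comparing the expected cost of the two actions for an item with calibrated probability $p$, under the stated assumption that correct decisions cost nothing:

```latex
\begin{align*}
\mathbb{E}[\text{cost} \mid \text{act}] &= (1-p)\,C_{FP} \\
\mathbb{E}[\text{cost} \mid \text{do not act}] &= p\,C_{FN} \\
\text{act} \iff (1-p)\,C_{FP} < p\,C_{FN}
  &\iff p > \frac{C_{FP}}{C_{FP}+C_{FN}}
\end{align*}
```

As an illustrative example with made-up costs, $C_{FP}=1$ and $C_{FN}=9$ give a threshold of $1/(1+9)=0.1$: when misses are nine times as costly as false alarms, it is worth acting on anything above a 10% probability.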
For capacity-constrained systems, choose thresholds by ranking scores and taking the top $K$, or set a threshold that satisfies a constraint such as $\text{precision} \ge 95\%$, $\text{recall} \ge 80\%$, or review volume $\le 50{,}000$ per day. This is common for integrity queues and support triage.
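
Both styles of constrained threshold can be sketched on a labeled validation set. The helpers below are hypothetical and assume the rule "flag when score >= threshold"; note that precision is not monotone in the threshold, so the first function scans every cut point and keeps the lowest one that still meets the target.

```python
def threshold_for_precision(scores, labels, min_precision):
    """Lowest threshold whose flagged set meets the precision target, or None."""
    ranked = sorted(zip(scores, labels), reverse=True)
    best = None
    true_positives = 0
    for index, (score, label) in enumerate(ranked, start=1):
        true_positives += label
        # Only cut after the last item of a tie group so the threshold is well defined.
        if index < len(ranked) and ranked[index][0] == score:
            continue
        if true_positives / index >= min_precision:
            best = score
    return best


def threshold_for_top_k(scores, k):
    """Score of the k-th highest item; ties at that score can flag more than k items."""
    ranked = sorted(scores, reverse=True)
    return ranked[k - 1] if 0 < k <= len(ranked) else None


val_scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3]
val_labels = [1, 1, 1, 0, 1, 0, 0, 0]

print("precision >= 0.8 at threshold:", threshold_for_precision(val_scores, val_labels, 0.8))
print("capacity of 3 items at threshold:", threshold_for_top_k(val_scores, 3))
```

Choosing the lowest qualifying threshold maximizes recall subject to the precision constraint; a capacity limit would instead use the top-K cut computed on the expected daily volume.
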
Evaluate by slice: geography, language, device, age cohort, new vs established users, content type, advertiser vertical, and traffic source. Aggregate AUC can improve while calibration or false positive rates worsen for small but important groups. Meta interviewers often reward explicit slice analysis.
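
A sketch of slice-level evaluation, here for false positive rate at a fixed threshold. The slice names, data, and the helper `false_positive_rate_by_slice` are all made up for illustration; the same pattern applies to calibration error or precision per slice.

```python
from collections import defaultdict


def false_positive_rate_by_slice(scores, labels, slices, threshold):
    """FPR of the rule `score >= threshold`, computed separately per slice."""
    counts = defaultdict(lambda: {"fp": 0, "tn": 0})
    for score, label, name in zip(scores, labels, slices):
        if label == 0:  # only negatives enter the FPR
            counts[name]["fp" if score >= threshold else "tn"] += 1
    return {name: c["fp"] / (c["fp"] + c["tn"]) for name, c in counts.items()}


# Hypothetical data: a large slice with 8 negatives and a small slice with 2.
slice_scores = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.2, 0.7, 0.9] + [0.2, 0.8, 0.6]
slice_labels = [0, 0, 0, 0, 0, 0, 0, 0, 1] + [0, 0, 1]
slice_names = ["large_market"] * 9 + ["small_market"] * 3

per_slice = false_positive_rate_by_slice(slice_scores, slice_labels, slice_names, 0.5)
overall = false_positive_rate_by_slice(slice_scores, slice_labels, ["all"] * 12, 0.5)
print(per_slice, overall)
```

In this toy data the overall FPR looks moderate while the small slice's FPR is several times higher, which is exactly the pattern an aggregate metric hides. With so few examples per slice, a real analysis would also report confidence intervals.
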
Offline validation must respect time and leakage. Use temporal splits for feed, ads, and integrity models because behavior drifts and labels arrive with delay. Avoid using post-treatment features, future engagement, reviewer decisions caused by the model, or duplicate near-identical examples across train/test.
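
One way to sketch a leakage-aware temporal split: train only on examples whose labels had matured by the cutoff, and test only on events after it. The field name `day`, the integer day index, and the fixed `label_delay_days` are simplifying assumptions for illustration.

```python
def temporal_split(examples, cutoff_day, label_delay_days):
    """Split by event time, dropping examples whose labels were not final at the cutoff."""
    train = [e for e in examples if e["day"] + label_delay_days <= cutoff_day]
    test = [e for e in examples if e["day"] > cutoff_day]
    return train, test


# Hypothetical events on days 1..10, one per day.
events = [{"day": day, "example_id": day} for day in range(1, 11)]
train_set, test_set = temporal_split(events, cutoff_day=7, label_delay_days=2)
print("train days:", [e["day"] for e in train_set])
print("test days:", [e["day"] for e in test_set])
```

Events from the last `label_delay_days` before the cutoff are deliberately excluded from training, since their labels would not yet have been observable at the time the model was trained.
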
Online evaluation requires experimentation. A model with better log loss can hurt long-term engagement, creator satisfaction, or ad marketplace dynamics. Use A/B tests with guardrails: latency, complaint rate, hide/report rate, revenue, retention, diversity, fairness proxies, and ecosystem health metrics.
Calibration can drift after launch due to distribution shift, feedback loops, or policy changes. Monitor population stability index, score distribution shifts, calibration by decile, realized base rate, and threshold-triggered action rates. Recalibrate periodically using recent labels, but avoid chasing noisy short-term fluctuations.
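
A minimal sketch of the population stability index between a baseline score distribution and a recent one, assuming scores lie in [0, 1], equal-width bins, and a small floor `eps` on bin fractions so empty bins do not produce a division by zero. The function name and sample data are illustrative.

```python
import math


def population_stability_index(baseline_scores, current_scores, n_bins=10, eps=1e-6):
    """PSI = sum over bins of (current - baseline) * ln(current / baseline)."""

    def bin_fractions(scores):
        counts = [0] * n_bins
        for s in scores:
            counts[min(int(s * n_bins), n_bins - 1)] += 1
        # Floor each fraction at eps so the log ratio stays finite.
        return [max(c / len(scores), eps) for c in counts]

    expected = bin_fractions(baseline_scores)
    actual = bin_fractions(current_scores)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))


# Hypothetical shift: the share of high scores doubles from 20% to 40%.
baseline = [0.05] * 80 + [0.95] * 20
current = [0.05] * 60 + [0.95] * 40

print("no shift:", population_stability_index(baseline, baseline))
print("shifted:", population_stability_index(baseline, current))
```

PSI only detects that the score distribution moved; it does not say whether calibration degraded, so it works best alongside the realized base rate and calibration-by-decile checks once labels arrive.
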

Worked example

Evaluate and choose a threshold for a notification classifier

A strong candidate would first clarify the objective: are we predicting probability of click, probability of meaningful engagement, or probability the user will find the notification valuable? They would also ask what the action is: send versus suppress, rank within a notification tray, or allocate a limited daily notification budget. In the first 30 seconds, they might state: “I’ll assume this is a binary classifier producing a probability of positive user engagement, and we need to decide when to send while protecting user experience.”

The answer should then be organized around four pillars: offline evaluation, calibration, threshold selection, and online validation. Offline, they would compare models using PR-AUC, log loss, calibration curves, and precision/recall at candidate thresholds, especially because notification response is likely low base-rate. For calibration, they would check whether users scored at 0.2 actually engage about 20% of the time, and inspect reliability separately by user activity level, country, device, and notification type. For thresholding, they would avoid saying “use 0.5” and instead define a utility function: benefit of engagement minus cost of annoyance, opt-outs, muted notifications, or downstream churn.
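
The utility framing for the send threshold can be sketched as a sweep over candidate thresholds. This assumes calibrated probabilities, a fixed per-notification `engagement_benefit` when the user engages, and a fixed `annoyance_cost` when they do not; the numbers and function names are made up, and real costs would have to be estimated from opt-out or retention data.

```python
def expected_utility(probs, threshold, engagement_benefit, annoyance_cost):
    """Expected utility of sending every notification with p >= threshold."""
    return sum(
        p * engagement_benefit - (1 - p) * annoyance_cost
        for p in probs
        if p >= threshold
    )


def best_threshold(probs, candidates, engagement_benefit, annoyance_cost):
    """Candidate threshold with the highest expected utility."""
    return max(
        candidates,
        key=lambda t: expected_utility(probs, t, engagement_benefit, annoyance_cost),
    )


# Hypothetical calibrated engagement probabilities for six candidate notifications.
predicted = [0.05, 0.1, 0.15, 0.25, 0.3, 0.6]
candidates = [0.1, 0.2, 0.3, 0.5]

for t in candidates:
    print(t, expected_utility(predicted, t, engagement_benefit=1.0, annoyance_cost=0.25))
print("chosen threshold:", best_threshold(predicted, candidates, 1.0, 0.25))
```

With these illustrative numbers the break-even probability is 0.25 / (1.0 + 0.25) = 0.2, far below 0.5, and the sweep picks the candidate at that value. Long-term costs such as churn are not captured by a per-send utility, which is why the guardrails discussed next still matter.
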

One tradeoff to flag explicitly is short-term clicks versus long-term user fatigue. A lower threshold sends more notifications, which may increase total clicks in the short term (even as per-notification CTR falls) but can also raise notification disables or reduce session quality, so the launch decision should include guardrails such as unsubscribe rate, complaint rate, 7-day retention, and total notification volume per user. The candidate should close by saying that if they had more time, they would run an A/B test with ramped exposure, monitor calibration drift after launch, and consider personalized thresholds based on user-level tolerance or historical responsiveness.

A second angle

Assess whether a feed ranking model is well calibrated

The same ideas apply, but the framing changes because feed ranking is often more about ordering candidates than making one binary send/suppress decision. AUC or NDCG may tell you whether engaging posts are ranked above less engaging posts, while calibration tells you whether the predicted probability of comment, share, or hide is numerically trustworthy. Calibration matters if scores are combined with other objectives, such as integrity penalties, creator value, or ad load, because a miscalibrated model can dominate or underweight the final ranking formula. Unlike a notification threshold, the “decision boundary” may be implicit in a multi-objective ranking system, so the candidate should discuss score normalization, per-surface calibration, and online metrics like session depth, meaningful interactions, hides, and long-term retention.

Common pitfalls

Analytical mistake: optimizing the wrong metric.
A tempting answer is “use accuracy” or “pick the model with highest AUC.” Accuracy fails under class imbalance, and AUC can hide poor precision at the operating point. A stronger answer names the business action first, then selects metrics such as precision at fixed recall, recall at fixed FPR, log loss, calibration error, or expected utility.

Communication mistake: skipping assumptions.
Candidates often jump into formulas without defining the positive class, label source, base rate, or cost asymmetry. In a Meta-style product interview, it lands better to say, “I’m assuming false positives mean we incorrectly take an action against good content, which is more costly than missing some bad content,” then adapt the thresholding strategy accordingly.

Depth mistake: treating calibration as a single global property.
Saying “I’ll plot a calibration curve” is not enough. Good answers mention calibration by decile and by important slices, label delay, distribution shift, and the possibility that a model is well calibrated globally but badly miscalibrated for low-resource languages, new users, or rare content categories.

Connections

Interviewers may pivot from this topic into experimentation, especially how to validate offline model gains with A/B tests and guardrail metrics. They may also ask about class imbalance, ranking metrics such as NDCG/MAP, causal effects of launching a model, fairness across user groups, or monitoring ML systems under distribution shift.