ML Model Evaluation And Calibration

What's being tested

Interviewers are probing whether you can evaluate a predictive model in a way that matches the product decision it will drive, not whether you can recite metric definitions. At Meta, models often rank Feed stories, ads, notifications, integrity actions, or recommendations, so “good” can mean accurate probabilities, strong ranking, improved downstream engagement, reduced harm, or stable performance across user segments. You need to reason from offline metrics to online impact, identify when a model is miscalibrated, and explain how you would debug metric movement across cohorts, time, and thresholds. A strong answer connects statistical evaluation, business tradeoffs, experimentation, and production monitoring.

Core knowledge

Distinguish discrimination from calibration. Discrimination asks whether positives rank above negatives, often measured by ROC-AUC or PR-AUC. Calibration asks whether predicted probabilities match empirical frequencies: among items scored 0.7, roughly 70% should be positive. Meta ranking systems may need both: discrimination for ordering candidates, calibration for bidding, budgeting, or expected value decisions.
Core binary classification metrics come from the confusion matrix: precision $P = TP/(TP+FP)$, recall $R = TP/(TP+FN)$, false positive rate $FPR = FP/(FP+TN)$, and $F_1 = 2PR/(P+R)$. Threshold-dependent metrics are useful when an action has a fixed operating point, such as suppressing content or sending a notification.
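A minimal sketch of these definitions at one operating point, assuming binary labels in {0, 1} and a "score at or above threshold means act" rule. The helper name `confusion_metrics` and the toy data are illustrative assumptions.

```python
def confusion_metrics(y_true, scores, threshold):
    """Precision, recall, FPR, and F1 at a single score threshold."""
    tp = fp = fn = tn = 0
    for y, s in zip(y_true, scores):
        predicted_positive = s >= threshold
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive and y == 0:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    # Guard against empty denominators (e.g. nothing predicted positive).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "fpr": fpr, "f1": f1}


# Toy data: 3 positives, 5 negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.3, 0.8, 0.4, 0.2, 0.1, 0.05]
metrics = confusion_metrics(y_true, scores, threshold=0.5)
print(metrics)
```

Re-running the same function at several thresholds is a quick way to show an interviewer how precision and recall trade off around the operating point.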
ROC-AUC estimates $P(s(x^+) > s(x^-))$ and is threshold-independent, but it can look strong under extreme class imbalance while precision is poor. For rare events like abuse, fraud, or conversion, PR-AUC and precision@k are often more actionable because they focus on the positive class and top-ranked decisions.
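The gap between ROC-AUC and top-of-list precision under imbalance can be illustrated with a small synthetic example. This is a sketch under stated assumptions: the data is made up (2 positives among 1,000 items), the function names are hypothetical, and the pairwise AUC is written for clarity, not for scale.

```python
def roc_auc_pairwise(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def precision_at_k(y_true, scores, k):
    """Share of positives among the k highest-scored items."""
    ranked = sorted(zip(scores, y_true), key=lambda t: t[0], reverse=True)
    return sum(y for _, y in ranked[:k]) / k


# Synthetic imbalance: 998 negatives spread over [0, 0.997], 2 high-scoring positives.
neg_scores = [i / 1000 for i in range(998)]
pos_scores = [0.9795, 0.9895]
y_true = [0] * len(neg_scores) + [1] * len(pos_scores)
scores = neg_scores + pos_scores

auc = roc_auc_pairwise(y_true, scores)
p_at_10 = precision_at_k(y_true, scores, k=10)
print(f"AUC={auc:.3f}, precision@10={p_at_10:.2f}")
```

Here the positives outrank about 99% of negatives, so AUC is high, yet only one of the top ten items is positive because a handful of negatives still score above them.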
Log loss evaluates probabilistic accuracy:
$\text{LogLoss} = -\frac{1}{n}\sum_i \left[y_i\log p_i + (1-y_i)\log(1-p_i)\right].$
It heavily penalizes confident wrong predictions, since the per-example penalty grows without bound as $p_i$ approaches the wrong extreme, making it useful when probabilities feed expected value calculations. Brier score, $\frac{1}{n}\sum_i (p_i-y_i)^2$, is bounded between 0 and 1, so it is less harsh on confident errors, and it decomposes into calibration and refinement components.
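The difference in harshness is easy to see numerically. The sketch below implements both formulas directly; the function names and the clipping constant `eps` are assumptions added to avoid `log(0)`, not part of the definitions.

```python
import math


def log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-likelihood for binary labels; probs are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


def brier_score(y_true, probs):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, probs)) / len(y_true)


# A single positive example scored with decreasing confidence in the right answer.
for p in (0.4, 0.1, 0.01):
    print(p, round(log_loss([1], [p]), 3), round(brier_score([1], [p]), 4))
```

As the prediction for a true positive drops from 0.4 to 0.01, log loss climbs from about 0.92 to about 4.61, while the Brier score can never exceed 1.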
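Calibration is commonly inspected with reliability diagrams, where predictions are bucketed and the mean predicted probability in each bucket is compared to the observed outcome rate. Expected Calibration Error is often approximated as
$ECE = \sum_{b=1}^B \frac{n_b}{n} |\text{acc}(b)-\text{conf}(b)|,$
where $n_b$ is the number of examples in bin $b$, $\text{conf}(b)$ is the mean predicted probability in that bin, and $\text{acc}(b)$ is the observed positive rate there (or the top-label accuracy in the multiclass setting). ECE depends heavily on binning; sparse bins, especially near 0 or 1, can create noisy conclusions.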
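A minimal sketch of the binned ECE estimate for a binary model, assuming equal-width bins on [0, 1] and probabilities already in that range. The function name and the toy data are illustrative; the second call shows how strongly the result can depend on the number of bins.

```python
def expected_calibration_error(y_true, probs, n_bins=10):
    """Binned ECE: weighted mean of |observed positive rate - mean predicted prob|."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, probs):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((y, p))
    n = len(y_true)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue  # empty bins contribute nothing
        observed = sum(y for y, _ in bucket) / len(bucket)
        predicted = sum(p for _, p in bucket) / len(bucket)
        ece += len(bucket) / n * abs(observed - predicted)
    return ece


# Badly miscalibrated toy scores: low scores are all positive, high scores all negative.
probs = [0.2] * 5 + [0.8] * 5
y_true = [1] * 5 + [0] * 5

ece_10_bins = expected_calibration_error(y_true, probs, n_bins=10)
ece_1_bin = expected_calibration_error(y_true, probs, n_bins=1)
print(ece_10_bins, ece_1_bin)
```

With ten bins the errors in the two score regions are both visible; with a single bin they cancel, because the average prediction equals the overall positive rate. This is the same reason a globally calibrated model can still be wrong in every score range.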
Standard calibration methods include Platt scaling / logistic calibration, isotonic regression, temperature scaling, and sometimes beta calibration. Platt scaling is low-variance and works well with limited calibration data; isotonic is more flexible but can overfit if each score region has few observations. Temperature scaling is common for neural network multiclass calibration.
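To make the flexibility-versus-overfitting point concrete, here is a small pure-Python sketch of isotonic calibration using the pool-adjacent-violators algorithm. It is an illustration under assumptions, not a production calibrator: the function names are hypothetical, the fitted map is a plain step function with no interpolation, and the eight-point calibration set is deliberately tiny.

```python
from itertools import groupby


def fit_isotonic(scores, labels):
    """Pool Adjacent Violators; returns [(upper_score_bound, calibrated_value), ...]."""
    pairs = sorted(zip(scores, labels))
    blocks = []  # each block: [sum_of_labels, count, max_score_in_block]
    for score, group in groupby(pairs, key=lambda t: t[0]):
        ys = [y for _, y in group]  # tied scores are pooled up front
        blocks.append([float(sum(ys)), len(ys), score])
        # Merge backwards while the previous block's mean exceeds the latest one.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count, max_score = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = max_score
    return [(b[2], b[0] / b[1]) for b in blocks]


def predict_isotonic(steps, score):
    """Step-function lookup: value of the first block whose upper bound covers the score."""
    for upper, value in steps:
        if score <= upper:
            return value
    return steps[-1][1]  # scores above the fitted range reuse the top block


# Tiny calibration set (illustrative): raw scores and observed outcomes.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
labels = [0, 0, 1, 0, 0, 1, 1, 1]
steps = fit_isotonic(scores, labels)
print(steps)
print(predict_isotonic(steps, 0.35), predict_isotonic(steps, 0.95))
```

With so few observations, the fitted map outputs exactly 0 for the lowest scores and exactly 1 for the highest, which is the kind of overconfident step that a two-parameter method like Platt scaling would smooth out.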
Calibration must be evaluated on held-out data that reflects deployment traffic. If the model was trained on one distribution and evaluated on another, miscalibration may come from covariate shift, label shift, delayed labels, or logging-policy bias. For Meta-style systems, segment calibration by country, language, device, age cohort, traffic source, and new versus returning users is often essential.
For ranking products, evaluate at the decision surface: precision@k, recall@k, NDCG, MAP, MRR, or lift in the top decile may matter more than global AUC. A model can improve AUC by better ordering easy negatives while worsening the top 1% of candidates where the product actually acts.
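The claim that AUC can rise while the top of the list gets worse can be checked on a toy ranking. The sketch below assumes binary relevance listed in descending-score order with no tied scores; the function names and the two rankings are made up for illustration.

```python
import math


def precision_at_k(rels, k):
    """Share of relevant items among the top k of a ranked relevance list."""
    return sum(rels[:k]) / k


def dcg_at_k(rels, k):
    """Discounted cumulative gain with a log2(rank + 1) discount (ranks start at 1)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))


def ndcg_at_k(rels, k):
    """DCG normalized by the DCG of the ideal ordering of the same items."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0


def auc_from_ranking(rels):
    """ROC-AUC for binary relevance in descending-score order, assuming no ties."""
    n_pos = sum(rels)
    n_neg = len(rels) - n_pos
    negatives_below = 0
    correct_pairs = 0
    for rel in reversed(rels):  # walk from the bottom of the ranking upward
        if rel == 1:
            correct_pairs += negatives_below
        else:
            negatives_below += 1
    return correct_pairs / (n_pos * n_neg)


# Same 8 candidates (3 relevant) ranked by two hypothetical models.
old_model = [1, 0, 0, 1, 0, 0, 1, 0]
new_model = [0, 1, 1, 1, 0, 0, 0, 0]

for name, rels in (("old", old_model), ("new", new_model)):
    print(name, auc_from_ranking(rels), precision_at_k(rels, 1), ndcg_at_k(rels, 1))
```

The new ranking has the higher AUC (0.8 versus 0.6) because it pushes more negatives to the bottom, but it puts an irrelevant item in the first slot, so a product that only shows the top item would regress.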
Threshold selection should reflect costs and benefits. If benefit of a true positive is $B$ and cost of a false positive is $C$, act when $pB > (1-p)C$, or $p > C/(B+C)$ under a simple decision model. In practice, thresholds may also be constrained by review capacity, user experience caps, fairness requirements, or latency budgets.
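The simple decision rule, plus one practical constraint, can be sketched as follows. This assumes the scores are calibrated probabilities and that $B$ and $C$ are in the same units; the function names and the `capacity` parameter (standing in for a review or send budget) are hypothetical.

```python
def cost_threshold(benefit_tp, cost_fp):
    """Break-even probability: act when p > C / (B + C)."""
    return cost_fp / (benefit_tp + cost_fp)


def select_actions(probs, benefit_tp, cost_fp, capacity=None):
    """Keep items above the break-even threshold, best first, up to an optional cap."""
    threshold = cost_threshold(benefit_tp, cost_fp)
    eligible = sorted((p for p in probs if p > threshold), reverse=True)
    return eligible if capacity is None else eligible[:capacity]


# Illustrative numbers: a true positive is worth 4 units, a false positive costs 1.
probs = [0.05, 0.15, 0.25, 0.5, 0.9]
print(cost_threshold(4, 1))
print(select_actions(probs, 4, 1))
print(select_actions(probs, 4, 1, capacity=2))
```

If the scores are miscalibrated, the break-even threshold no longer corresponds to break-even expected value, which is one reason calibration matters even for a binary action.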
Offline evaluation should use time-based splits when labels, behavior, or ranking policies evolve. Random splits can leak future information through user history, item popularity, or repeated entities. For high-volume systems, exact AUC requires sorting all scores, which costs $O(n \log n)$; beyond tens or hundreds of millions of examples, teams often use distributed sorting or histogram/quantile approximations.
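One way a histogram approximation of AUC can work is sketched below: count positives and negatives per score bin in a single pass, then treat same-bin pairs as ties. It assumes scores lie in [0, 1] and uses equal-width bins; the function name, bin counts, and toy data are illustrative assumptions.

```python
def histogram_auc(y_true, scores, n_bins=1000):
    """Approximate ROC-AUC from per-bin label counts; same-bin pairs count as ties."""
    pos = [0] * n_bins
    neg = [0] * n_bins
    for y, s in zip(y_true, scores):
        idx = min(int(s * n_bins), n_bins - 1)
        if y == 1:
            pos[idx] += 1
        else:
            neg[idx] += 1
    negatives_below = 0
    correct_pairs = 0.0
    for b in range(n_bins):
        # Positives in bin b beat every negative in a lower bin, and tie within the bin.
        correct_pairs += pos[b] * (negatives_below + 0.5 * neg[b])
        negatives_below += neg[b]
    return correct_pairs / (sum(pos) * sum(neg))


# Toy data whose exact AUC is 8/9.
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.05, 0.15, 0.35, 0.45, 0.65, 0.85]

fine = histogram_auc(y_true, scores, n_bins=10)
coarse = histogram_auc(y_true, scores, n_bins=2)
print(fine, coarse)
```

The pass over the data is linear and the per-bin counts are additive, so each shard can build its own histogram and the results can be summed before the final loop. The price is resolution: with only two bins the estimate drifts from 8/9 to about 0.83.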
Online validation is required because offline metrics may not capture feedback loops, interference, novelty effects, or user adaptation. A model that improves predicted click-through rate may reduce long-term retention or content quality. Meta interviewers expect guardrails: latency, crashes, spam reports, hides, unsubscribes, session quality, fairness, and ecosystem health metrics.
Monitor post-launch calibration and performance drift. Track prediction distributions, label rates, calibration by score bucket, AUC/PR-AUC, feature missingness, and segment-level deltas. Alerting only on average loss can miss localized failures, such as a locale-specific feature pipeline break or degraded performance for new users.
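A segment-level calibration check of the kind described above might look like the following sketch. The row format `(segment, predicted_prob, label)`, the function name, and the `alert_gap` and `min_count` defaults are all assumptions chosen for illustration, not recommended production values.

```python
from collections import defaultdict


def calibration_gap_by_segment(rows, alert_gap=0.05, min_count=3):
    """rows: iterable of (segment, predicted_prob, label). Gap = mean pred - observed rate."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # [sum of preds, sum of labels, count]
    for segment, p, y in rows:
        acc = sums[segment]
        acc[0] += p
        acc[1] += y
        acc[2] += 1
    report = {}
    for segment, (p_sum, y_sum, n) in sums.items():
        gap = p_sum / n - y_sum / n
        report[segment] = {
            "n": n,
            "gap": gap,
            # Skip alerts on segments too small to estimate a rate.
            "alert": n >= min_count and abs(gap) > alert_gap,
        }
    return report


# Synthetic traffic: a large well-calibrated segment and a small over-predicted one.
rows = [("returning", 0.5, i % 2) for i in range(96)]
rows += [("new", 0.7, y) for y in (1, 0, 0, 0)]

by_segment = calibration_gap_by_segment(rows)
overall = calibration_gap_by_segment([("all", p, y) for _, p, y in rows])
print(by_segment)
print(overall)
```

In this synthetic data the overall gap is under two percentage points and raises no alert, while the small "new" segment is over-predicted by 45 points. An alert on the average alone would miss it.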

Worked example

Example question: “How would you evaluate and calibrate a model that predicts whether a user will click on a notification?”

In the first 30 seconds, a strong candidate would clarify the decision: is the score used to rank notifications, decide whether to send one, estimate expected engagement, or allocate a limited notification budget? They would also ask about label definition, delayed clicks, negative examples, and whether the objective is short-term CTR, long-term retention, or reducing notification fatigue. The answer should be organized around four pillars: offline evaluation, calibration, threshold/business decisioning, and online experimentation. Offline, they would compare ROC-AUC, PR-AUC, log loss, calibration plots, and precision/recall at operational send thresholds, using a time-based holdout and segment cuts such as country, platform, active-user tenure, and notification type. For calibration, they would bucket predicted probabilities, compare predicted versus observed CTR, and apply Platt scaling or isotonic regression on a separate calibration set if the scores are systematically over- or under-confident.

A key tradeoff to flag is that optimizing CTR may increase annoying notifications and hurt long-term engagement, so the launch decision should include guardrails like disables, hides, session starts, unsubscribes, and retention. They should also explain that global calibration is not enough: a model can be calibrated overall but badly miscalibrated for low-activity users or specific notification categories. Online, they would propose an A/B test with user-level randomization, primary metric aligned to product goals, and guardrails for negative user experience. A good close would be: “If I had more time, I’d also evaluate delayed-label handling, repeated-exposure effects, and whether the model remains calibrated after the sending policy changes.”

A second angle

Example question: “A model’s offline AUC improved, but the product metric got worse after launch. How would you investigate?”

The same concepts apply, but the framing shifts from choosing metrics to diagnosing an offline-online mismatch. You would first ask whether the offline test set matched launch traffic and whether the model was evaluated on the same candidates and action thresholds used in production. Then you would check top-k and threshold-specific metrics, not just AUC, because product impact may depend only on the highest-scored items. Calibration is a likely culprit if downstream systems interpret scores as probabilities, for example in bidding, notification budgeting, or ranking score blending. You would also investigate segment regressions, logging bugs, feature skew, delayed labels, and whether the new model changed user behavior enough to invalidate the offline assumptions.

Common pitfalls

Analytical mistake: treating AUC as the universal answer. A tempting answer is “I’d use AUC because it is threshold-independent,” but that misses business constraints. A better answer explains when AUC is useful, then adds PR-AUC for rare positives, log loss/Brier score for probabilities, and precision/recall or utility at the actual operating threshold.

Communication mistake: jumping into formulas before defining the product decision. Interviewers want to see that you understand what the model is for. Start by asking whether the score is used for ranking, binary action, expected value estimation, or monitoring; then choose metrics that match that use case.

Depth mistake: saying “calibrate with isotonic regression” without discussing data and deployment constraints. Isotonic regression can overfit small calibration sets, and any calibrator fit on pre-launch data can stop holding under distribution shift. A stronger answer compares Platt scaling, isotonic regression, and temperature scaling, uses a separate calibration set, and checks calibration by segment after deployment.

Connections

Interviewers may pivot from model evaluation into A/B testing, causal inference, ranking/recommendation metrics, or fairness and segment analysis. If the model affects user exposure or creates feedback loops, expect follow-ups on experiment design, interference, counterfactual evaluation, delayed labels, or long-term metric tradeoffs.