Machine Learning Model Evaluation And Calibration

What's being tested

Interviewers are probing whether you can evaluate an ML model as a decision system, not just report a generic score like accuracy. For a Data Scientist at Meta, this means tying offline metrics, probability calibration, threshold choice, ranking quality, and online experiment outcomes to product and integrity tradeoffs. You’re expected to reason about class imbalance, label quality, cost-sensitive errors, data leakage, and whether a model improvement is real enough to ship. Strong answers separate model discrimination, calibration, business impact, and experiment design instead of blending them into one vague “performance” claim.

Core knowledge

Confusion-matrix metrics are threshold-dependent: precision = TP / (TP + FP), recall = TP / (TP + FN), FPR = FP / (FP + TN), and FNR = FN / (FN + TP). In fraud, abuse, and harmful content, accuracy is often misleading because negatives dominate.
Cost-sensitive evaluation converts classification errors into expected cost. If a false positive costs $C_{FP}$ and a false negative costs $C_{FN}$, choose the threshold that minimizes total cost $C_{FP} \cdot FP + C_{FN} \cdot FN$, or equivalently maximizes expected utility. This is often more defensible than optimizing F1, which treats precision and recall symmetrically regardless of what each error actually costs.
Precision-recall tradeoffs matter more than ROC-AUC under severe class imbalance. ROC-AUC can look strong when positives are rare because the false-positive rate is computed over a very large pool of negatives, so a number of false positives that swamps the true positives still barely moves FPR. PR-AUC, precision at a fixed recall, or recall at a fixed precision usually better reflects moderation, fraud, and ranking surfaces.
Calibration asks whether predicted probabilities match empirical frequencies: among items scored near 0.8, about 80% should be positive. Common diagnostics include calibration curves, Brier score $\frac{1}{N}\sum_i (p_i-y_i)^2$, expected calibration error (ECE), and segment-level calibration by geography, device, content type, or user cohort.
Discrimination and calibration are different. A model can rank examples well with high AUC but produce probabilities that are too extreme or too conservative. Decisions that interpret the score as a probability, such as expected-loss calculations, cost-derived thresholds, and review queues prioritized by expected harm, require calibrated probabilities, not just good ordering.
Threshold selection should be tied to the operating constraint: fixed review capacity, maximum false-positive rate, minimum recall for high-severity content, or expected profit. For review queues, rank by expected harm or expected value and choose the top $K$ cases rather than using one global threshold blindly.
Ranking evaluation uses position-aware metrics. Precision@K is the fraction of the top $K$ results that are relevant. Recall@K is the fraction of all relevant items that appear in the top $K$. NDCG@K discounts lower-ranked relevance using $DCG@K=\sum_{i=1}^{K}\frac{rel_i}{\log_2(i+1)}$ and normalizes by the DCG of the ideal ranking, so a perfect ordering scores 1. For feed or product ranking, top positions usually matter most.
Offline validation design must match deployment. Use time-aware splits when features or labels evolve over time, especially for fraud, ads, and content systems. Random splits can inflate performance if the same user, content cluster, campaign, or near-duplicate appears in both train and test.
Data leakage occurs when features contain information unavailable at prediction time, such as future clicks, post-outcome moderation actions, chargeback status, or aggregates computed using the full dataset. A strong answer names the prediction timestamp and audits every feature against “would I know this then?”
Label quality shapes the ceiling of evaluation. Harmful content and fraud labels can be delayed, noisy, policy-dependent, or biased toward reviewed examples. Evaluate on adjudicated labels where possible, track inter-rater agreement, and be explicit about censoring, delayed positives, and selection bias in labeled data.
Online experiments are needed because offline gains may not translate to user impact. A DS should propose an A/B test or staged rollout with primary metrics, guardrails, power analysis, ramp criteria, and segment checks. For risky systems, include shadow evaluation or human-review auditing before full launch.
Segment analysis prevents average-metric failures. Report metrics by country, language, device, user tenure, content type, traffic source, and severity bucket. A model that improves global PR-AUC but doubles false positives for a vulnerable cohort may be unacceptable even if the aggregate result looks positive.
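
The ranking metrics defined above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names are made up for this example, `relevance` is assumed to be a list of relevance grades already sorted by the model's ranking (position 1 first), and any grade above 0 is treated as relevant for Precision@K and Recall@K.

```python
import math

def precision_at_k(relevance, k):
    """Fraction of the top-k ranked items that are relevant (grade > 0)."""
    return sum(1 for r in relevance[:k] if r > 0) / k

def recall_at_k(relevance, k):
    """Fraction of all relevant items in the list that appear in the top k."""
    total_relevant = sum(1 for r in relevance if r > 0)
    if total_relevant == 0:
        return 0.0  # assumption: report 0 when nothing is relevant
    return sum(1 for r in relevance[:k] if r > 0) / total_relevant

def dcg_at_k(relevance, k):
    """DCG@K = sum over i = 1..K of rel_i / log2(i + 1)."""
    return sum(rel / math.log2(i + 1)
               for i, rel in enumerate(relevance[:k], start=1))

def ndcg_at_k(relevance, k):
    """DCG@K divided by the DCG@K of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevance, reverse=True), k)
    return dcg_at_k(relevance, k) / ideal if ideal > 0 else 0.0

# Toy ranked list: relevant items sit at positions 1, 3, and 6.
ranked = [1, 0, 1, 0, 0, 1]
p3 = precision_at_k(ranked, 3)   # 2 of the top 3 are relevant
r3 = recall_at_k(ranked, 3)      # 2 of the 3 relevant items are retrieved
n3 = ndcg_at_k(ranked, 3)        # below 1 because position 2 is wasted
```

Note that the ideal ordering here is built from the same candidate list, which is the usual convention when the full set of judged items is available.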

Worked example

For the prompt “Evaluate fraud classifier with cost-sensitive metrics,” start by clarifying the decision being made: are we blocking transactions, sending them to manual review, adding friction, or only ranking risk for downstream action? Then ask for the base fraud rate, false-positive cost, false-negative cost, review capacity, label delay, and whether the model score is intended to be a calibrated probability. A strong answer would organize around four pillars: offline discrimination metrics like ROC-AUC and PR-AUC, cost-sensitive thresholding, calibration and segment reliability, and safe online validation.

You might say: “Because fraud is rare, I would not lead with accuracy; I’d inspect PR-AUC, precision at operational recall levels, and expected dollar loss across thresholds.” Then compute or describe an expected-cost curve: for each threshold, estimate prevented fraud value, customer friction cost, manual review cost, and residual fraud loss. Calibration matters because if a score of 0.7 does not correspond to 70% fraud probability, expected-value decisions will be wrong even if ranking is decent.
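
The expected-cost curve described above can be sketched as a threshold sweep. This is a simplified illustration under stated assumptions: costs are collapsed into one per-error figure for false positives and one for false negatives, correct decisions cost nothing, and the scores, labels, cost values, and function names are all invented for the example.

```python
def expected_cost(scores, labels, threshold, cost_fp, cost_fn):
    """Total cost C_FP * FP + C_FN * FN when flagging scores >= threshold."""
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return cost_fp * fp + cost_fn * fn

def best_threshold(scores, labels, cost_fp, cost_fn, candidates):
    """Candidate threshold with the lowest total cost (first one on ties)."""
    return min(candidates,
               key=lambda t: expected_cost(scores, labels, t, cost_fp, cost_fn))

# Toy validation set: 1 = fraud, 0 = legitimate.
scores = [0.05, 0.10, 0.20, 0.35, 0.60, 0.80, 0.90, 0.95]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
candidates = [0.1, 0.3, 0.5, 0.7, 0.9]

# Missed fraud is expensive: the cheapest threshold is low (high recall).
t_fraud_costly = best_threshold(scores, labels, cost_fp=5, cost_fn=100,
                                candidates=candidates)

# Blocking a legitimate user is expensive: the cheapest threshold is higher.
t_friction_costly = best_threshold(scores, labels, cost_fp=100, cost_fn=5,
                                   candidates=candidates)
```

If the scores are well calibrated and correct decisions carry no cost, the same logic gives a closed-form cutoff: flag when $p \cdot C_{FN} > (1-p) \cdot C_{FP}$, that is, when $p > \frac{C_{FP}}{C_{FP}+C_{FN}}$. A large gap between this cutoff and the empirically cheapest threshold is itself a hint that calibration is off.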

One tradeoff to flag explicitly is between blocking more fraud and harming legitimate users: a high-recall threshold may reduce chargebacks but create unacceptable false positives for high-value or new users. You should also mention time-based validation because fraud patterns adapt and random cross-validation can overstate performance. Close by saying that, if you had more time, you would evaluate delayed labels, inspect top false positives/false negatives, run segment calibration, and propose a ramped experiment with guardrails like legitimate-user conversion, appeal rate, and customer-support contacts.
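
A time-based split that also respects label delay might look like the sketch below. It assumes each row is a dict with a hypothetical `event_time` field, and that fraud labels take up to `label_delay` to mature, so training rows too close to the cutoff are dropped rather than trained on with incomplete labels.

```python
from datetime import datetime, timedelta

def time_based_split(rows, cutoff, label_delay=timedelta(0),
                     timestamp_key="event_time"):
    """Train on events before (cutoff - label_delay); test on events at or
    after cutoff. Rows inside the gap are excluded from both sets."""
    train_end = cutoff - label_delay
    train = [r for r in rows if r[timestamp_key] < train_end]
    test = [r for r in rows if r[timestamp_key] >= cutoff]
    return train, test

# Toy transactions; only the timestamp matters for the split.
rows = [
    {"id": 1, "event_time": datetime(2024, 1, 1)},
    {"id": 2, "event_time": datetime(2024, 1, 10)},
    {"id": 3, "event_time": datetime(2024, 1, 20)},  # falls in the gap
    {"id": 4, "event_time": datetime(2024, 2, 5)},
]
train, test = time_based_split(rows, cutoff=datetime(2024, 2, 1),
                               label_delay=timedelta(days=14))
train_ids = [r["id"] for r in train]
test_ids = [r["id"] for r in test]
```

The gap costs some training data, but it keeps the offline estimate closer to what the model would have seen at deployment time.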

A second angle

For the prompt “Evaluate and Experiment with Harmful Content Detection Model,” the same evaluation ideas apply, but the error costs are less easily expressed in dollars and may vary by content severity. Instead of a single global threshold, you may need severity-specific operating points: very high recall for imminent harm, higher precision for borderline policy categories, and a review-queue threshold based on moderator capacity. Calibration still matters, but labels may be noisier because policy interpretation changes and reviewers disagree. The online test also needs stronger guardrails: prevalence of harmful views, false-positive takedowns, creator appeals, user reports, and quality metrics for unaffected content. A good DS answer balances safety, fairness, and product experience rather than optimizing one scalar metric.
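
A capacity-constrained review queue ranked by expected harm could be sketched as follows. The field names `p_harm` and `severity_weight` and the weights themselves are hypothetical; in practice the weights would come from policy, and the ranking is only meaningful if `p_harm` is reasonably calibrated.

```python
def build_review_queue(items, capacity):
    """Rank items by expected harm (probability times severity weight) and
    keep as many as reviewers can handle."""
    ranked = sorted(items,
                    key=lambda it: it["p_harm"] * it["severity_weight"],
                    reverse=True)
    return ranked[:capacity]

# Toy items: a low-probability but high-severity item can outrank a
# high-probability, low-severity one.
items = [
    {"id": "a", "p_harm": 0.9, "severity_weight": 1},   # expected harm 0.9
    {"id": "b", "p_harm": 0.3, "severity_weight": 10},  # expected harm 3.0
    {"id": "c", "p_harm": 0.6, "severity_weight": 2},   # expected harm 1.2
    {"id": "d", "p_harm": 0.2, "severity_weight": 1},   # expected harm 0.2
]
queue_ids = [it["id"] for it in build_review_queue(items, capacity=2)]
```

With capacity for two reviews, the queue picks `b` and `c`, even though `a` has the highest raw score.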

Common pitfalls

Pitfall: Optimizing accuracy or generic F1 without tying the metric to the decision.

This is the classic analytical mistake. If positives are 0.1% of traffic, a model that predicts “not fraud” for everything has 99.9% accuracy and zero product value. A stronger answer defines the operating objective first, then picks PR-AUC, precision at recall, recall at fixed false-positive rate, NDCG@K, or expected cost as appropriate.
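
The arithmetic behind that example is easy to verify. The sketch below uses a synthetic dataset of 1,000 transactions with a single fraud case (0.1%) and a trivial model that never flags anything; the helper name is illustrative.

```python
def basic_metrics(y_true, y_pred):
    """Accuracy, precision, and recall from binary labels and predictions."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    # Convention assumed here: report 0.0 when a denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}

y_true = [1] + [0] * 999   # one fraud case in 1,000 transactions
y_pred = [0] * 1000        # "not fraud" for everything
always_negative = basic_metrics(y_true, y_pred)
```

Accuracy comes out at 0.999 while recall is 0, which is why the metric has to follow from the operating objective rather than the other way around.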

Pitfall: Treating offline model improvement as sufficient evidence to ship.

Saying “the new model has higher AUC, so launch it” misses calibration, segment regressions, label leakage, and online behavior change. Better communication is: “Offline metrics justify moving to a controlled experiment, but launch depends on primary metric lift, guardrails, and pre-defined ramp criteria.”

Pitfall: Mentioning calibration without explaining how it affects decisions.

A shallow answer says “I’d check calibration” and stops there. A deeper answer explains that calibrated probabilities enable thresholding by expected value, fair comparison across cohorts, and stable review queues; then names diagnostics like reliability plots, Brier score, ECE, and calibration by segment.
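
Two of those diagnostics, Brier score and ECE, can be sketched directly. This is one common formulation rather than the only one: ECE here uses equal-width bins and weights each bin's gap between mean predicted probability and observed positive rate by the bin's share of examples. The function names and toy data are illustrative.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)

def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted average of |mean predicted prob - observed positive rate|
    over equal-width probability bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p = 1.0 goes in the last bin
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for b in bins:
        if not b:
            continue  # empty bins contribute nothing
        mean_p = sum(p for p, _ in b) / len(b)
        pos_rate = sum(y for _, y in b) / len(b)
        ece += (len(b) / n) * abs(mean_p - pos_rate)
    return ece

# Well calibrated: items scored 0.8 are positive 80% of the time.
ece_good = expected_calibration_error([0.8] * 5, [1, 1, 1, 1, 0])

# Overconfident: items scored 0.9 are positive only 50% of the time.
ece_bad = expected_calibration_error([0.9] * 4, [1, 0, 1, 0])
brier_bad = brier_score([0.9] * 4, [1, 0, 1, 0])
```

Running the same two functions on each segment separately (country, device, content type) gives the segment-level view; a model can look calibrated in aggregate while being badly off for a small cohort.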

Connections

Interviewers often pivot from model evaluation to experiment design, especially power, guardrail metrics, heterogeneous treatment effects, and launch decisions. They may also probe causal inference, ranking metrics, label bias, or metric design for recommender and integrity systems. Be ready to explain how offline validation, online experimentation, and product impact fit together.
