Machine Learning Model Evaluation And Calibration
Asked of: Data Scientist
What's being tested
Interviewers are probing whether you can evaluate an ML model as a decision system, not just report a generic score like accuracy. For a Data Scientist at Meta, this means tying offline metrics, probability calibration, threshold choice, ranking quality, and online experiment outcomes to product and integrity tradeoffs. You’re expected to reason about class imbalance, label quality, cost-sensitive errors, data leakage, and whether a model improvement is real enough to ship. Strong answers separate model discrimination, calibration, business impact, and experiment design instead of blending them into one vague “performance” claim.
Core knowledge
- Confusion-matrix metrics are threshold-dependent: precision = TP / (TP + FP), recall = TP / (TP + FN), FPR = FP / (FP + TN), and FNR = FN / (FN + TP). In fraud, abuse, and harmful content, accuracy is often misleading because negatives dominate.
- Cost-sensitive evaluation converts classification errors into expected value. If false positives cost $c_{FP}$ each and false negatives cost $c_{FN}$ each, choose a threshold that minimizes total expected cost, $c_{FP} \cdot FP + c_{FN} \cdot FN$, or equivalently maximizes expected utility. This is often more defensible than optimizing F1.
- Precision-recall tradeoffs matter more than ROC-AUC under severe class imbalance. ROC-AUC can look strong when positives are rare because the false-positive rate is diluted by the large number of true negatives; PR-AUC, precision at a fixed recall, or recall at a fixed precision usually better reflects moderation, fraud, and ranking surfaces.
- Calibration asks whether predicted probabilities match empirical frequencies: among items scored near 0.8, about 80% should be positive. Common diagnostics include calibration curves, Brier score $\frac{1}{N}\sum_{i=1}^{N}(p_i - y_i)^2$, expected calibration error (ECE), and segment-level calibration by geography, device, content type, or user cohort.
- Discrimination and calibration are different. A model can rank examples well with high AUC but produce probabilities that are too extreme or too conservative. Decisions that interpret the score as a probability, such as expected-loss thresholds and review queues ranked by expected harm, require calibrated probabilities, not just good ordering.
- Threshold selection should be tied to the operating constraint: fixed review capacity, maximum false-positive rate, minimum recall for high-severity content, or expected profit. For review queues, rank by expected harm or expected value and choose the top cases rather than using one global threshold blindly.
- Ranking evaluation uses top-K and position-aware metrics. Precision@K measures how many of the top K results are relevant, Recall@K measures the share of all relevant items that appear in the top K, and NDCG@K discounts lower-ranked relevance using $1/\log_2(i+1)$ at rank $i$ and normalizes by the ideal ranking. For feed or product ranking, top positions usually matter most.
- Offline validation design must match deployment. Use time-aware splits when features or labels evolve over time, especially for fraud, ads, and content systems. Random splits can inflate performance if the same user, content cluster, campaign, or near-duplicate appears in both train and test.
- Data leakage occurs when features contain information unavailable at prediction time, such as future clicks, post-outcome moderation actions, chargeback status, or aggregates computed using the full dataset. A strong answer names the prediction timestamp and audits every feature against “would I know this then?”
- Label quality shapes the ceiling of evaluation. Harmful content and fraud labels can be delayed, noisy, policy-dependent, or biased toward reviewed examples. Evaluate on adjudicated labels where possible, track inter-rater agreement, and be explicit about censoring, delayed positives, and selection bias in labeled data.
- Online experiments are needed because offline gains may not translate to user impact. A DS should propose an A/B test or staged rollout with primary metrics, guardrails, power analysis, ramp criteria, and segment checks. For risky systems, include shadow evaluation or human-review auditing before full launch.
- Segment analysis prevents average-metric failures. Report metrics by country, language, device, user tenure, content type, traffic source, and severity bucket. A model that improves global PR-AUC but doubles false positives for a vulnerable cohort may be unacceptable even if the aggregate result looks positive.
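The ranking metrics in the list above can be made concrete with a minimal sketch. It assumes graded relevance labels given in ranked order, treats any label above zero as relevant for Precision@K and Recall@K, and computes the ideal ranking from the same list (so it assumes the list holds every candidate item). The function names are illustrative, not taken from any particular library.

```python
import math

def precision_at_k(relevance, k):
    """Fraction of the top-k positions that hold a relevant item."""
    return sum(1 for r in relevance[:k] if r > 0) / k

def recall_at_k(relevance, k):
    """Share of all relevant items in the list that appear in the top k."""
    total_relevant = sum(1 for r in relevance if r > 0)
    if total_relevant == 0:
        return 0.0
    return sum(1 for r in relevance[:k] if r > 0) / total_relevant

def dcg_at_k(relevance, k):
    """Discounted cumulative gain with a 1 / log2(rank + 1) discount."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevance[:k], start=1))

def ndcg_at_k(relevance, k):
    """DCG divided by the DCG of the ideal (descending relevance) ordering."""
    ideal = dcg_at_k(sorted(relevance, reverse=True), k)
    return dcg_at_k(relevance, k) / ideal if ideal > 0 else 0.0

# Relevance labels in the order the model ranked the items (illustrative data).
ranked_relevance = [3, 0, 2, 0, 1]
print(precision_at_k(ranked_relevance, 3))
print(recall_at_k(ranked_relevance, 3))
print(ndcg_at_k(ranked_relevance, 3))
```

Precision@K and Recall@K ignore order within the top K, while NDCG@K rewards placing the most relevant items first, which is why it is usually the better fit when top positions dominate.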
Worked example
For “Evaluate fraud classifier with cost-sensitive metrics”, start by clarifying the decision being made: are we blocking transactions, sending them to manual review, adding friction, or only ranking risk for downstream action? Then ask for the base fraud rate, false-positive cost, false-negative cost, review capacity, label delay, and whether the model score is intended to be a calibrated probability. A strong answer would organize around four pillars: offline discrimination metrics like ROC-AUC and PR-AUC, cost-sensitive thresholding, calibration and segment reliability, and safe online validation.
You might say: “Because fraud is rare, I would not lead with accuracy; I’d inspect PR-AUC, precision at operational recall levels, and expected dollar loss across thresholds.” Then compute or describe an expected-cost curve: for each threshold, estimate prevented fraud value, customer friction cost, manual review cost, and residual fraud loss. Calibration matters because if a score of 0.7 does not correspond to 70% fraud probability, expected-value decisions will be wrong even if ranking is decent.
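The expected-cost curve described above can be sketched as follows. This is a simplified illustration: it assumes a single per-error cost for false positives and false negatives (folding friction, review, and residual fraud loss into those two numbers), and the scores, labels, and dollar amounts are made up for the example.

```python
def expected_cost(scores, labels, threshold, cost_fp, cost_fn):
    """Total cost of flagging every case whose score is at or above threshold."""
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return cost_fp * fp + cost_fn * fn

def best_threshold(scores, labels, candidates, cost_fp, cost_fn):
    """Candidate threshold with the lowest total cost on this labeled sample."""
    return min(candidates,
               key=lambda t: expected_cost(scores, labels, t, cost_fp, cost_fn))

# Illustrative validation data: model scores and true fraud labels.
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.90]
labels = [0, 0, 0, 1, 0, 0, 1, 1]
candidates = [0.1, 0.3, 0.5, 0.7, 0.9]

# Hypothetical costs: $5 of friction per false positive, $100 lost per missed fraud.
asymmetric = best_threshold(scores, labels, candidates, cost_fp=5, cost_fn=100)
symmetric = best_threshold(scores, labels, candidates, cost_fp=1, cost_fn=1)
print(asymmetric, symmetric)
```

With the asymmetric costs the lowest-cost threshold drops well below the one chosen under equal costs, which is the point of the exercise. If the scores were perfectly calibrated, the same logic gives a closed-form rule: flag when the predicted probability is at least $c_{FP} / (c_{FP} + c_{FN})$.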
One tradeoff to flag explicitly is between blocking more fraud and harming legitimate users: a high-recall threshold may reduce chargebacks but create unacceptable false positives for high-value or new users. You should also mention time-based validation because fraud patterns adapt and random cross-validation can overstate performance. Close by saying that, if you had more time, you would evaluate delayed labels, inspect top false positives/false negatives, run segment calibration, and propose a ramped experiment with guardrails like legitimate-user conversion, appeal rate, and customer-support contacts.
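A minimal sketch of the time-based validation mentioned above, assuming each record carries an event date and a user identifier. The field names (`event_date`, `user_id`, `is_fraud`) and the cutoff are hypothetical; the second helper surfaces entities that appear on both sides of the split, which is one place inflated offline performance can hide.

```python
from datetime import date

def time_based_split(records, cutoff):
    """Train on events strictly before the cutoff, test on events at or after it."""
    train = [r for r in records if r["event_date"] < cutoff]
    test = [r for r in records if r["event_date"] >= cutoff]
    return train, test

def shared_entities(train, test, key):
    """Entities (for example users) present in both train and test."""
    return {r[key] for r in train} & {r[key] for r in test}

# Illustrative records only.
records = [
    {"user_id": "u1", "event_date": date(2024, 1, 5), "is_fraud": 0},
    {"user_id": "u2", "event_date": date(2024, 1, 20), "is_fraud": 1},
    {"user_id": "u1", "event_date": date(2024, 2, 10), "is_fraud": 0},
    {"user_id": "u3", "event_date": date(2024, 2, 15), "is_fraud": 1},
]

train, test = time_based_split(records, cutoff=date(2024, 2, 1))
overlap = shared_entities(train, test, key="user_id")
print(len(train), len(test), overlap)
```

Whether overlapping users should be dropped from the test set depends on deployment: if the production model will score returning users, keeping them may be the more faithful evaluation, but the overlap should at least be reported.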
A second angle
For “Evaluate and Experiment with Harmful Content Detection Model”, the same evaluation ideas apply, but the error costs are less easily expressed in dollars and may vary by content severity. Instead of a single global threshold, you may need severity-specific operating points: very high recall for imminent harm, higher precision for borderline policy categories, and a review-queue threshold based on moderator capacity. Calibration still matters, but labels may be noisier because policy interpretation changes and reviewers disagree. The online test also needs stronger guardrails: prevalence of harmful views, false-positive takedowns, creator appeals, user reports, and quality metrics for unaffected content. A good DS answer balances safety, fairness, and product experience rather than optimizing one scalar metric.
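The severity-specific operating points and capacity-limited review queue described above might look like the sketch below. Everything specific here is an assumption made for illustration: the category names, the per-category thresholds, the severity weights, and the `p_violation` field, which is taken to be a calibrated probability.

```python
# Hypothetical operating points: a low bar (high recall) for imminent harm,
# a high bar (high precision) for borderline policy categories.
REVIEW_THRESHOLD = {"imminent_harm": 0.10, "borderline": 0.60}
# Hypothetical relative harm weights used to rank the queue.
SEVERITY_WEIGHT = {"imminent_harm": 10.0, "borderline": 1.0}

def build_review_queue(items, capacity):
    """Keep items above their category threshold, rank by expected harm,
    and return at most `capacity` of them."""
    eligible = [it for it in items
                if it["p_violation"] >= REVIEW_THRESHOLD[it["category"]]]
    ranked = sorted(
        eligible,
        key=lambda it: it["p_violation"] * SEVERITY_WEIGHT[it["category"]],
        reverse=True,
    )
    return ranked[:capacity]

# Illustrative scored items.
items = [
    {"id": "a", "category": "borderline", "p_violation": 0.90},
    {"id": "b", "category": "imminent_harm", "p_violation": 0.15},
    {"id": "c", "category": "borderline", "p_violation": 0.40},
    {"id": "d", "category": "imminent_harm", "p_violation": 0.05},
    {"id": "e", "category": "imminent_harm", "p_violation": 0.50},
]

queue_ids = [it["id"] for it in build_review_queue(items, capacity=2)]
print(queue_ids)
```

Ranking by probability times severity only makes sense if the probabilities are calibrated within each category; otherwise a systematically overconfident category crowds out the rest of the queue.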
Common pitfalls
Pitfall: Optimizing accuracy or generic F1 without tying the metric to the decision.
This is the classic analytical mistake. If positives are 0.1% of traffic, a model that predicts “not fraud” for everything has 99.9% accuracy and zero product value. A stronger answer defines the operating objective first, then picks PR-AUC, precision at recall, recall at fixed false-positive rate, NDCG@K, or expected cost as appropriate.
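The arithmetic behind that example can be checked with a short sketch. The traffic volume of 100,000 transactions is an assumption chosen to make the 0.1% base rate concrete, and reporting precision as 0.0 when nothing is flagged is a convention adopted here (it is strictly undefined).

```python
def confusion_metrics(tp, fp, fn, tn):
    """Accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Precision is undefined when nothing is flagged; reported as 0.0 here.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}

# 100,000 transactions at a 0.1% fraud rate, scored by a model that
# predicts "not fraud" for everything.
n_total = 100_000
n_fraud = 100
always_negative = confusion_metrics(tp=0, fp=0, fn=n_fraud, tn=n_total - n_fraud)
print(always_negative)
```

Accuracy comes out at 99.9% while recall is zero, so every fraudulent transaction is missed.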
Pitfall: Treating offline model improvement as sufficient evidence to ship.
Saying “the new model has higher AUC, so launch it” misses calibration, segment regressions, label leakage, and online behavior change. Better communication is: “Offline metrics justify moving to a controlled experiment, but launch depends on primary metric lift, guardrails, and pre-defined ramp criteria.”
Pitfall: Mentioning calibration without explaining how it affects decisions.
A shallow answer says “I’d check calibration” and stops there. A deeper answer explains that calibrated probabilities enable thresholding by expected value, fair comparison across cohorts, and stable review queues; then names diagnostics like reliability plots, Brier score, ECE, and calibration by segment.
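Two of the diagnostics named above, Brier score and ECE, can be sketched in plain Python. This version of ECE uses equal-width bins and weights each bin by its share of examples; the bin count of 10 is a common default rather than a requirement, and the data is illustrative.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)

def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted average gap between mean predicted probability and
    observed positive rate, over equal-width probability bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_pred = sum(p for p, _ in bucket) / len(bucket)
        frac_pos = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / n) * abs(mean_pred - frac_pos)
    return ece

# Ten items of which eight are truly positive (illustrative data).
labels = [1] * 8 + [0] * 2
well_calibrated = [0.8] * 10   # predicts 80%, observes 80%
overconfident = [0.99] * 10    # predicts 99%, observes 80%

print(expected_calibration_error(well_calibrated, labels))
print(expected_calibration_error(overconfident, labels))
print(brier_score(well_calibrated, labels), brier_score(overconfident, labels))
```

Both score vectors rank the ten items identically (all tied), so a ranking metric cannot tell them apart, yet the overconfident scores would overstate expected fraud value by roughly a quarter. Running the same functions per segment gives the segment-level calibration check.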
Connections
Interviewers often pivot from model evaluation to experiment design, especially power, guardrail metrics, heterogeneous treatment effects, and launch decisions. They may also probe causal inference, ranking metrics, label bias, or metric design for recommender and integrity systems. Be ready to explain how offline validation, online experimentation, and product impact fit together.
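Since the pivot to experiment design often starts with power, here is a rough sketch of a per-arm sample-size calculation for comparing two proportions. It uses the standard normal-approximation formula for a two-sided test; the baseline rate, the lift, and the defaults for `alpha` and `power` are assumptions for illustration, and a real experiment would also need to account for multiple metrics, clustering by user, and ramp stages.

```python
import math
from statistics import NormalDist

def sample_size_per_arm(p_control, p_treatment, alpha=0.05, power=0.80):
    """Approximate users needed per arm to detect the difference between
    two proportions with a two-sided z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)

# Hypothetical guardrail metric: conversion moving from 10% to 11%.
n_small_lift = sample_size_per_arm(0.10, 0.11)
# A larger assumed lift needs far fewer users.
n_large_lift = sample_size_per_arm(0.10, 0.13)
print(n_small_lift, n_large_lift)
```

The required sample grows with the inverse square of the effect size, which is why small offline gains can be expensive to confirm online.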
Further reading
- “The Relationship Between Precision-Recall and ROC Curves”, Davis and Goadrich, 2006: foundational paper for understanding why PR curves are often preferable under class imbalance.
- “Predicting Good Probabilities with Supervised Learning”, Niculescu-Mizil and Caruana, 2005: practical treatment of calibration methods and why high-ranking performance does not imply reliable probabilities.
- “Hidden Technical Debt in Machine Learning Systems”, Sculley et al., 2015: useful context on leakage, feedback loops, and production model evaluation risks from an ML systems perspective.
Practice questions
- Choose threshold under asymmetric costs (Meta · Data Scientist · Onsite · Medium)
- Replace legacy ads model safely (Meta · Data Scientist · Onsite · Hard)
- Detect leakage and evaluate a prediction model (Meta · Data Scientist · Onsite · Hard)
- Evaluate fraud classifier with cost-sensitive metrics (Meta · Data Scientist · Technical Screen · Hard)
- Evaluate Product-Ranking Algorithm with Precision and Recall Metrics (Meta · Data Scientist · Onsite · Medium)
- Evaluate New Model's Performance Against Existing System (Meta · Data Scientist · Technical Screen · Medium)
- Evaluate and Experiment with Harmful Content Detection Model (Meta · Data Scientist · Technical Screen · Medium)
Related concepts
- Classifier Evaluation, Calibration, And Thresholding
- ML Model Evaluation, Metrics, And Experimentation (ML System Design)
- Machine Learning Model Design And Evaluation (Machine Learning)
- ML Model Evaluation And Calibration
- Model Evaluation and Calibration
- Applied Machine Learning Modeling And Evaluation (Machine Learning)