Classifier Evaluation And Cost-Sensitive Thresholds

What's being tested

Interviewers are probing whether you can evaluate a binary classifier as a decision system, not just as a model with a high AUC. For Meta Data Scientist roles, this matters because classifiers often mediate product decisions with asymmetric costs: removing harmful content, flagging fraud, ranking likely interests, or escalating items for review. A strong answer connects offline evaluation, probability calibration, threshold selection, class imbalance, and online experimentation into one coherent framework. The bar is not “name precision and recall”; it is “choose the right metric and operating point for the product objective, quantify tradeoffs, and identify failure modes before launch.”

Core knowledge

Confusion matrix terms are the foundation: true positives, false positives, true negatives, and false negatives. From these derive precision = TP / (TP + FP), recall = TP / (TP + FN), FPR = FP / (FP + TN), and specificity = TN / (TN + FP).
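
As a minimal sketch of the four formulas above, assuming binary labels encoded as 0/1 and hard predictions already produced at some threshold (the function names `confusion_counts` and `basic_metrics` are illustrative, not from any library):

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for 0/1 labels and 0/1 predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def safe_div(numerator, denominator):
    """Return 0.0 instead of raising when a denominator is empty."""
    return numerator / denominator if denominator else 0.0


def basic_metrics(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return {
        "precision": safe_div(tp, tp + fp),
        "recall": safe_div(tp, tp + fn),
        "fpr": safe_div(fp, fp + tn),
        "specificity": safe_div(tn, tn + fp),
    }


# Toy data: 3 positives, 5 negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = basic_metrics(y_true, y_pred)
```

Note that FPR and specificity always sum to 1, so reporting both is redundant.
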
Class imbalance changes metric interpretation. In fraud, harmful content, and rare-event prediction, accuracy can be misleading: if only 1% of examples are positive, a model that predicts “negative” for everyone achieves 99% accuracy while catching nothing. Prefer PR-AUC, precision at fixed recall, recall at fixed precision, or cost-weighted utility.
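ROC-AUC measures ranking quality across all thresholds and is insensitive to class prevalence. That makes it stable across datasets with different base rates, but it can hide poor precision when positives are rare, because even a small false-positive rate applied to a very large negative class produces many false positives. For rare positives, PR-AUC is usually more product-relevant because it directly reflects positive-class purity: how many reviewed or acted-on items are actually problematic.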
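
The prevalence point can be illustrated with a small sketch. It computes ROC-AUC as the fraction of positive/negative pairs ranked correctly, and average precision as a simple stand-in for PR-AUC. Both functions are hand-rolled for illustration, the toy scores are made up, and the average-precision helper assumes no score ties between positives and negatives.

```python
def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Mean of precision values at the rank of each positive."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


# Balanced-ish toy set.
y_bal = [1, 0, 1, 0, 0]
s_bal = [0.9, 0.8, 0.7, 0.4, 0.2]

# Same score distributions per class, but 10x as many negatives.
neg_scores = [s for t, s in zip(y_bal, s_bal) if t == 0] * 10
y_imb = [1, 1] + [0] * len(neg_scores)
s_imb = [0.9, 0.7] + neg_scores

auc_bal, auc_imb = roc_auc(y_bal, s_bal), roc_auc(y_imb, s_imb)
ap_bal, ap_imb = average_precision(y_bal, s_bal), average_precision(y_imb, s_imb)
```

On this toy data ROC-AUC is unchanged when negatives are replicated, while average precision falls, which is the behavior the paragraph above describes.
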
Threshold selection should optimize a decision objective, not default to 0.5. If predicted probability is calibrated and costs are known, act when expected benefit exceeds expected cost:
$p(y=1 \mid x) \cdot B_{TP} - (1 - p(y=1 \mid x)) \cdot C_{FP} > C_{action}$
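Equivalently, solving for the probability gives the threshold $p^* = (C_{FP} + C_{action}) / (B_{TP} + C_{FP})$: act when $p(y=1 \mid x) > p^*$. With no fixed action cost and $B_{TP}$ equal to the avoided false-negative cost $C_{FN}$, this reduces to $p^* = C_{FP} / (C_{FP} + C_{FN})$, so the threshold depends only on the relative costs of false positives and false negatives.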
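
A minimal sketch of the expected-value rule above, assuming calibrated probabilities and known costs. The argument names mirror $B_{TP}$, $C_{FP}$, and $C_{action}$; the dollar values in the example are invented for illustration.

```python
def should_act(p, benefit_tp, cost_fp, cost_action=0.0):
    """Apply the rule directly: p * B_TP - (1 - p) * C_FP > C_action."""
    return p * benefit_tp - (1 - p) * cost_fp > cost_action


def decision_threshold(benefit_tp, cost_fp, cost_action=0.0):
    """Solve the same inequality for p to get a probability cutoff."""
    denominator = benefit_tp + cost_fp
    if denominator <= 0:
        raise ValueError("benefit_tp + cost_fp must be positive")
    return (cost_fp + cost_action) / denominator


# Hypothetical costs: a caught fraud saves 100, a false alarm costs 10,
# and every action costs 1 regardless of outcome.
threshold = decision_threshold(benefit_tp=100, cost_fp=10, cost_action=1)
```

With these made-up numbers the cutoff lands near 0.1, far below 0.5, because missing a positive is much more expensive than a false alarm.
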
Cost-sensitive evaluation translates product harm into measurable tradeoffs. For fraud, a false negative may lose money and increase ecosystem risk; a false positive may block a legitimate user. For harmful content, false negatives create safety risk; false positives can suppress valid speech and reduce creator trust.
Calibration means predicted scores correspond to empirical probabilities: among examples scored near 0.8, about 80% should be positive. Check reliability diagrams, calibration curves, Brier score, and expected calibration error. Ranking metrics can be good while calibration is poor, which breaks probability-based thresholding.
Calibration methods include Platt scaling, isotonic regression, and temperature scaling for neural models. Fit the calibrator on a held-out calibration set, not the data used to train the model. Isotonic regression is more flexible but can overfit on small samples; Platt scaling is more constrained, which makes it more stable on small samples but unable to correct miscalibration that is not sigmoid-shaped.
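
Two of the calibration checks mentioned above, Brier score and expected calibration error, can be sketched in a few lines. This version uses equal-width bins and a default of 10 bins, both of which are illustrative choices rather than a standard; the data is a toy example.

```python
def brier_score(y_true, probs):
    """Mean squared error between predicted probability and 0/1 outcome."""
    return sum((p - t) ** 2 for t, p in zip(y_true, probs)) / len(y_true)


def expected_calibration_error(y_true, probs, n_bins=10):
    """Weighted average gap between mean confidence and observed rate per bin."""
    bins = [[] for _ in range(n_bins)]
    for t, p in zip(y_true, probs):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((t, p))
    total = len(y_true)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_confidence = sum(p for _, p in bucket) / len(bucket)
        observed_rate = sum(t for t, _ in bucket) / len(bucket)
        ece += len(bucket) / total * abs(mean_confidence - observed_rate)
    return ece


probs = [0.8] * 10
well_calibrated = [1] * 8 + [0] * 2   # 80% positive among scores of 0.8
overconfident = [1] * 5 + [0] * 5     # only 50% positive among scores of 0.8

ece_good = expected_calibration_error(well_calibrated, probs)
ece_bad = expected_calibration_error(overconfident, probs)
```

The same scores give a near-zero calibration error in the first case and a gap of about 0.3 in the second, even though the ranking of examples is identical.
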
Capacity constraints often dominate thresholding. If only 50,000 posts per day can be human-reviewed, the operating point may be top-K by score rather than all examples above a fixed probability. Then evaluate precision@K, recall@K, reviewer yield, and marginal utility at the cutoff.
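
A sketch of capacity-based evaluation, assuming the review queue simply takes the K highest-scoring items. The function name and toy data are illustrative.

```python
def precision_recall_at_k(y_true, scores, k):
    """Precision and recall when only the top-k scored items are acted on."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    top = ranked[:k]
    true_positives = sum(label for _, label in top)
    total_positives = sum(y_true)
    precision = true_positives / len(top) if top else 0.0
    recall = true_positives / total_positives if total_positives else 0.0
    return precision, recall


y_true = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]

# With capacity for three reviews, two of the three reviewed items are positive.
p_at_3, r_at_3 = precision_recall_at_k(y_true, scores, k=3)
```

Sweeping `k` across plausible capacity levels shows how reviewer yield changes as the queue grows.
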
Offline validation must match deployment reality. Use time-based splits for systems with temporal drift, especially fraud or abuse models where adversaries adapt. Random splits can leak future behavior into training and inflate performance. For user-level prediction, split by user when repeated observations cause dependence.
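
The two split strategies above might look like the following sketch. The row layout and the field names `ts` and `user_id` are hypothetical.

```python
def time_based_split(rows, cutoff, time_key="ts"):
    """Train on rows strictly before the cutoff, evaluate on the rest."""
    train = [row for row in rows if row[time_key] < cutoff]
    test = [row for row in rows if row[time_key] >= cutoff]
    return train, test


def user_level_split(rows, test_users, user_key="user_id"):
    """Keep every row for a given user on one side of the split."""
    held_out = set(test_users)
    train = [row for row in rows if row[user_key] not in held_out]
    test = [row for row in rows if row[user_key] in held_out]
    return train, test


rows = [
    {"ts": 1, "user_id": "a"},
    {"ts": 2, "user_id": "b"},
    {"ts": 3, "user_id": "a"},
    {"ts": 4, "user_id": "c"},
]
train_t, test_t = time_based_split(rows, cutoff=3)
train_u, test_u = user_level_split(rows, test_users=["a"])
```

In the time-based split user "a" appears on both sides, so if both drift and repeated observations matter, the two constraints need to be combined.
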
Label quality is part of model evaluation. Human moderation labels may be noisy, delayed, policy-dependent, or sampled from previously flagged content. Estimate inter-rater agreement, audit uncertain regions near the threshold, and distinguish true model errors from ambiguous or stale labels.
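
One common way to estimate inter-rater agreement is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The source does not name a specific statistic, so treat this choice, and the toy labels, as an assumption.

```python
def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on the same items."""
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    expected = sum(
        (rater_a.count(label) / n) * (rater_b.count(label) / n)
        for label in labels
    )
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Two reviewers labeling the same ten items; they disagree on two of them.
reviewer_1 = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
reviewer_2 = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
kappa = cohens_kappa(reviewer_1, reviewer_2)
```

Raw agreement here is 80%, but kappa is noticeably lower because about half of that agreement would be expected by chance given the label frequencies.
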
Online experimentation should measure product and safety outcomes, not just model metrics. A model can improve offline PR-AUC while hurting DAU, appeals, trust, revenue, or reviewer workload. Use an A/B test, staged rollout, or shadow evaluation depending on risk, with guardrail metrics for user harm.
Segment analysis is essential at Meta scale. Evaluate by country, language, device, account age, creator size, traffic source, and protected or fairness-relevant groups where appropriate. Aggregate wins can hide severe degradation in small but important cohorts, especially when base rates differ.
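
A sketch of segment-level evaluation, assuming each record carries a segment label alongside the true label and the prediction. The segment names and data are invented to show an aggregate hiding a weak cohort.

```python
from collections import defaultdict


def metrics_by_segment(records):
    """records: iterable of (segment, y_true, y_pred) with 0/1 labels."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0, "tn": 0})
    for segment, y_true, y_pred in records:
        if y_true == 1 and y_pred == 1:
            counts[segment]["tp"] += 1
        elif y_true == 0 and y_pred == 1:
            counts[segment]["fp"] += 1
        elif y_true == 1 and y_pred == 0:
            counts[segment]["fn"] += 1
        else:
            counts[segment]["tn"] += 1
    result = {}
    for segment, c in counts.items():
        flagged = c["tp"] + c["fp"]
        positives = c["tp"] + c["fn"]
        result[segment] = {
            "precision": c["tp"] / flagged if flagged else 0.0,
            "recall": c["tp"] / positives if positives else 0.0,
            "n": sum(c.values()),
        }
    return result


records = [
    ("en", 1, 1), ("en", 1, 1), ("en", 0, 0), ("en", 0, 0),
    ("xx", 1, 0), ("xx", 1, 1), ("xx", 0, 1), ("xx", 0, 0),
]
by_segment = metrics_by_segment(records)
```

Reporting `n` per segment matters because small cohorts produce noisy estimates, so a gap should be read alongside its sample size.
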

Worked example

For “Evaluate fraud classifier with cost-sensitive metrics,” a strong candidate would start by clarifying the decision: are we blocking transactions, sending them to review, adding friction, or merely ranking risk for another system? They would ask for the base fraud rate, estimated dollar loss per false negative, user or revenue cost per false positive, review capacity, label delay, and whether scores are calibrated probabilities. The answer can then be organized around four pillars: offline evaluation, cost-sensitive thresholding, calibration, and online rollout.

For offline evaluation, they would avoid accuracy and report PR-AUC, recall at a fixed false-positive rate, precision at review capacity, and expected dollar utility by threshold. For thresholding, they would write the expected-value rule: act if expected fraud loss avoided exceeds customer friction, review cost, and false-positive cost. They would explicitly separate ranking quality from calibration: ROC-AUC may show the model orders risky users well, but if probabilities are miscalibrated, a threshold like p > 0.7 may be meaningless.
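
The "expected dollar utility by threshold" idea can be sketched as a sweep over candidate thresholds, totaling the cost of missed fraud, false positives, and reviews at each one. All dollar figures, parameter names, and scores below are hypothetical.

```python
def total_cost(y_true, scores, threshold, loss_per_fn, cost_per_fp, review_cost):
    """Dollar cost of acting on every item scored at or above the threshold."""
    cost = 0.0
    for label, score in zip(y_true, scores):
        flagged = score >= threshold
        if flagged:
            cost += review_cost          # every flagged item is reviewed
            if label == 0:
                cost += cost_per_fp      # friction for a legitimate user
        elif label == 1:
            cost += loss_per_fn          # missed fraud
    return cost


def best_threshold(y_true, scores, candidates, **costs):
    """Pick the candidate threshold with the lowest total cost."""
    return min(candidates, key=lambda t: total_cost(y_true, scores, t, **costs))


y_true = [1, 0, 1, 0, 0, 0]
scores = [0.95, 0.6, 0.4, 0.3, 0.2, 0.1]
costs = {"loss_per_fn": 100, "cost_per_fp": 5, "review_cost": 1}
candidates = [0.9, 0.5, 0.35, 0.05]

chosen = best_threshold(y_true, scores, candidates, **costs)
```

On this toy data the cheapest threshold is well below 0.5, because one missed fraud costs far more than several false alarms.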

One design tradeoff to flag is conservative launch versus fraud capture: a lower threshold catches more fraud but may block legitimate users, so the best answer might use multiple actions, such as auto-block at very high risk, human review in the middle, and allow at low risk. They would close by proposing a shadow test or limited A/B test, monitoring fraud loss, false-positive appeals, chargebacks, user retention, and segment-level regressions. If they had more time, they would audit label delay and adversarial drift because fraud labels often arrive after the transaction and attacker behavior changes after deployment.
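
The multi-action design above can be sketched as a tiered policy over a calibrated probability. The cutoffs 0.95 and 0.60 and the action names are placeholders; in practice each cutoff would come from the cost analysis and review capacity.

```python
def choose_action(p, block_at=0.95, review_at=0.60):
    """Map a calibrated fraud probability to one of three actions."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    if review_at > block_at:
        raise ValueError("review_at must not exceed block_at")
    if p >= block_at:
        return "auto_block"
    if p >= review_at:
        return "human_review"
    return "allow"


actions = [choose_action(p) for p in (0.99, 0.70, 0.10)]
```

Splitting the decision this way lets the two cutoffs be tuned separately: the upper one against false-positive cost, the lower one against review capacity.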

A second angle

For “Evaluate and Experiment with Harmful Content Detection Model,” the same evaluation logic applies, but the cost structure is more policy- and safety-sensitive than purely financial. False negatives may expose users to harmful content, while false positives may remove legitimate expression or disproportionately affect certain communities. Because harmful content is often rare, PR-AUC, recall at high precision, and reviewer precision@K matter more than aggregate accuracy.

The online experiment is also more delicate: if the model is risky, you may first run in shadow mode, compare decisions against the existing moderation system, and sample disagreements for human audit. Thresholds may differ by severity tier: terrorist content, self-harm content, spam, and borderline misinformation should not share one global cutoff. The transferable idea is that classifier evaluation is always a product decision under uncertainty, but the objective function and guardrails change by domain.

Common pitfalls

Pitfall: Optimizing the wrong metric.

A tempting answer is “use F1 because it balances precision and recall.” That is often too generic: F1 weights precision and recall equally, ignores true negatives, and says nothing about business value. A stronger answer defines the cost of each outcome and chooses Fβ, fixed-precision recall, fixed-recall precision, precision@K, or expected utility accordingly.
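
For reference, a short sketch of Fβ, where β > 1 weights recall more heavily and β < 1 weights precision more heavily. The example precision and recall values are made up.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta=1 gives the usual F1."""
    if precision == 0 and recall == 0:
        return 0.0
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# A high-precision, low-recall model looks different under each weighting.
precision, recall = 0.9, 0.3
f_half = f_beta(precision, recall, beta=0.5)
f_one = f_beta(precision, recall, beta=1.0)
f_two = f_beta(precision, recall, beta=2.0)
```

Because this model has high precision and low recall, the recall-weighted F2 scores it lowest and the precision-weighted F0.5 scores it highest.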

Pitfall: Treating model score as a probability without checking calibration.

Many candidates say “set threshold at 0.8 for high confidence,” but a score of 0.8 is not necessarily an 80% probability. If the interviewer gives calibrated scores, you can use decision theory directly; otherwise, discuss calibration curves, Brier score, and calibrating on a held-out set before using probability thresholds.

Pitfall: Ignoring distribution shift and leakage.

A model can look excellent offline because features include future information, labels leak from the existing enforcement system, or the validation split randomly mixes future behavior into training. A better answer asks how labels are generated, whether features are available at decision time, and whether validation is time-aware and segment-aware.

Connections

Interviewers may pivot from this topic into experiment design, especially how to launch a new classifier safely with guardrail metrics and heterogeneous treatment effects. They may also move into causal inference if model interventions change the labels you later observe, or into ranking evaluation when the classifier is used to prioritize a limited review queue. Adjacent statistical topics include threshold confidence intervals, power analysis for rare events, and bias/fairness diagnostics across cohorts.
