Classification Thresholds, Imbalanced Learning And Risk

What's being tested

Interviewers are probing whether you can turn a probabilistic classifier into a risk-aware decision system, especially when positives are rare and mistakes have asymmetric costs. For a Data Scientist at TikTok, this matters in settings like fraud, spam, policy-risk detection, creator churn, user retention, and ads quality, where a model score is not the same thing as an action. Strong answers show fluency in class imbalance, threshold selection, cost-sensitive evaluation, calibration, and time-aware validation. The interviewer is not just asking “which model performs best,” but whether you can define the right objective, diagnose failure modes, and recommend a threshold that matches product constraints.

Core knowledge

Classification thresholds convert predicted probabilities or scores into actions: predict positive if $\hat{p}(y=1 \mid x) \ge t$. The optimal threshold is rarely 0.5; it depends on base rate, intervention capacity, user harm, false-positive cost, false-negative cost, and downstream action quality.
Confusion matrix metrics are threshold-dependent: precision $= TP/(TP+FP)$, recall $= TP/(TP+FN)$, FPR $= FP/(FP+TN)$, and FNR $= FN/(FN+TP)$. In imbalanced problems, accuracy is usually misleading because predicting the majority class can look excellent.
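Expected cost minimization gives a principled threshold when costs are known. Let the false positive cost be $C_{FP}$ and the false negative cost be $C_{FN}$, and assume correct decisions cost nothing. For a case with calibrated probability $p$, predicting positive has expected cost $C_{FP}(1-p)$ and predicting negative has expected cost $C_{FN}p$. Classify positive when $C_{FP}(1-p) < C_{FN}p$, which implies $t^* = \frac{C_{FP}}{C_{FP}+C_{FN}}$. This assumes calibrated probabilities and equal action feasibility across users.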
Precision-recall curves are often more informative than ROC-AUC under severe imbalance. ROC-AUC can look high when negatives dominate because FPR divides by the large negative count, so even many false positives move it only slightly. PR-AUC and precision at operational recall reveal whether positive predictions are usable for real intervention queues.
Calibration means predicted probabilities match empirical frequencies: among users scored 0.8, about 80% should be positive. Diagnose miscalibration with reliability plots, Brier score, or expected calibration error; correct it with Platt scaling or isotonic regression fit on held-out data. Threshold policies based on costs require calibrated probabilities; ranking-only models need not be perfectly calibrated.
Class imbalance handling should be separated into training, evaluation, and decisioning. Techniques include class weights, focal loss, downsampling negatives, oversampling positives, SMOTE, balanced bagging, and threshold tuning. Resampling or reweighting during training shifts predicted probabilities away from the true base rate, so recalibrate before applying a cost-based threshold. Always evaluate on the natural population distribution unless the production population is intentionally reweighted.
Bagging versus boosting differs under imbalance. Bagging methods like RandomForest reduce variance and can use balanced bootstrap samples; boosting methods like XGBoost, LightGBM, or CatBoost often perform well with scale_pos_weight, custom loss, and careful early stopping. Boosting can over-focus on mislabeled rare positives if labels are noisy.
Time-aware validation is critical for fraud and churn. Random splits leak future behavior into training and overstate performance. Prefer train-on-past, validate-on-future splits, rolling windows, cohort-based evaluation, and monitoring performance by event time because fraud patterns, seasonality, and user lifecycle dynamics drift.
Operational constraints often define the threshold. If a human review team can handle 10,000 cases/day, choose the threshold that yields the top 10,000 highest-risk cases and report expected precision/recall at that capacity. If user friction is costly, enforce a maximum FPR or maximum number of false blocks.
Segment-level thresholds may be better than one global threshold when base rates and costs differ by country, device type, payment method, creator cohort, tenure, or traffic source. But segment thresholds can create fairness and stability risks, so compare segment-level FPR, FNR, calibration, and sample sizes before deployment.
Business metrics should connect model performance to outcomes. For churn, a false positive may waste an incentive; a false negative may lose a high-LTV user. For fraud, a false positive may block a legitimate transaction; a false negative may cause chargeback loss. Translate TP, FP, FN into expected dollars, user trust, moderation load, or retention lift.
Thresholds decay over time as base rates and score distributions shift. Monitor population score distribution, positive rate, precision on delayed labels, calibration drift, and segment-level lift. A fixed threshold can silently become too aggressive or too lenient after product changes, adversarial adaptation, or seasonal behavior shifts.
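
The cost-based and capacity-based threshold rules described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production policy: the function names, the cost values, and the scores are all hypothetical, and the cost rule assumes calibrated probabilities and zero cost for correct decisions.

```python
def cost_optimal_threshold(c_fp: float, c_fn: float) -> float:
    """Return t* = C_FP / (C_FP + C_FN), valid for calibrated probabilities."""
    if c_fp < 0 or c_fn < 0 or c_fp + c_fn == 0:
        raise ValueError("costs must be non-negative and not both zero")
    return c_fp / (c_fp + c_fn)


def capacity_threshold(scores: list[float], capacity: int) -> float:
    """Return the score of the capacity-th highest case; flag scores >= it."""
    if capacity <= 0 or not scores:
        raise ValueError("need a positive capacity and at least one score")
    ranked = sorted(scores, reverse=True)
    k = min(capacity, len(ranked))
    return ranked[k - 1]


# Hypothetical costs: a false block costs 1 unit, a missed fraud costs 19 units.
t_cost = cost_optimal_threshold(c_fp=1.0, c_fn=19.0)

# Hypothetical scores and a review team that can handle 3 cases.
scores = [0.02, 0.91, 0.40, 0.07, 0.66, 0.15]
t_cap = capacity_threshold(scores, capacity=3)
flagged = [s for s in scores if s >= t_cap]

print("cost-based threshold:", t_cost)
print("capacity-based threshold:", t_cap, "flagged:", flagged)
```

When several cases tie at the capacity cutoff, flagging every score at or above it can exceed the queue size, so a real system would need a tie-breaking rule.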

Worked example

For “How do you choose a classification threshold?”, a strong candidate would first ask: what action follows a positive prediction, what are the costs of false positives and false negatives, are probabilities calibrated, and is there a capacity constraint? They would state that threshold choice is a decision problem, not just a modeling problem, and that 0.5 is only reasonable under symmetric costs and calibrated probabilities. The answer can be organized around four pillars: define the objective, evaluate threshold-dependent metrics, incorporate business or operational constraints, and validate stability over time and segments.

A good skeleton answer might start with a validation set representing the real production distribution, then sweep thresholds and compute precision, recall, FPR, FNR, expected cost, and action volume at each one, with PR-AUC as a threshold-independent summary of ranking quality. If costs are known, choose the threshold minimizing expected cost; if capacity is fixed, choose the threshold that fills the intervention queue while maximizing expected value. The candidate should explicitly flag calibration: if scores are uncalibrated, they can still rank users, but cost-based probability cutoffs are not trustworthy. One tradeoff to call out is whether to prioritize high precision to avoid harming legitimate users or high recall to catch more risky cases. A strong close would be: “If I had more time, I’d test threshold stability by week and segment, evaluate delayed labels, and run an online experiment to measure whether the intervention actually improves the downstream product metric.”

A second angle

For “Compare bagging vs boosting on imbalanced data,” the same concept appears through model selection rather than final decisioning. The candidate should explain that imbalance affects the learner’s loss function, the evaluation metric, and the threshold, so comparing RandomForest against XGBoost using only accuracy would be weak. Bagging with balanced samples can be robust and less sensitive to label noise, while boosting often achieves stronger ranking performance but may overfit rare noisy positives without regularization, time-aware validation, and class weighting. The threshold still must be tuned after model training because a boosted model with better ROC-AUC may not produce better precision at the actual review capacity. The framing shifts from “which cutoff should I use?” to “which model gives the best calibrated or ranked scores for the operating region I care about?”
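
To see why a better ROC-AUC does not guarantee better precision at review capacity, consider the toy comparison below. The two score lists are invented for illustration and do not come from real bagging or boosting models; `roc_auc` and `precision_at_k` are small hypothetical helpers written for this sketch.

```python
def roc_auc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def precision_at_k(y_true, scores, k):
    """Precision among the k highest-scored cases."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    return sum(y for _, y in ranked[:k]) / k


# Three positives followed by seven negatives (hypothetical labels).
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

# Model A: two positives at the very top, one positive ranked near the bottom.
scores_a = [0.95, 0.90, 0.15, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.10]
# Model B: all positives ranked high, but one negative outranks them all.
scores_b = [0.90, 0.85, 0.80, 0.95, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20]

auc_a, auc_b = roc_auc(y_true, scores_a), roc_auc(y_true, scores_b)
p2_a = precision_at_k(y_true, scores_a, k=2)
p2_b = precision_at_k(y_true, scores_b, k=2)

print(f"A: AUC={auc_a:.3f}, precision@2={p2_a:.2f}")
print(f"B: AUC={auc_b:.3f}, precision@2={p2_b:.2f}")
```

In this toy case model B wins on ROC-AUC while model A wins on precision in a two-case queue, which is the operating region that matters if capacity is two.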

Common pitfalls

Pitfall: Optimizing for accuracy on a rare-event problem.

If fraud rate is 0.2%, a model that predicts “not fraud” for everyone gets 99.8% accuracy and is useless. A stronger answer says to evaluate PR-AUC, precision at top K, recall at fixed FPR, cost-weighted loss, and expected business impact under the real base rate.
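
The arithmetic behind that claim is easy to check. The sketch below builds a synthetic population of 100,000 cases at a 0.2% fraud rate (the population size is an arbitrary assumption) and scores the always-negative baseline.

```python
n_total = 100_000
n_fraud = 200                       # 0.2% base rate
y_true = [1] * n_fraud + [0] * (n_total - n_fraud)
y_pred = [0] * n_total              # always predict "not fraud"

correct = sum(1 for y, p in zip(y_true, y_pred) if y == p)
caught = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)

accuracy = correct / n_total
recall = caught / n_fraud

print(f"accuracy={accuracy:.3f}, recall={recall:.3f}")
# prints: accuracy=0.998, recall=0.000
```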

Pitfall: Treating model score, probability, and decision as the same thing.

Many candidates say “I’ll use 0.5” or “I’ll pick the highest F1” without explaining whether scores are calibrated or what action follows. Better: separate ranking quality, probability calibration, and threshold/action policy; then choose the threshold using costs, constraints, or experimental evidence.

Pitfall: Ignoring time and delayed labels.

Fraud, churn, and retention labels often arrive late, and behavior changes over time. A random train/test split can leak future information and inflate performance; a stronger candidate uses forward-looking validation and cohort cuts, accounts for delayed labels, and monitors performance drift by week and segment.
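
A forward-looking split with a label-delay gap can be sketched as follows. The row layout (dicts with an `event_date` key), the cutoff date, and the 7-day label delay are all assumptions made for illustration; `forward_split` is a hypothetical helper.

```python
from datetime import date, timedelta


def forward_split(rows, cutoff, label_delay_days=0):
    """Train on past events whose labels matured before the cutoff;
    validate on events at or after the cutoff."""
    delay = timedelta(days=label_delay_days)
    train = [r for r in rows if r["event_date"] + delay < cutoff]
    valid = [r for r in rows if r["event_date"] >= cutoff]
    return train, valid


# Hypothetical events; labels are assumed to arrive 7 days after the event.
rows = [
    {"id": 1, "event_date": date(2024, 1, 5)},
    {"id": 2, "event_date": date(2024, 1, 20)},
    {"id": 3, "event_date": date(2024, 1, 28)},  # label not mature by cutoff
    {"id": 4, "event_date": date(2024, 2, 3)},
    {"id": 5, "event_date": date(2024, 2, 10)},
]

train, valid = forward_split(rows, cutoff=date(2024, 2, 1), label_delay_days=7)
train_ids = [r["id"] for r in train]
valid_ids = [r["id"] for r in valid]

print("train:", train_ids, "valid:", valid_ids)
```

Events that fall in the gap are dropped from training rather than included with unknown labels, which trades some data for a more honest estimate of future performance.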

Connections

Interviewers may pivot from thresholding into model calibration, uplift modeling, causal measurement of interventions, ranking metrics, or online experiment design. For TikTok-style products, expect follow-ups on segment analysis, fairness of false positives, delayed feedback loops, and how offline model metrics connect to online product metrics like retention, trust, moderation quality, or revenue.
