Classification Thresholds, Imbalanced Learning And Risk
Asked of: Data Scientist

What's being tested
Interviewers are probing whether you can turn a probabilistic classifier into a risk-aware decision system, especially when positives are rare and mistakes have asymmetric costs. For a Data Scientist at TikTok, this matters in settings like fraud, spam, policy-risk detection, creator churn, user retention, and ads quality, where a model score is not the same thing as an action. Strong answers show fluency in class imbalance, threshold selection, cost-sensitive evaluation, calibration, and time-aware validation. The interviewer is not just asking “which model performs best,” but whether you can define the right objective, diagnose failure modes, and recommend a threshold that matches product constraints.
Core knowledge
- Classification thresholds convert predicted probabilities or scores into actions: predict positive if $p(x) \ge t$, where $p(x)$ is the predicted probability and $t$ is the threshold. The optimal threshold is rarely 0.5; it depends on base rate, intervention capacity, user harm, false-positive cost, false-negative cost, and downstream action quality.
- Confusion matrix metrics are threshold-dependent: precision, recall, FPR, and FNR. In imbalanced problems, accuracy is usually misleading because predicting the majority class can look excellent.
- Expected cost minimization gives a principled threshold when costs are known. If the false-positive cost is $C_{FP}$ and the false-negative cost is $C_{FN}$, classify positive when $p \cdot C_{FN} > (1 - p) \cdot C_{FP}$, implying a threshold of $t^* = C_{FP} / (C_{FP} + C_{FN})$. This assumes calibrated probabilities, no cost for correct decisions, and equal action feasibility across users.
- Precision-recall curves are often more informative than ROC-AUC under severe imbalance. ROC-AUC can look high when negatives dominate, while PR-AUC and precision at operational recall reveal whether positive predictions are usable for real intervention queues.
- Calibration means predicted probabilities match empirical frequencies: among users scored 0.8, about 80% should be positive. Use reliability plots, Brier score, expected calibration error, Platt scaling, or isotonic regression. Threshold policies based on costs require calibrated probabilities; ranking-only models need not be perfectly calibrated.
- Class imbalance handling should be separated into training, evaluation, and decisioning. Techniques include class weights, focal loss, downsampling negatives, oversampling positives, SMOTE, balanced bagging, and threshold tuning. Always evaluate on the natural population distribution unless the production population is intentionally reweighted.
- Bagging and boosting behave differently under imbalance. Bagging methods like RandomForest reduce variance and can use balanced bootstrap samples; boosting methods like XGBoost, LightGBM, or CatBoost often perform well with scale_pos_weight, custom loss, and careful early stopping. Boosting can over-focus on mislabeled rare positives if labels are noisy.
- Time-aware validation is critical for fraud and churn. Random splits leak future behavior into training and overstate performance. Prefer train-on-past, validate-on-future splits, rolling windows, cohort-based evaluation, and monitoring performance by event time because fraud patterns, seasonality, and user lifecycle dynamics drift.
- Operational constraints often define the threshold. If a human review team can handle 10,000 cases/day, choose the threshold that yields the top 10,000 highest-risk cases and report expected precision/recall at that capacity. If user friction is costly, enforce a maximum FPR or maximum number of false blocks.
- Segment-level thresholds may be better than one global threshold when base rates and costs differ by country, device type, payment method, creator cohort, tenure, or traffic source. But segment thresholds can create fairness and stability risks, so compare segment-level FPR, FNR, calibration, and sample sizes before deployment.
- Business metrics should connect model performance to outcomes. For churn, a false positive may waste an incentive; a false negative may lose a high-LTV user. For fraud, a false positive may block a legitimate transaction; a false negative may cause chargeback loss. Translate TP, FP, and FN into expected dollars, user trust, moderation load, or retention lift.
- Thresholds decay over time as base rates and score distributions shift. Monitor population score distribution, positive rate, precision on delayed labels, calibration drift, and segment-level lift. A fixed threshold can silently become too aggressive or too lenient after product changes, adversarial adaptation, or seasonal behavior shifts.
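The expected-cost rule from the list above can be derived in a few lines. This is a sketch under stated assumptions: $p$ is a calibrated probability that the case is positive, correct decisions cost nothing, and $C_{FP}$ and $C_{FN}$ (symbols used here for the false-positive and false-negative costs) are the same for every user.

```latex
\begin{align*}
\mathbb{E}[\text{cost} \mid \text{predict positive}] &= (1 - p)\, C_{FP} \\
\mathbb{E}[\text{cost} \mid \text{predict negative}] &= p\, C_{FN} \\
\text{predict positive} &\iff p\, C_{FN} > (1 - p)\, C_{FP} \\
&\iff p > t^{*} = \frac{C_{FP}}{C_{FP} + C_{FN}}
\end{align*}
```

With symmetric costs this gives $t^{*} = 0.5$; if a missed positive is nine times as costly as a false alarm ($C_{FP} = 1$, $C_{FN} = 9$), it gives $t^{*} = 0.1$.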
Worked example
For “How do you choose a classification threshold?”, a strong candidate would first ask: what action follows a positive prediction, what are the costs of false positives and false negatives, are probabilities calibrated, and is there a capacity constraint? They would state that threshold choice is a decision problem, not just a modeling problem, and that 0.5 is only reasonable under symmetric costs and calibrated probabilities. The answer can be organized around four pillars: define the objective, evaluate threshold-dependent metrics, incorporate business or operational constraints, and validate stability over time and segments.
A good skeleton answer might start with a validation set that represents the real production distribution, then sweep thresholds and compute precision, recall, FPR, FNR, expected cost, and action volume at each one, reporting PR-AUC separately as a threshold-free summary. If costs are known, choose the threshold minimizing expected cost; if capacity is fixed, choose the threshold that fills the intervention queue while maximizing expected value. The candidate should explicitly flag calibration: if scores are uncalibrated, they can still rank users, but cost-based probability cutoffs are not trustworthy. One tradeoff to call out is whether to prioritize high precision to avoid harming legitimate users or high recall to catch more risky cases. A strong close would be: “If I had more time, I’d test threshold stability by week and segment, evaluate delayed labels, and run an online experiment to measure whether the intervention actually improves the downstream product metric.”
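The threshold sweep described here can be sketched in plain Python. This is a minimal illustration, not a production implementation: the function names, the toy labels and scores, and the cost values (1 per false positive, 5 per false negative) are all assumptions made up for the example.

```python
def sweep_thresholds(y_true, scores, cost_fp, cost_fn, thresholds):
    """Return one dict of metrics per threshold on a labeled validation set."""
    rows = []
    for t in thresholds:
        tp = fp = fn = tn = 0
        for y, s in zip(y_true, scores):
            flagged = s >= t
            if flagged and y == 1:
                tp += 1
            elif flagged:
                fp += 1
            elif y == 1:
                fn += 1
            else:
                tn += 1
        rows.append({
            "threshold": t,
            "flagged": tp + fp,  # action volume at this threshold
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
            "fpr": fp / (fp + tn) if fp + tn else 0.0,
            "cost": fp * cost_fp + fn * cost_fn,  # total cost on this set
        })
    return rows


def pick_min_cost(rows):
    """Threshold with the lowest total cost."""
    return min(rows, key=lambda r: r["cost"])


def pick_for_capacity(rows, capacity):
    """Threshold that flags as many cases as possible without exceeding capacity."""
    feasible = [r for r in rows if r["flagged"] <= capacity]
    return max(feasible, key=lambda r: r["flagged"]) if feasible else None


# Toy validation data (hypothetical): 3 positives among 10 cases.
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.90]

rows = sweep_thresholds(y_true, scores, cost_fp=1, cost_fn=5,
                        thresholds=[0.1, 0.3, 0.5, 0.7])
best_by_cost = pick_min_cost(rows)
best_by_capacity = pick_for_capacity(rows, capacity=3)
```

On this toy data the two policies disagree: the cost-minimizing threshold flags six cases, while a review capacity of three forces a stricter cutoff. That gap is the kind of tradeoff worth stating out loud in an answer.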
A second angle
For “Compare bagging vs boosting on imbalanced data,” the same concept appears through model selection rather than final decisioning. The candidate should explain that imbalance affects the learner’s loss function, the evaluation metric, and the threshold, so comparing RandomForest against XGBoost using only accuracy would be weak. Bagging with balanced samples can be robust and less sensitive to label noise, while boosting often achieves stronger ranking performance but may overfit rare noisy positives without regularization, time-aware validation, and class weighting. The threshold still must be tuned after model training because a boosted model with better ROC-AUC may not produce better precision at the actual review capacity. The framing shifts from “which cutoff should I use?” to “which model gives the best calibrated or ranked scores for the operating region I care about?”
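The point that a higher ROC-AUC does not guarantee better precision at review capacity can be shown with a small sketch. The two score lists below are invented for illustration and are not outputs of RandomForest or XGBoost; `model_a_scores`, `model_b_scores`, and the capacity `k = 2` are hypothetical.

```python
def roc_auc(y_true, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))


def precision_at_k(y_true, scores, k):
    """Share of true positives among the k highest-scored cases."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sum(y_true[i] for i in order[:k]) / k


# Hypothetical validation labels: 3 positives, 7 negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
# Model A: two positives at the very top, one positive ranked last.
model_a_scores = [0.95, 0.90, 0.05, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20]
# Model B: all positives ranked high, but one negative outranks them.
model_b_scores = [0.85, 0.80, 0.75, 0.90, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]

k = 2  # assumed review capacity
auc_a, auc_b = roc_auc(y_true, model_a_scores), roc_auc(y_true, model_b_scores)
p_at_k_a = precision_at_k(y_true, model_a_scores, k)
p_at_k_b = precision_at_k(y_true, model_b_scores, k)
```

Here model B wins on ROC-AUC while model A wins on precision within the top two cases, so the comparison should be made in the operating region that the review queue actually uses.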
Common pitfalls
Pitfall: Optimizing for accuracy on a rare-event problem.
If fraud rate is 0.2%, a model that predicts “not fraud” for everyone gets 99.8% accuracy and is useless. A stronger answer says to evaluate PR-AUC, precision at top K, recall at fixed FPR, cost-weighted loss, and expected business impact under the real base rate.
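The arithmetic behind this pitfall is easy to verify. The sketch below assumes a hypothetical population of 100,000 transactions at the 0.2% fraud rate mentioned above and scores an "always predict not fraud" baseline.

```python
n_total = 100_000            # hypothetical population size
n_fraud = n_total * 2 // 1000  # 0.2% fraud rate, i.e. 200 fraud cases

# Baseline that predicts "not fraud" for every transaction.
true_positives = 0
true_negatives = n_total - n_fraud
false_negatives = n_fraud

accuracy = (true_positives + true_negatives) / n_total
recall = true_positives / (true_positives + false_negatives)
```

Accuracy comes out at 0.998 while recall is 0.0: the baseline never catches a single fraud case, which is why accuracy alone says nothing useful here.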
Pitfall: Treating model score, probability, and decision as the same thing.
Many candidates say “I’ll use 0.5” or “I’ll pick the highest F1” without explaining whether scores are calibrated or what action follows. Better: separate ranking quality, probability calibration, and threshold/action policy; then choose the threshold using costs, constraints, or experimental evidence.
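One way to check the calibration step separately from ranking is expected calibration error. The following is a minimal sketch using equal-width bins; the function name, the bin count, and the toy inputs are assumptions for illustration.

```python
def expected_calibration_error(y_true, probs, n_bins=10):
    """Weighted average gap between mean predicted probability and
    observed positive rate, over equal-width probability bins."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, probs):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((y, p))
    n = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_prob = sum(p for _, p in members) / len(members)
        pos_rate = sum(y for y, _ in members) / len(members)
        ece += (len(members) / n) * abs(mean_prob - pos_rate)
    return ece


# Well calibrated toy scores: 25% of the 0.25 group and 75% of the 0.75 group are positive.
ece_good = expected_calibration_error(
    [1, 0, 0, 0, 1, 1, 1, 0],
    [0.25, 0.25, 0.25, 0.25, 0.75, 0.75, 0.75, 0.75],
)
# Overconfident toy scores: predicted 0.9, but only 25% are positive.
ece_bad = expected_calibration_error([1, 0, 0, 0], [0.9, 0.9, 0.9, 0.9])
```

The overconfident scores still rank cases identically within their group, which illustrates why ranking quality and calibration have to be assessed separately before a cost-based cutoff is trusted.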
Pitfall: Ignoring time and delayed labels.
Fraud, churn, and retention labels often arrive late, and behavior changes over time. A random train/test split can leak future information and inflate performance; a stronger candidate uses forward-looking validation, cohort cuts, delayed-label awareness, and monitors performance drift by week and segment.
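Forward-looking validation with delayed labels can be sketched as below. This is an illustrative outline only: the function name, the `event_date` field, the cutoff dates, and the 3-day label delay are all hypothetical choices.

```python
from datetime import date, timedelta


def forward_splits(rows, first_cutoff, window_days, n_folds, label_delay_days=0):
    """Yield (train, valid) pairs where training data strictly precedes validation.

    Training rows are further restricted to events old enough that their
    labels would have arrived by the cutoff date.
    """
    for i in range(n_folds):
        cutoff = first_cutoff + timedelta(days=i * window_days)
        valid_end = cutoff + timedelta(days=window_days)
        train_end = cutoff - timedelta(days=label_delay_days)
        train = [r for r in rows if r["event_date"] < train_end]
        valid = [r for r in rows if cutoff <= r["event_date"] < valid_end]
        yield train, valid


# Toy data: one event per day for 28 days.
rows = [{"event_date": date(2024, 1, 1) + timedelta(days=d)} for d in range(28)]

folds = list(forward_splits(rows, first_cutoff=date(2024, 1, 15),
                            window_days=7, n_folds=2, label_delay_days=3))
```

Each fold trains on the past and validates on the following week, and the delay parameter leaves a gap so that events whose labels would not yet be known are excluded from training.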
Connections
Interviewers may pivot from thresholding into model calibration, uplift modeling, causal measurement of interventions, ranking metrics, or online experiment design. For TikTok-style products, expect follow-ups on segment analysis, fairness of false positives, delayed feedback loops, and how offline model metrics connect to online product metrics like retention, trust, moderation quality, or revenue.
Further reading
- “The Relationship Between Precision-Recall and ROC Curves” by Davis and Goadrich: foundational explanation of why PR curves are often better for imbalanced classification.
- “Probability Calibration” in The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman: useful background on probabilistic classifiers and calibration.
- “Learning from Imbalanced Data” by He and Garcia: classic survey covering sampling, cost-sensitive learning, and evaluation pitfalls.