How do you choose a classification threshold?
Company: TikTok
Role: Data Scientist
Category: Machine Learning
Difficulty: easy
Interview Round: Technical Screen
## Context
You built a **binary sentiment classification** model (e.g., positive vs. negative) and need to deploy it in a product where actions depend on the model’s output.
## Questions
1. **Walk through your ML pipeline** end-to-end:
- Data sourcing/labeling and dataset construction (train/validation/test splits).
- Feature design or model choice (e.g., TF-IDF + linear model vs. transformer).
- Training procedure and evaluation setup.
- Key practical challenges (class imbalance, noisy labels, distribution shift) and how you handled them.
2. **Modeling choices:**
- Why did you choose method/model **X** over alternatives?
- What assumptions does it make, and what trade-offs does it introduce (latency, interpretability, cost, robustness)?
3. **Iterative refinement:**
- Describe how you **iteratively improved** the system (e.g., error analysis → new features/data → retrain → re-evaluate).
- What were your biggest learnings and what would you do differently next time?
4. **Threshold selection for deployment:**
- Your model outputs a probability score. How do you choose the **decision threshold**?
- Which metrics would you consider (precision, recall, F1, ROC-AUC, PR-AUC, cost-weighted loss), and which would be **primary vs. diagnostic vs. guardrail**?
- How would the answer change under:
- Severe class imbalance
- Different costs for false positives vs. false negatives
- A fixed review/ops capacity (e.g., only 1,000 items/day can be escalated)
5. **Metric definition:**
- If a stakeholder proposes defining “success” as metric **XXX**, how do you evaluate whether that definition is appropriate?
- What data issues (label leakage, delayed labels, sampling bias) could make the metric misleading?
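The pipeline in question 1 can be sketched end-to-end. A minimal version with scikit-learn, assuming a TF-IDF + logistic regression baseline; the texts and labels are placeholders, not real data:

```python
# Minimal sentiment-classification pipeline sketch (illustrative data only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

texts = ["love this video", "great sound", "awful clip", "boring and bad",
         "amazing edit", "terrible audio", "really fun", "worst ever"]
labels = [1, 1, 0, 0, 1, 0, 1, 0]  # 1 = positive, 0 = negative

# Stratified split preserves the class ratio in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0)

clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    # class_weight="balanced" is one simple mitigation for class imbalance.
    ("lr", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
clf.fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]  # P(positive) per test example
```

Swapping the TF-IDF stage for transformer embeddings changes the latency/cost trade-off (question 2) but leaves the evaluation and thresholding logic unchanged.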
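For question 4, two threshold-selection strategies can be sketched directly on validation scores: (a) sweep candidate thresholds and minimize a cost-weighted loss under asymmetric error costs, and (b) back out the threshold implied by a fixed review capacity. The labels, scores, and the 5:1 cost ratio below are illustrative assumptions:

```python
import numpy as np

# Synthetic validation set: labels and model scores (illustrative only).
rng = np.random.default_rng(0)
y_val = rng.integers(0, 2, size=5000)
scores = np.clip(y_val * 0.3 + rng.random(5000) * 0.7, 0.0, 1.0)

# Assumed cost structure: a false negative costs 5x a false positive.
C_FP, C_FN = 1.0, 5.0

def expected_cost(threshold):
    pred = scores >= threshold
    fp = np.sum(pred & (y_val == 0))   # false positives at this threshold
    fn = np.sum(~pred & (y_val == 1))  # false negatives at this threshold
    return C_FP * fp + C_FN * fn

# (a) Cost-weighted selection: pick the cost-minimizing threshold on a grid.
grid = np.linspace(0.05, 0.95, 19)
best_t = min(grid, key=expected_cost)

# (b) Capacity-based selection: escalate only the top-K scoring items/day.
K = 1000
capacity_t = np.sort(scores)[-K]  # items scoring >= this fit in the budget
```

Because false negatives are costlier here, the cost-minimizing threshold lands lower than 0.5; under a hard capacity constraint, the "threshold" is really just the score of the K-th ranked item.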
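For question 5, one concrete way sampling bias makes a metric misleading: precision depends on class prevalence, so a precision target set on a balanced evaluation sample can collapse at production prevalence even with identical model behavior. The TPR/FPR values below are illustrative:

```python
# Precision as a function of prevalence, holding the model's TPR/FPR fixed.
def precision(tpr, fpr, prevalence):
    tp = tpr * prevalence            # true-positive mass
    fp = fpr * (1.0 - prevalence)    # false-positive mass
    return tp / (tp + fp)

balanced = precision(tpr=0.9, fpr=0.05, prevalence=0.5)   # ~0.947
deployed = precision(tpr=0.9, fpr=0.05, prevalence=0.01)  # ~0.154
```

The same model drops from ~95% to ~15% precision purely because positives are rarer in deployment, which is also why PR-AUC is more informative than ROC-AUC under severe imbalance.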
## Quick Answer
This question tests a data scientist's command of end-to-end ML pipeline design, model evaluation, and threshold selection for a binary classifier. Strong answers tie the metric choice (precision, recall, F1, ROC-AUC, PR-AUC, cost-weighted loss) to operational realities: class imbalance, asymmetric costs of false positives versus false negatives, and limited escalation capacity.