How would you manage precision/recall for fraud detection?
Company: TikTok
Role: Data Scientist
Category: Machine Learning
Difficulty: easy
Interview Round: Technical Screen
## Scenario
You own (or significantly contribute to) a production **fraud detection** system that flags transactions/users as *fraud* vs *legit*.
- The model outputs a fraud probability score \(p(\text{fraud})\).
- A decision threshold determines whether to **block**, **step-up verify**, or **send to manual review**.
- Labels may be delayed (chargebacks) and the data is **highly imbalanced**.
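The score-to-action layer described above can be sketched as a simple tiered threshold function. The threshold values and action names here are illustrative assumptions, not values from the scenario:

```python
# Hedged sketch of a tiered decision layer over the model's fraud score.
# All threshold values are hypothetical placeholders for illustration.

BLOCK_T = 0.95    # at or above: block the transaction outright (assumed value)
STEP_UP_T = 0.80  # at or above: require step-up verification (assumed value)
REVIEW_T = 0.60   # at or above: route to manual review queue (assumed value)

def decide(p_fraud: float) -> str:
    """Map a fraud probability score p(fraud) to an operational action."""
    if p_fraud >= BLOCK_T:
        return "block"
    if p_fraud >= STEP_UP_T:
        return "step_up_verify"
    if p_fraud >= REVIEW_T:
        return "manual_review"
    return "allow"
```

In practice each tier carries a different cost profile: a block is a hard false positive if the user is legitimate, while step-up verification and manual review degrade UX less but add friction and ops load, which is why the thresholds are usually tuned separately rather than collapsed into one cutoff.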
## Questions
1. **Precision/Recall management:** What concrete methods have you used (or would you use) to **measure, manage, and optimize precision and recall** in a real fraud system?
2. **False positives:** How would you diagnose and reduce **false positives** (legit users being flagged) without letting fraud through?
3. **Sudden fraud spike:** If you suddenly observe **many more fraud cases** than usual, what changes would you make (model, thresholding, monitoring, operations), and how would you validate them quickly?
4. **Specific fraud pattern:** If fraud exhibits a **very specific pattern** (e.g., a new attack vector with clear signatures), what would you do—rules, model features, segmentation, retraining—and how would you prevent overfitting to a short-lived pattern?
Please be explicit about:
- The **primary metric** vs **diagnostic metrics** vs **guardrails** you would use.
- How you handle **cost asymmetry** (FP vs FN), **label delay**, and **distribution shift/adversarial adaptation**.
- The trade-off between **product/user experience** and **fraud loss**.
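One concrete way to make the cost-asymmetry point is a threshold sweep: compute precision/recall at each candidate threshold and pick the one minimizing an asymmetric expected cost. A minimal sketch, where the cost weights and toy data are assumptions for illustration only:

```python
# Hedged sketch: sweep candidate thresholds, compute precision/recall,
# and choose the threshold minimizing an asymmetric FP/FN cost.
# Cost weights (fn_cost >> fp_cost) are illustrative assumptions.

def pr_at_threshold(scores, labels, t):
    """Precision, recall, FP count, FN count when flagging scores >= t."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, fp, fn

def best_threshold(scores, labels, fp_cost=1.0, fn_cost=10.0):
    """Pick the score cutoff with the lowest total asymmetric cost."""
    def cost(t):
        _, _, fp, fn = pr_at_threshold(scores, labels, t)
        return fp_cost * fp + fn_cost * fn
    return min(sorted(set(scores)), key=cost)
```

A strong answer notes that the cost weights themselves come from the business (fraud loss per missed case vs. churn/friction per false flag), and that with delayed labels the sweep must be run on a matured label window rather than on fresh, incompletely labeled traffic.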
Quick Answer: This question evaluates a candidate's competency in applied machine learning for fraud detection: measuring model performance, setting thresholds, monitoring, and making operational decisions in a production setting. It is categorized under Machine Learning with a fraud-detection domain focus and tests both conceptual understanding and practical application for a data scientist role. Interviewers commonly ask it to probe whether the candidate can select and balance a primary metric against diagnostic metrics and operational guardrails, reason about cost asymmetry, label delay, and distribution shift, and weigh the trade-off between product/user experience and fraud loss.