This question evaluates system-design and production machine learning competencies: large-scale classification versus risk scoring, extreme class imbalance with delayed labels, calibration and thresholding under a fixed human-review budget, near-real-time feature engineering, robustness and drift detection, and privacy and fairness trade-offs. It is commonly asked in the Machine Learning domain for Data Scientist roles to test a candidate's ability to balance statistical objectives, operational constraints, and ethical considerations. The category tested is Machine Learning (Trust & Safety), and the level of abstraction spans both conceptual understanding and practical application in production systems.
Context: You are designing an ML system to identify and triage abusive content uploads in a Trust & Safety context. Only 0.2% of items are true violations, and human labels arrive with a median delay of 36 hours. The system must operate near real-time and route a fixed number of items to human review daily.
Cover the following, with concrete choices and trade-offs:
For each section, provide at least one specific metric and a crisp decision rule you would actually use in production.
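As a concrete illustration of "thresholding under a fixed human-review budget," a common decision rule is to skip a fixed score threshold entirely and instead route the top-K items by model score each day, then evaluate precision-at-budget once the delayed labels arrive. The sketch below uses hypothetical helper names (`route_to_review`, `precision_at_budget`) and synthetic data at the 0.2% base rate from the prompt; it is an assumption about one reasonable answer, not the expected solution.

```python
import numpy as np

def route_to_review(scores: np.ndarray, budget: int) -> np.ndarray:
    """Return indices of the `budget` highest-scoring items.

    This implements a rank-based decision rule: rather than picking a
    probability cutoff, we always fill the fixed daily review capacity.
    """
    order = np.argsort(scores)[::-1]  # descending by score
    return order[:budget]

def precision_at_budget(scores: np.ndarray, labels: np.ndarray,
                        budget: int) -> float:
    """Fraction of routed items that turn out to be true violations,
    computable only after the (median 36h) label delay."""
    routed = route_to_review(scores, budget)
    return float(labels[routed].mean())

# Synthetic day of traffic: 1000 items, ~0.2% true violations, and a
# model whose scores separate violations from benign items.
rng = np.random.default_rng(0)
labels = (rng.random(1000) < 0.002).astype(int)
scores = labels * 0.5 + rng.random(1000) * 0.5

print(precision_at_budget(scores, labels, budget=50))
```

With the 0.2% base rate, precision-at-budget is the operationally meaningful metric here: a budget of 50 reviews per 1000 items caps achievable precision at roughly (number of true violations) / 50, so the decision rule is judged by how many of the scarce violations it packs into the fixed queue.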