Design and critique an abuse-detection ML system
Company: Google
Role: Data Scientist
Category: Machine Learning
Difficulty: hard
Interview Round: Onsite
Describe, in depth, how you would design an ML system to identify and triage abusive content uploads in a Trust & Safety context where only 0.2% of items are true violations and labels arrive with a median delay of 36 hours.
Cover the following, with concrete choices and trade-offs:
1) Problem framing: binary classification vs risk scoring; objective appropriate for extreme class imbalance (e.g., optimize PR-AUC or cost-weighted utility). Define the positive label precisely given noisy moderation decisions (e.g., prioritize the latest decision within 7 days; handle conflicting labels); a label-resolution sketch follows this list.
2) Data and features: architecture for near-real-time features (user history aggregates, text/image embeddings, graph signals), leakage audits, and privacy constraints (minimize PII retention; differential privacy or k-anonymity where needed); a streaming-aggregate sketch follows this list.
3) Training: sampling/weighting strategy (e.g., class weights, focal loss, hard negative mining), handling delayed labels (label lag queues, exclusion windows), and calibration (isotonic/Platt; a calibration sketch follows this list). Specify a validation split that respects time order and prevents user-level leakage.
4) Thresholding under review budget: given 2,000,000 daily items and a human review budget of 10,000/day, describe exactly how you would pick a threshold t on calibrated scores to maximize expected true violations sent to review subject to the budget. Include the computation using score quantiles from a recent labeled window (a worked sketch follows this list) and how you would re-tune t as demand drifts.
5) Online evaluation: guardrail metrics (latency, false positive rate on high-trust creators, geographic fairness), interleaving/canary design, and how you would measure lift vs baseline heuristics with delayed ground truth.
6) Robustness and drift: detection of covariate/label drift (a PSI sketch follows this list), adversarial adaptation signals, periodic re-training policy, and fail-safe degradations during P0 incidents.
7) Ethics and fairness: define and monitor group-specific error rates; propose mitigation (re-weighting, post-hoc calibration) if disparities exceed thresholds (a group-FPR sketch follows this list). Explain escalation when utility vs fairness trade-offs conflict.
8) Post-launch monitoring: dashboards, alert thresholds, and a rollback plan tied to concrete SLAs.
For each section, provide at least one specific metric and a crisp decision rule you would actually use in production.
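Label-resolution sketch for item 1: a minimal Python/pandas illustration of the "latest decision within 7 days" rule. The schema (item_id, upload_at, decided_at, is_violation) is assumed for illustration, not taken from any specific pipeline.

    import pandas as pd

    def resolve_labels(decisions: pd.DataFrame) -> pd.DataFrame:
        # Assumed columns: item_id, upload_at, decided_at (timestamps),
        # is_violation (bool). Keep only decisions made within 7 days of
        # upload, then take the latest decision per item so that later
        # moderator reversals override earlier, conflicting ones.
        in_window = decisions[
            decisions["decided_at"] <= decisions["upload_at"] + pd.Timedelta(days=7)
        ]
        latest = in_window.sort_values("decided_at").groupby("item_id").tail(1)
        return latest[["item_id", "is_violation"]]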
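Feature sketch for item 2: one way (among many) to keep a near-real-time user-history aggregate cheap is an exponentially decayed violation count updated per event. The 72-hour half-life is an assumed tuning choice.

    HALF_LIFE_HOURS = 72.0  # assumed decay horizon, tuned offline

    def decayed_violation_count(prev_count: float, hours_since_update: float,
                                new_violation: bool) -> float:
        # Decay the stored count by elapsed time, then increment; O(1) per
        # event and maintainable in a key-value store keyed by user_id.
        decay = 0.5 ** (hours_since_update / HALF_LIFE_HOURS)
        return prev_count * decay + (1.0 if new_violation else 0.0)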
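Calibration sketch for item 3, assuming scikit-learn is available: isotonic regression fit on a time-ordered holdout whose 7-day label windows have fully closed, so delayed labels cannot leak into calibration.

    import numpy as np
    from sklearn.isotonic import IsotonicRegression

    def fit_calibrator(holdout_scores: np.ndarray,
                       holdout_labels: np.ndarray) -> IsotonicRegression:
        # Monotone map from raw scores to calibrated violation probabilities;
        # out_of_bounds="clip" keeps future scores inside the fitted range.
        iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
        iso.fit(holdout_scores, holdout_labels)
        return iso

Calibrated probabilities are what make the budgeted thresholding in item 4 meaningful, since expected-violation counts are then just sums of scores.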
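Thresholding sketch for item 4. With N = 2,000,000 daily items and a budget B = 10,000 reviews, sending exactly the top B items corresponds to the (1 - B/N) = 0.995 quantile of the calibrated score distribution over a recent window; re-running this daily on a sliding window is the re-tuning loop.

    import numpy as np

    DAILY_ITEMS = 2_000_000
    REVIEW_BUDGET = 10_000

    def pick_threshold(recent_scores: np.ndarray) -> float:
        # recent_scores: calibrated scores from a recent labeled window,
        # assumed representative of tomorrow's traffic. Items scoring >= t
        # go to review; the (1 - B/N) quantile admits B items in expectation.
        q = 1.0 - REVIEW_BUDGET / DAILY_ITEMS  # 0.995 here
        return float(np.quantile(recent_scores, q))

Because ranking by calibrated probability maximizes expected true violations under a fixed review count, the quantile rule is budget-optimal; as volume drifts, the quantile (not the stored t) is what stays fixed.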
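Drift sketch for item 6: Population Stability Index on score deciles is a standard covariate-drift signal. The 0.2 alert level is a common rule of thumb, assumed here rather than prescribed.

    import numpy as np

    def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
        # Bin edges from reference-score quantiles; extremes widened to
        # cover out-of-range current scores. Smoothing avoids log(0) on
        # empty bins.
        edges = np.quantile(reference, np.linspace(0.0, 1.0, bins + 1))
        edges[0], edges[-1] = -np.inf, np.inf
        ref = np.histogram(reference, edges)[0] / len(reference) + 1e-6
        cur = np.histogram(current, edges)[0] / len(current) + 1e-6
        return float(np.sum((cur - ref) * np.log(cur / ref)))

    # Example decision rule: alert and trigger a re-training review if
    # psi(last_week_scores, today_scores) > 0.2.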
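Fairness sketch for item 7: group-level false positive rate among resolved negatives, with an assumed 2-percentage-point gap as the alerting threshold; column names are illustrative.

    import pandas as pd

    FPR_GAP_LIMIT = 0.02  # assumed maximum tolerated FPR gap across groups

    def group_fpr(df: pd.DataFrame) -> pd.Series:
        # Assumed columns: group (e.g., region), flagged (bool, sent to
        # review), is_violation (bool, resolved label). FPR per group is
        # the flag rate among true negatives.
        negatives = df[~df["is_violation"]]
        return negatives.groupby("group")["flagged"].mean()

    def fairness_alert(df: pd.DataFrame) -> bool:
        rates = group_fpr(df)
        return bool(rates.max() - rates.min() > FPR_GAP_LIMIT)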
Quick Answer: This question evaluates system-design and production ML competencies: large-scale classification versus risk scoring, extreme class imbalance and delayed labels, calibration and thresholding under a fixed human-review budget, near-real-time feature engineering, robustness and drift detection, and privacy and fairness trade-offs. It is commonly asked of Data Scientist candidates in the Machine Learning (Trust & Safety) domain to test whether they can balance statistical objectives, operational constraints, and ethical considerations, at a level of abstraction spanning both conceptual understanding and practical application in production systems.