
Design and critique an abuse-detection ML system

Last updated: Mar 29, 2026

Quick Overview

This question evaluates system-design and production machine learning competencies: large-scale classification versus risk scoring, handling extreme class imbalance and delayed labels, calibration and thresholding under a fixed human-review budget, near-real-time feature engineering, robustness and drift detection, and privacy and fairness trade-offs. It is commonly asked in the Machine Learning domain for Data Scientist roles to test a candidate's ability to balance statistical objectives, operational constraints, and ethical considerations. The category tested is Machine Learning (Trust & Safety), and the question spans both conceptual understanding and practical application in production systems.

  • hard
  • Google
  • Machine Learning
  • Data Scientist

Company: Google

Role: Data Scientist

Category: Machine Learning

Difficulty: hard

Interview Round: Onsite

Related Interview Questions

  • Explain ranking cold-start strategies - Google (medium)
  • Explain LLM fine-tuning and generative models - Google (medium)
  • Compare NLP tokenization and LLM recommendations - Google (medium)
  • Explain LLM lifecycle and trade-offs - Google (medium)
  • Build a bigram next-word predictor with weighted sampling - Google (medium)

ML System Design: Abusive Content Detection and Triage (Trust & Safety)

Context: You are designing an ML system to identify and triage abusive content uploads for a Trust & Safety team. Only 0.2% of items are true violations, and human labels arrive with a median delay of 36 hours. The system must operate in near real time and route a fixed number of items to human review each day.

Cover the following, with concrete choices and trade-offs:

  1. Problem Framing
  • Binary classification vs. risk scoring.
  • Objective appropriate for extreme class imbalance (e.g., PR-AUC or cost-weighted utility).
  • Define the positive label precisely given noisy moderation decisions (e.g., prioritize the latest decision within 7 days; handle conflicting labels). A label-resolution sketch follows this list.
  2. Data and Features
  • Architecture for near-real-time features (user history aggregates, text/image embeddings, graph signals).
  • Leakage audits.
  • Privacy constraints (minimize PII retention; differential privacy or k-anonymity where needed).
  3. Training
  • Sampling/weighting strategy (e.g., class weights, focal loss, hard negative mining).
  • Handling delayed labels (label lag queues, exclusion windows).
  • Calibration (isotonic/Platt); a calibration sketch follows this list.
  • Validation split that respects time and user leakage.
  4. Thresholding Under Review Budget
  • Given 2,000,000 daily items and a human review budget of 10,000/day, describe exactly how to pick a threshold t on calibrated scores to maximize expected true violations sent to review, subject to the budget.
  • Include the computation using score quantiles from a recent labeled window and how to re-tune t as demand drifts; a worked example follows this list.
  5. Online Evaluation
  • Guardrail metrics (latency, false positive rate on high-trust creators, geographic fairness).
  • Interleaving/canary design.
  • How to measure lift vs. baseline heuristics with delayed ground truth.
  6. Robustness and Drift
  • Detection of covariate/label drift and adversarial adaptation signals; a PSI-based drift sketch follows this list.
  • Periodic re-training policy.
  • Fail-safe degradations during P0 incidents.
  7. Ethics and Fairness
  • Define and monitor group-specific error rates; propose mitigation (re-weighting, post-hoc calibration) if disparities exceed thresholds.
  • Explain escalation when utility vs. fairness trade-offs conflict.
  8. Post-Launch Monitoring
  • Dashboards, alert thresholds, and a rollback plan tied to concrete SLAs.

For each section, provide at least one specific metric and a crisp decision rule you would actually use in production.
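To make the label definition in section 1 concrete, here is a minimal sketch of the "latest decision within 7 days" rule, assuming each item carries a list of timestamped moderator decisions. The names (resolve_label, LABEL_WINDOW) are illustrative, not part of the original question.

```python
# Sketch: resolve noisy moderation decisions into one training label.
# Assumption: decisions is a list of (timestamp, is_violation) moderator
# events for a single item; the rule is "latest decision within 7 days wins".
from datetime import datetime, timedelta
from typing import Optional

LABEL_WINDOW = timedelta(days=7)

def resolve_label(upload_time: datetime,
                  decisions: list[tuple[datetime, bool]],
                  now: datetime) -> Optional[bool]:
    """Return the resolved label, or None to exclude the item from training."""
    in_window = [(ts, verdict) for ts, verdict in decisions
                 if ts <= upload_time + LABEL_WINDOW]
    if not in_window:
        # Window still open with no decision: label lag, exclude for now.
        # Window closed with no decision: treat as a presumed negative.
        return None if now < upload_time + LABEL_WINDOW else False
    # The latest in-window decision wins, which also settles conflicts.
    return max(in_window, key=lambda d: d[0])[1]
```
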
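For the calibration step in section 3, one standard option is a post-hoc isotonic fit on a recent, fully labeled window (items whose 36-hour-plus label lag has passed). A minimal sketch with scikit-learn; the synthetic data at the bottom exists only to make the demo runnable.

```python
# Sketch: post-hoc isotonic calibration on a held-out, fully labeled window.
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.metrics import brier_score_loss

def fit_calibrator(raw_scores: np.ndarray, labels: np.ndarray) -> IsotonicRegression:
    # Isotonic regression maps raw scores to empirical violation rates while
    # preserving rank order; out_of_bounds="clip" handles unseen score ranges.
    iso = IsotonicRegression(out_of_bounds="clip")
    iso.fit(raw_scores, labels)
    return iso

# Synthetic demo data with a low base rate that rises with the raw score.
rng = np.random.default_rng(1)
raw = rng.random(200_000)
labels = (rng.random(200_000) < 0.002 * (1 + 4 * raw)).astype(int)

iso = fit_calibrator(raw, labels)
print("Brier score before:", brier_score_loss(labels, raw))
print("Brier score after: ", brier_score_loss(labels, iso.predict(raw)))
```
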
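The budget computation in section 4 reduces to a quantile lookup: with 10,000 reviews against roughly 2,000,000 daily items, t is about the 1 - 10,000/2,000,000 = 0.995 quantile of recent calibrated scores, and because calibrated scores approximate violation probabilities, the expected number of violations caught is the sum of scores above t. A sketch, assuming recent_scores is a representative sample of traffic (all names here are illustrative):

```python
# Sketch: pick a review threshold t under a fixed daily human-review budget.
# Assumption: recent_scores holds calibrated probabilities from a recent
# window that is representative of tomorrow's traffic.
import numpy as np

def pick_threshold(recent_scores: np.ndarray,
                   daily_volume: int = 2_000_000,
                   review_budget: int = 10_000) -> float:
    """Pick t so the expected count of items scoring >= t fits the budget."""
    budget_fraction = review_budget / daily_volume  # 0.005 here
    # Roughly budget_fraction of future items should land above this quantile.
    return float(np.quantile(recent_scores, 1.0 - budget_fraction))

def expected_violations_sent(recent_scores: np.ndarray, t: float,
                             daily_volume: int = 2_000_000) -> float:
    """With calibrated scores, E[violations above t] = sum of scores there."""
    above = recent_scores[recent_scores >= t]
    # Scale from the sample window to one day of traffic.
    return float(above.sum() * daily_volume / len(recent_scores))

# Demo on a skewed synthetic score distribution (heavy mass near zero).
rng = np.random.default_rng(0)
scores = rng.beta(0.1, 40.0, size=500_000)
t = pick_threshold(scores)
print(f"t = {t:.4f}, expected reviews/day ~ {(scores >= t).mean() * 2_000_000:.0f}, "
      f"expected violations caught/day ~ {expected_violations_sent(scores, t):.0f}")
```

Re-tuning under drift then amounts to recomputing t on a sliding window (for example daily, or hourly if volume is spiky) rather than treating it as a constant.
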
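For the drift detection in section 6, one common choice is the population stability index (PSI) on the score distribution, with the rule-of-thumb alert threshold of 0.2; that threshold is a policy choice, not something the question fixes. A minimal sketch:

```python
# Sketch: population stability index (PSI) as a covariate-drift alarm on
# the model's score distribution.
import numpy as np

def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """PSI between a reference window and the current window of scores."""
    # Bin edges from the reference quantiles; np.unique guards against
    # duplicate edges when the distribution is very skewed.
    edges = np.unique(np.quantile(reference, np.linspace(0.0, 1.0, bins + 1)))
    edges[0], edges[-1] = -np.inf, np.inf
    ref_frac = np.histogram(reference, edges)[0] / len(reference)
    cur_frac = np.histogram(current, edges)[0] / len(current)
    eps = 1e-6  # avoid log(0) on empty bins
    return float(np.sum((cur_frac - ref_frac)
                        * np.log((cur_frac + eps) / (ref_frac + eps))))

# Example decision rule: page on-call and trigger retraining if PSI > 0.2.
```
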
