PracHub

Design and critique an abuse-detection ML system

Last updated: Mar 29, 2026

Quick Overview

This question evaluates system-design and production machine learning competencies: large-scale classification versus risk scoring, handling extreme class imbalance and delayed labels, calibration and thresholding under a fixed human-review budget, near-real-time feature engineering, robustness and drift detection, and privacy and fairness trade-offs. It is commonly asked in Machine Learning interviews for Data Scientist roles to test a candidate's ability to balance statistical objectives, operational constraints, and ethical considerations. The category tested is Machine Learning (Trust & Safety), and the question spans both conceptual understanding and practical application in production systems.


Company: Google

Role: Data Scientist

Category: Machine Learning

Difficulty: hard

Interview Round: Onsite



Oct 13, 2025, 9:49 PM

ML System Design: Abusive Content Detection and Triage (Trust & Safety)

Context: You are designing an ML system to identify and triage abusive content uploads in a Trust & Safety context. Only 0.2% of items are true violations, and human labels arrive with a median delay of 36 hours. The system must operate near real-time and route a fixed number of items to human review daily.

Cover the following, with concrete choices and trade-offs:

  1. Problem Framing
  • Binary classification vs. risk scoring.
  • Objective appropriate for extreme class imbalance (e.g., PR-AUC or cost-weighted utility).
  • Define the positive label precisely given noisy moderation decisions (e.g., prioritize the latest decision within 7 days; handle conflicting labels).
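One way to make the label definition concrete is sketched below. It assumes a hypothetical `Decision` record and a hypothetical `resolve_label` helper; neither comes from the question. The rule shown is "the latest moderation decision within 7 days of upload wins", with simultaneous conflicting decisions left unlabeled so they can go to adjudication instead of training.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass(frozen=True)
class Decision:
    """Hypothetical moderation decision record (an assumption for this sketch)."""
    decided_at: datetime
    is_violation: bool


def resolve_label(
    uploaded_at: datetime,
    decisions: List[Decision],
    window: timedelta = timedelta(days=7),
) -> Optional[int]:
    """Return 1 or 0 from the latest decision inside the window, else None.

    None means "do not train on this item yet": either no decision landed
    inside the window, or the latest decisions conflict at the same timestamp.
    """
    in_window = [
        d for d in decisions
        if uploaded_at <= d.decided_at <= uploaded_at + window
    ]
    if not in_window:
        return None
    latest_time = max(d.decided_at for d in in_window)
    latest_votes = {d.is_violation for d in in_window if d.decided_at == latest_time}
    if len(latest_votes) > 1:
        return None  # conflicting simultaneous decisions: hold out for adjudication
    return int(latest_votes.pop())
```

Returning `None` instead of guessing keeps noisy items out of both training and evaluation; the share of `None` labels is itself a metric worth tracking.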
  2. Data and Features
  • Architecture for near-real-time features (user history aggregates, text/image embeddings, graph signals).
  • Leakage audits.
  • Privacy constraints (minimize PII retention; differential privacy or k-anonymity where needed).
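A common leakage source with delayed labels is a user-history aggregate that counts violations whose labels were not yet known when the item was scored. The sketch below shows a point-in-time version of such a feature. The event schema (`user_id`, `is_violation`, `labeled_at`) and the function name are assumptions made for illustration.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable


def prior_violation_count(
    events: Iterable[Dict],
    user_id: str,
    as_of: datetime,
    lookback: timedelta = timedelta(days=30),
) -> int:
    """Count a user's confirmed violations whose label was known before `as_of`.

    Filtering on `labeled_at` (not upload time) keeps the offline feature
    identical to what the online system could have computed at scoring time.
    """
    return sum(
        1
        for e in events
        if e["user_id"] == user_id
        and e["is_violation"]
        and as_of - lookback <= e["labeled_at"] < as_of
    )
```

A simple leakage audit is to recompute the feature offline with `as_of` set to each item's scoring time and compare it to the value logged online; any mismatch points to a future-information leak.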
  3. Training
  • Sampling/weighting strategy (e.g., class weights, focal loss, hard negative mining).
  • Handling delayed labels (label lag queues, exclusion windows).
  • Calibration (isotonic/Platt).
  • Validation split that respects time and user leakage.
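A minimal sketch of a validation split that respects both time and user leakage follows. It assumes items are dicts with hypothetical `user_id` and `uploaded_at` keys, and it uses a 7-day exclusion window before the cutoff so that training only sees items whose labels have had time to settle. The 7-day value mirrors the label window in the prompt and is a choice, not a requirement.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple


def time_user_split(
    items: List[Dict],
    cutoff: datetime,
    label_lag: timedelta = timedelta(days=7),
) -> Tuple[List[Dict], List[Dict]]:
    """Split items into (train, valid) by time, with an exclusion gap.

    - train: uploaded before `cutoff - label_lag` (labels are mature)
    - gap:   uploaded in [cutoff - label_lag, cutoff), dropped entirely
    - valid: uploaded at or after `cutoff`, from users absent from train
    """
    train = [it for it in items if it["uploaded_at"] < cutoff - label_lag]
    train_users = {it["user_id"] for it in train}
    valid = [
        it for it in items
        if it["uploaded_at"] >= cutoff and it["user_id"] not in train_users
    ]
    return train, valid
```

Dropping returning users from validation skews it toward new accounts, so in practice it may be worth reporting metrics on both the user-disjoint slice and the full post-cutoff slice.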
  4. Thresholding Under Review Budget
  • Given 2,000,000 daily items and a human review budget of 10,000/day, describe exactly how to pick a threshold t on calibrated scores that maximizes the expected number of true violations sent to review, subject to the budget.
  • Include the computation using score quantiles from a recent labeled window and how to re-tune t as demand drifts.
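With calibrated scores, the expected number of true violations in any reviewed set is the sum of its scores, so the budget-constrained optimum is simply the top-scoring items. The budget of 10,000 out of 2,000,000 items is a review fraction of 0.5%, which places t at roughly the 99.5th percentile of recent scores. For scale, a 0.2% base rate implies about 4,000 true violations per day, so the budget exceeds the number of positives. The sketch below computes t from a window of recent scores; the function name and return shape are assumptions for illustration, and ties at t are not broken (a production version might sample randomly among tied items).

```python
from typing import List, Tuple


def pick_threshold(
    window_scores: List[float],
    daily_volume: int,
    daily_budget: int,
) -> Tuple[float, float, float]:
    """Pick t so that items with score >= t fill the daily review budget.

    Returns (t, expected_reviews_per_day, expected_violations_per_day).
    Scores are assumed calibrated, so their sum estimates true violations.
    """
    if not window_scores:
        raise ValueError("need at least one score from the recent window")
    ranked = sorted(window_scores, reverse=True)
    n = len(ranked)
    # Number of window items that corresponds to the budgeted fraction.
    k = max(1, min(n, daily_budget * n // daily_volume))
    t = ranked[k - 1]
    selected = [s for s in ranked if s >= t]
    scale = daily_volume / n  # extrapolate from the window to a full day
    return t, len(selected) * scale, sum(selected) * scale
```

To re-tune as demand drifts, one option is to rerun this on a rolling window with a forecast of `daily_volume`, and to raise an alert when the realized review count departs from the budget by more than a chosen tolerance.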
  5. Online Evaluation
  • Guardrail metrics (latency, false positive rate on high-trust creators, geographic fairness).
  • Interleaving/canary design.
  • How to measure lift vs baseline heuristics with delayed ground truth.
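Because labels arrive late, a lift estimate that includes recent items understates precision for whichever arm has more unlabeled traffic. One hedged approach is to compare arms only on matured cohorts, as sketched below. The record keys (`uploaded_at`, `label`), the 7-day maturity window, and both function names are assumptions for illustration.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional


def matured_precision(
    reviewed: List[Dict],
    now: datetime,
    maturity: timedelta = timedelta(days=7),
) -> Optional[float]:
    """Precision over reviewed items old enough for their label to be final."""
    matured = [
        r for r in reviewed
        if now - r["uploaded_at"] >= maturity and r["label"] is not None
    ]
    if not matured:
        return None
    return sum(r["label"] for r in matured) / len(matured)


def precision_lift(
    model_reviewed: List[Dict],
    baseline_reviewed: List[Dict],
    now: datetime,
    maturity: timedelta = timedelta(days=7),
) -> Optional[float]:
    """Ratio of model precision to baseline precision on matured cohorts."""
    p_model = matured_precision(model_reviewed, now, maturity)
    p_base = matured_precision(baseline_reviewed, now, maturity)
    if p_model is None or not p_base:
        return None  # not enough matured data, or baseline precision is zero
    return p_model / p_base
```

Both arms should be given the same review budget so that precision at budget translates directly into violations caught per day.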
  6. Robustness and Drift
  • Detection of covariate/label drift, adversarial adaptation signals.
  • Periodic re-training policy.
  • Fail-safe degradations during P0 incidents.
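For covariate drift, one widely used label-free signal is the Population Stability Index (PSI) between a reference score distribution and the current one. The sketch below uses equal-width bins on scores in [0, 1] for brevity. With a 0.2% base rate most scores cluster near zero, so quantile bins taken from the reference window would likely be more informative. The function name and defaults are assumptions.

```python
import math
from typing import List


def psi(
    expected: List[float],
    actual: List[float],
    n_bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """Population Stability Index between two samples of scores in [0, 1]."""

    def fractions(xs: List[float]) -> List[float]:
        counts = [0] * n_bins
        for x in xs:
            # Clamp so that a score of exactly 1.0 falls in the last bin.
            counts[min(n_bins - 1, int(x * n_bins))] += 1
        # Floor at eps so empty bins do not produce log(0) or division by zero.
        return [max(c / len(xs), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

A conventional rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift; those cut-offs are heuristics and should be validated against this system's own history before being wired to alerts or retraining triggers.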
  7. Ethics and Fairness
  • Define and monitor group-specific error rates; propose mitigation (re-weighting, post-hoc calibration) if disparities exceed thresholds.
  • Explain escalation when utility vs fairness trade-offs conflict.
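Group-specific error rates can be monitored with a small aggregation like the one sketched here, which computes the false positive rate per group and the ratio between the worst and best group. The record layout `(group, flagged, is_violation)` and the function names are assumptions; the disparity threshold that triggers mitigation is a policy choice the question leaves open.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def group_fpr(records: Iterable[Tuple[str, bool, bool]]) -> Dict[str, float]:
    """False positive rate per group from (group, flagged, is_violation) rows."""
    false_pos: Dict[str, int] = defaultdict(int)
    negatives: Dict[str, int] = defaultdict(int)
    for group, flagged, is_violation in records:
        if not is_violation:
            negatives[group] += 1
            false_pos[group] += int(flagged)
    return {g: false_pos[g] / negatives[g] for g in negatives}


def max_fpr_ratio(fprs: Dict[str, float]) -> float:
    """Ratio of the highest to the lowest group FPR (1.0 means parity)."""
    lo, hi = min(fprs.values()), max(fprs.values())
    if lo == 0:
        return 1.0 if hi == 0 else math.inf
    return hi / lo
```

An unbiased FPR needs labels for items that were not flagged, which in practice usually means labeling a small random audit sample per group in addition to the review queue.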
  8. Post-Launch Monitoring
  • Dashboards, alert thresholds, and a rollback plan tied to concrete SLAs.

For each section, provide at least one specific metric and a crisp decision rule you would actually use in production.
