This question evaluates proficiency in ML system design for harmful-content detection, assessing competencies in multimodal modeling, taxonomy and labeling strategies, data strategy, privacy-aware architecture, real-time inference, human-in-the-loop decisioning, and operational reliability for global consumer platforms.
Design an end-to-end harmful content detection system. Define the taxonomy (e.g., hate, self-harm, sexual content, violence), labeling guidelines and quality controls, and multilingual/multimodal scope (text, image, audio, video). Propose model choices (keyword baselines, classical ML, transformers, multimodal encoders) and training data strategy (collection, active learning, long-tail sampling, debiasing). Specify inference architecture (streaming vs. batch), thresholds and severity tiers, human-in-the-loop review, appeals/override flows, and explainability requirements. Address adversarial behavior (evasion, prompt injection), privacy and safety constraints, fairness and error costs (precision/recall trade-offs by class and region), monitoring and drift detection, A/B rollout, and feedback loops for continuous improvement.
Hard | Machine Learning Engineer | Onsite | ML System Design
Design a Harmful Content Detection System (Multilingual, Multimodal)
Problem Statement
You are designing a trust-and-safety system for a large, mobile-first, ephemeral consumer social platform with a global user base that includes teens. Users share multimodal content across many surfaces: 1:1 and group chat, short-form video (spotlight/stories), AR lenses/effects, and live audio/video. The platform must detect and act on harmful content in near real-time while respecting privacy, regional policies, and user trust.
Design an end-to-end harmful content detection system. The problem is broken into four parts below; address each part, then deliver a cohesive design that integrates the components into a single operational system and explains your key assumptions and trade-offs.
Constraints & Assumptions
State your own assumptions where the prompt is silent, but anchor to these:
Scale: tens of thousands of pieces of content per second at peak, across multiple modalities; hundreds of millions of users globally.
Latency by surface: chat and uploads should be moderated before delivery (interactive, sub-second budgets); live streams are moderated continuously over a sliding window; some content can be re-scanned in batch.
Modalities: text (including OCR text-in-image/video), image, audio (ASR transcripts + acoustic cues), and video (visual frames + motion), plus metadata/context (surface, locale, age band, account history).
Ephemerality & privacy: content is short-lived by design; data minimization, encryption, regional legal constraints, and age-appropriate design for minors are hard requirements.
Severity is not binary: harmful categories range from "reduce reach" to "block + escalate to law enforcement" (e.g., CSAM), so the system must produce graded, policy-driven actions, not a single block/allow bit.
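As a rough illustration of the sliding-window constraint for live streams, the sketch below keeps per-chunk harm scores for the most recent seconds of a stream and exposes a mean and a peak. The class name `SlidingWindowScorer`, the 30-second default, and the choice of mean/peak aggregates are all assumptions made for illustration; nothing in the prompt prescribes them.

```python
from collections import deque


class SlidingWindowScorer:
    """Holds (timestamp, score) pairs for the last `window_s` seconds of a stream."""

    def __init__(self, window_s: float = 30.0):  # hypothetical default window
        self.window_s = window_s
        self._items = deque()

    def add(self, timestamp_s: float, score: float) -> None:
        """Record a new chunk score and evict chunks that fell out of the window."""
        self._items.append((timestamp_s, score))
        while self._items and self._items[0][0] <= timestamp_s - self.window_s:
            self._items.popleft()

    def mean(self) -> float:
        """Average score in the window; 0.0 when the window is empty."""
        if not self._items:
            return 0.0
        return sum(score for _, score in self._items) / len(self._items)

    def peak(self) -> float:
        """Highest single-chunk score in the window; 0.0 when empty."""
        return max((score for _, score in self._items), default=0.0)
```

A design along these lines would likely act on the peak for high-severity classes (one bad chunk is enough) and on the mean for lower-severity ones, but that split is a judgment call rather than a requirement.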
Clarifying Questions to Ask
A candidate should scope the whole problem before diving in. Good questions include:
Which harm categories are in scope and what are the legal/zero-tolerance ones (e.g., CSAM, terrorism) versus policy-discretion ones (e.g., bullying)?
What are the per-surface latency and action budgets: must we block chat/uploads pre-delivery, or is post-hoc takedown acceptable for some surfaces?
What is the relative cost of false positives vs. false negatives per category, and does it differ by region or by minor vs. adult accounts?
What languages and regions must we support at launch, including low-resource languages and locale-specific policy?
What labeled data, enforcement logs, and industry hash-sharing programs (e.g., NCMEC/GIFCT) do we have access to, and what are the retention/access constraints on training data?
Is there an on-device budget (model size, battery, NPU) for pre-upload checks, and what privacy guarantees must on-device inference satisfy?
Part 1 — Taxonomy, Labeling & Multimodal Scope
Define a practical, multi-label taxonomy of harm categories (e.g., hate/harassment, self-harm/suicide, sexual content, violence/gore, illegal/regulated) with subcategories and severity tiers. Specify the labeling guidelines and quality controls for annotation, and define the multilingual and multimodal scope plus context rules (multi-turn chat, text-in-image, satire/condemnation/educational context).
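One possible way to make "multi-label with subcategories and ordinal severity tiers" concrete is sketched below. The category and subcategory names, the four tier names, and the helper `max_severity` are hypothetical placeholders, not a proposed policy; a real taxonomy would be owned by the policy team.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Ordinal tiers; a higher value implies a stronger policy action."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


# Hypothetical category -> subcategory map, for illustration only.
TAXONOMY = {
    "hate_harassment": {"slur", "threat", "bullying"},
    "self_harm": {"ideation", "instructions", "promotion"},
    "sexual": {"adult_nudity", "solicitation"},
    "violence": {"gore", "incitement"},
    "illegal_regulated": {"drugs", "weapons"},
}


@dataclass(frozen=True)
class Label:
    """One (category, subcategory, severity) label; an item may carry several."""
    category: str
    subcategory: str
    severity: Severity

    def __post_init__(self):
        # Reject labels that are not part of the taxonomy.
        if self.subcategory not in TAXONOMY.get(self.category, set()):
            raise ValueError(f"unknown label {self.category}/{self.subcategory}")


def max_severity(labels):
    """Highest tier among the labels on one item, or None if it has no labels."""
    return max((label.severity for label in labels), default=None)
```

Keeping severity as a separate ordinal field, rather than baking it into the class name, lets the same subcategory carry different tiers depending on context (for example, condemnation versus promotion).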
What This Part Should Cover
A coherent, hierarchical multi-label taxonomy with subcategories AND ordinal severity tiers (not a flat list of classes).
Concrete annotation quality controls: gold seeding, IAA targets, adjudication, SME tiers for the most sensitive classes.
Multimodal coverage mapped to the taxonomy (text/OCR, image, audio/ASR, video) and explicit context rules for multi-turn and intent.
Awareness of sensitive-class handling (e.g., never collecting/annotating real CSAM directly; using hashes/redacted signals).
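An IAA (inter-annotator agreement) target needs a metric behind it. The sketch below computes Cohen's kappa for two annotators on a single-label task; for a multi-label taxonomy one would typically compute it per label as a binary present/absent decision. The function name and the example labels are illustrative assumptions.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items in order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items on which the annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)
```

For example, `cohens_kappa(["hate", "none", "none", "hate"], ["hate", "none", "hate", "hate"])` gives observed agreement 0.75 and chance agreement 0.5, so kappa is 0.5.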
Part 2 — Modeling & Training Data Strategy
Propose a modeling stack spanning keyword/heuristic baselines, classical ML, modern transformers, and multimodal encoders, and justify which tool fits which job. Then describe the training-data strategy: data collection, active learning, long-tail sampling, debiasing, and handling sensitive classes.
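The cost-versus-accuracy argument for a cascade can be sketched in a few lines: a cheap model handles confident cases and only the uncertain middle band is escalated to an expensive model. The function name, the callables, and the `low`/`high` cut-offs below are hypothetical; in practice the band would be tuned per class, and zero-tolerance classes might never take the early "clearly benign" exit.

```python
def cascade_score(item, cheap_model, expensive_model, low=0.05, high=0.95):
    """Return (score, stage): escalate only when the cheap score is uncertain.

    `cheap_model` and `expensive_model` are any callables mapping an item to a
    probability in [0, 1]; `low` and `high` are illustrative cut-offs.
    """
    p = cheap_model(item)
    if p <= low or p >= high:
        return p, "cheap"          # confident enough to stop early
    return expensive_model(item), "expensive"
```

The fraction of traffic that falls in the middle band is what drives serving cost, so it is worth tracking as an operational metric alongside accuracy.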
What This Part Should Cover
A justified progression from baselines to advanced models, with the right model per modality and a clear cascade/escalation rationale (cost vs. accuracy).
Score calibration (e.g., Platt scaling or isotonic regression, fitted per class and locale) and out-of-distribution / low-confidence routing.
A realistic data pipeline: sources (enforcement logs, public datasets, hash-sharing, synthetic), active learning, long-tail oversampling, hard-negative mining.
Concrete debiasing and sensitive-class practices, including how CSAM/terror data is handled without direct collection.
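To show what score calibration involves, here is a minimal pure-Python sketch of isotonic regression via the pool-adjacent-violators algorithm, fitted on held-out (score, label) pairs for one class. The function names are hypothetical, the output is a plain step function without interpolation, and tied scores are not pooled specially; a production system would more likely rely on a maintained library implementation.

```python
import bisect


def fit_isotonic(scores, labels):
    """Fit a non-decreasing step function from raw scores to label frequency.

    Returns (uppers, means): `uppers[i]` is the largest raw score in block i
    and `means[i]` is the calibrated probability for that block.
    """
    blocks = []  # each block is [label_sum, count, max_score]
    for score, label in sorted(zip(scores, labels)):
        blocks.append([float(label), 1, score])
        # Pool adjacent blocks while the monotonicity constraint is violated.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            label_sum, count, upper = blocks.pop()
            blocks[-1][0] += label_sum
            blocks[-1][1] += count
            blocks[-1][2] = upper
    uppers = [b[2] for b in blocks]
    means = [b[0] / b[1] for b in blocks]
    return uppers, means


def calibrate(score, uppers, means):
    """Map a raw score to the calibrated probability of its block."""
    i = bisect.bisect_left(uppers, score)
    return means[min(i, len(means) - 1)]
```

With scores `[0.1, 0.2, 0.3, 0.4, 0.5, 0.6]` and labels `[0, 0, 1, 0, 1, 1]`, the out-of-order pair at 0.3 and 0.4 is pooled into one block with calibrated probability 0.5.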
Part 3 — Inference Architecture & Decisioning
Specify the inference architecture (streaming vs. batch, on-device vs. server, latency targets by surface), the decisioning layer (thresholds, severity tiers → policy actions such as block, age-gate, interstitial, downrank, quarantine), the human-in-the-loop flows (triage queues, escalation, appeals/overrides), and the explainability requirements for moderators and for user-facing transparency.
What This Part Should Cover
A concrete serving topology (per-modality microservices + fusion, autoscaling, model registry/versioning, canary) with per-surface latency budgets and streaming-vs-batch placement.
A decision engine that maps calibrated scores + severity + context to graded actions, with cost-sensitive (not arbitrary) thresholds.
A complete human-in-the-loop design: triage/escalation, crisis routing for self-harm, appeals/overrides, audit logging.
Explainability for both moderators (evidence, saliency, exemplars) and users (reason codes without revealing exact rules).
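A cost-sensitive threshold can be derived rather than picked by hand. If p is a calibrated probability of harm, acting has expected cost (1 - p) * C_FP and not acting has expected cost p * C_FN, so acting is cheaper once p >= C_FP / (C_FP + C_FN). The sketch below applies that rule and maps severity tiers to graded actions. The action names, the 1 to 4 tier numbering, and the `review_fraction` band for human review are assumptions for illustration only.

```python
# Hypothetical severity tier -> action ladder.
ACTIONS = {1: "downrank", 2: "interstitial", 3: "block", 4: "block_and_escalate"}


def cost_sensitive_threshold(cost_fp: float, cost_fn: float) -> float:
    """Probability above which acting has lower expected cost than allowing."""
    if cost_fp <= 0 or cost_fn <= 0:
        raise ValueError("costs must be positive")
    return cost_fp / (cost_fp + cost_fn)


def decide(p: float, severity: int, cost_fp: float, cost_fn: float,
           review_fraction: float = 0.5) -> str:
    """Map a calibrated score and severity tier to an action.

    Scores just under the threshold (within `review_fraction` of it) go to
    human review instead of being allowed outright.
    """
    threshold = cost_sensitive_threshold(cost_fp, cost_fn)
    if p >= threshold:
        return ACTIONS[severity]
    if p >= threshold * review_fraction:
        return "human_review"
    return "allow"
```

With a false negative judged nine times as costly as a false positive, the threshold is 0.1; per-region or minor-account policies could be expressed by supplying different cost pairs rather than by editing thresholds directly. The derivation only holds if the scores are calibrated.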
Part 4 — Risk, Privacy, Fairness & Reliability
Address the operational-risk dimensions: adversarial behavior (evasion via obfuscation/text-in-image/coded language; prompt injection for any LLM components; model hardening), privacy & safety constraints (data minimization, retention, encryption, age-appropriate design), fairness & error costs (precision/recall trade-offs by class and region; group fairness), and monitoring/drift/A/B rollout/feedback loops for continuous improvement.
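One common first line of defense against text obfuscation is to normalize input before it reaches keyword lists or classifiers. The sketch below is a minimal example using only the Python standard library; the leetspeak map, the zero-width character set, and the separator rule are small illustrative assumptions, not a complete defense. Because the digit substitutions also mangle legitimate numbers, the normalized string is best treated as an extra view of the text alongside the original.

```python
import re
import unicodedata

# Illustrative leetspeak substitutions; real lists are larger and per-language.
LEET = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a",
                      "5": "s", "7": "t", "@": "a", "$": "s"})
ZERO_WIDTH = re.compile("[\u200b\u200c\u200d\u2060\ufeff]")
INNER_SEPARATORS = re.compile(r"(?<=\w)[.\-_*](?=\w)")
LONG_RUNS = re.compile(r"(.)\1{2,}")


def normalize(text: str) -> str:
    """Return a canonical view of `text` for matching against known terms."""
    text = unicodedata.normalize("NFKC", text)  # fold fullwidth/compatibility forms
    text = ZERO_WIDTH.sub("", text)             # drop invisible joiners and spaces
    text = text.casefold().translate(LEET)      # lowercase, undo simple leetspeak
    text = INNER_SEPARATORS.sub("", text)       # "b.a.d" -> "bad"
    text = LONG_RUNS.sub(r"\1\1", text)         # "baaaad" -> "baad"
    return text
```

Normalization of this kind handles only the cheapest evasions; coded language and text-in-image still need model-based or OCR-based handling.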
What This Part Should Cover
A credible adversarial threat model and concrete defenses, including prompt-injection handling if any LLM is used.
Privacy-by-design: data minimization, short retention for ephemeral content, encryption, least-privilege reviewer access, stricter rules for minors.
A fairness framework with per-group/region metrics and counterfactual evaluation, tied to per-class error costs.
A reliability loop: monitoring/SLOs, data/concept/embedding drift detection, staged rollout (shadow → canary → A/B) with kill-switch, and feedback loops that avoid self-reinforcement.
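As one example of a data-drift signal, the sketch below computes the Population Stability Index (PSI) between a reference sample of model scores and a recent sample, both assumed to lie in [0, 1]. The function name, the ten equal-width bins, and the `eps` floor for empty bins are assumptions; PSI is only one of several possible drift metrics and says nothing about concept drift on its own.

```python
import math


def psi(expected, actual, bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between two samples of scores in [0, 1]."""
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")

    def bin_fractions(values):
        counts = [0] * bins
        for v in values:
            index = min(max(int(v * bins), 0), bins - 1)  # clamp to a valid bin
            counts[index] += 1
        # Floor each fraction at eps so the log ratio stays finite.
        return [max(c / len(values), eps) for c in counts]

    e, a = bin_fractions(expected), bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

Identical distributions give 0 and every term is non-negative, so the value only grows as the two histograms diverge. A commonly cited rule of thumb treats values above roughly 0.25 as a significant shift, but alert thresholds should be validated per class and locale rather than taken as given.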
What a Strong Answer Covers
Across all four parts, a strong answer is judged on how well the pieces cohere into one operational system rather than four disconnected essays:
End-to-end coherence: the severity tiers from Part 1 are exactly what the decision engine in Part 3 acts on; the calibration in Part 2 is what makes the cost-sensitive thresholds in Part 3 valid; the drift/fairness signals in Part 4 feed the active-learning loop in Part 2.
Explicit trade-offs: latency vs. accuracy (cascade design), precision vs. recall by severity, on-device privacy vs. capability, transparency vs. gameability.
Safety-first prioritization: zero-tolerance categories are conservative-by-default (human-review floor, low thresholds, hash-matching) while discretionary categories favor reduce-reach over over-blocking.
Realism on the hard parts: ephemerality/privacy constraints, low-resource languages, adversarial evasion, and feedback-loop bias are acknowledged, not hand-waved.
Follow-up Questions
A new evasion pattern (e.g., a coded slang term) spreads overnight and your models miss it. Walk through your rapid-response path from detection to a deployed fix without retraining from scratch.
Your CSAM classifier's precision is excellent but a regulator demands near-zero false negatives. How does this change your thresholds, human-review staffing, and the noisy-OR fusion across modalities?
For minor accounts you want stricter enforcement, but you must also minimize data collected about minors. How do you reconcile stricter thresholds with data minimization?
How would you design the A/B test for a new model when you cannot expose users to known-harmful content in the control arm? What is your metric and how do you avoid ethical/legal problems with the holdback?
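For reference when discussing the noisy-OR follow-up, noisy-OR fusion combines per-modality probabilities as 1 minus the product of (1 - p_i), which assumes the modalities fire independently. The sketch below is a minimal version with a hypothetical function name; it is a definition to reason from, not a recommended production fusion.

```python
import math


def noisy_or(probabilities) -> float:
    """Fuse per-modality harm probabilities assuming independent detectors."""
    return 1.0 - math.prod(1.0 - p for p in probabilities)
```

The fused score is never lower than the largest input, so noisy-OR favors recall: any single confident modality is enough to push the item over a threshold, at the cost of more false positives when modality errors are correlated.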