Context
You are a Data Scientist at a social media platform working on harmful content (e.g., hate/harassment, self-harm, violence, sexual exploitation, misinformation). A team proposes a new intervention (e.g., a ranking demotion, a removal policy change, or a new ML classifier plus enforcement workflow).
Task
- Define “severity of harmful content.”
- Propose a severity framework that supports both measurement and decision-making.
- Explain the tradeoffs (e.g., interpretability vs. sensitivity, policy alignment, subjectivity, multilingual considerations).
- Design metrics (primary/diagnostic/guardrails).
  - Provide at least:
    - One **primary** metric capturing harmful-content impact.
    - Several **diagnostic** metrics that help explain movement.
    - Several **guardrail** metrics to detect unintended harm.
  - Specify exact definitions (numerators/denominators) and any weighting (e.g., by exposure).
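As a concrete illustration of an exposure-weighted primary metric, the sketch below computes a severity-weighted harmful-view prevalence. The severity tiers and weights are hypothetical placeholders, not an actual policy taxonomy; the numerator is severity-weighted views of harmful items and the denominator is total views.

```python
from dataclasses import dataclass

# Hypothetical severity tiers and weights -- illustrative only.
SEVERITY_WEIGHT = {"low": 1.0, "medium": 3.0, "high": 9.0}

@dataclass
class ContentStats:
    views: int      # exposure: how many times the item was viewed
    harmful: bool   # labeled harmful by policy rules / review / ML
    severity: str   # severity tier, meaningful only when harmful

def weighted_harm_prevalence(items: list) -> float:
    """Exposure-weighted harmful-view prevalence.

    Numerator:   views of harmful items, weighted by severity tier.
    Denominator: total views across all items.
    """
    numerator = sum(i.views * SEVERITY_WEIGHT[i.severity]
                    for i in items if i.harmful)
    denominator = sum(i.views for i in items)
    return numerator / denominator if denominator else 0.0
```

For example, 100 benign views plus 10 high-severity harmful views gives (10 × 9) / 110 ≈ 0.82 weighted harmful views per view, versus 10/110 unweighted — the weighting makes the metric sensitive to severity, at some cost in interpretability.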
- Design an experiment to evaluate the intervention.
  - Choose an appropriate **randomization unit** (viewer/user, author, content item, session, community/cluster, geo, etc.) and justify it.
  - Discuss common pitfalls: interference/spillover, network effects, contamination, novelty effects, delayed outcomes, and measurement error.
  - Explain how you would analyze the results (e.g., intent-to-treat vs. per-protocol, variance reduction, segmentation, multiple testing).
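For the variance-reduction point, one standard technique is CUPED, which adjusts each unit's in-experiment metric using its pre-experiment value of the same metric. A minimal sketch, assuming user-level randomization and NumPy arrays aligned by user:

```python
import numpy as np

def cuped_adjust(y: np.ndarray, x: np.ndarray) -> np.ndarray:
    """CUPED adjustment: y_adj = y - theta * (x - mean(x)).

    y: per-user metric measured during the experiment.
    x: the same metric for the same users, pre-experiment.
    theta = cov(y, x) / var(x) minimizes the adjusted variance.
    The mean of y is preserved, so the treatment-effect estimate
    is unchanged while its variance shrinks.
    """
    theta = np.cov(y, x)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean())
```

The higher the correlation between the pre-period and in-experiment metric, the larger the variance reduction; for rare harmful-content outcomes this can substantially shorten experiment runtimes.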
- Pros/cons and edge cases.
  - Highlight failure modes: label noise, policy changes during the test, adversarial behavior, reporting bias, and differences across regions/languages.
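One way to reason about label noise concretely: if the classifier's sensitivity and specificity are known (e.g., estimated from a human-reviewed audit sample), the observed prevalence can be corrected with the Rogan-Gladen estimator. A sketch under those assumptions — note that error rates often differ by region/language, so in practice the correction would be applied per segment:

```python
def corrected_prevalence(p_obs: float, sensitivity: float,
                         specificity: float) -> float:
    """Rogan-Gladen correction for prevalence measured with an
    imperfect classifier.

    p_obs: fraction of items the classifier flags as harmful.
    Returns an estimate of the true harmful fraction:
        (p_obs + specificity - 1) / (sensitivity + specificity - 1)
    Valid when sensitivity + specificity > 1; the estimate is
    clipped to [0, 1] since sampling noise can push it outside.
    """
    est = (p_obs + specificity - 1) / (sensitivity + specificity - 1)
    return min(1.0, max(0.0, est))
```

For instance, with sensitivity 0.8 and specificity 0.95, a true prevalence of 10% produces an observed flag rate of 0.1 × 0.8 + 0.9 × 0.05 = 12.5%; the correction recovers the 10% figure.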
Assumptions
- You can log impressions/views, engagement, reports, enforcement actions, and model outputs.
- “Harmful content” is determined by a combination of policy rules, human review, and ML signals (imperfect).