You are a Data Scientist at a Twitter-like app. The platform suspects an increase in "stolen posts" (users reposting or plagiarizing others' content without attribution), and a new algorithm has been built to reduce them.
Answer the following product/DS questions.
1) What additional information do you need?
Given only the basic post table schema (post id/author/time/type/content/parent), list the additional data you would request to reliably determine whether a post is “stolen.”
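Whatever additional data you request, detection ultimately reduces to comparing a post's content against earlier posts. A minimal sketch of one common approach (character-shingle Jaccard similarity) is below; all names here (`shingles`, `SIM_THRESHOLD`, the sample texts) are illustrative assumptions, not part of the prompt, and a production system would use scalable approximations such as MinHash/LSH rather than exact set comparison.

```python
# Hypothetical sketch: flag a candidate post as a near-duplicate of an
# earlier post using Jaccard similarity over k-character shingles.

def shingles(text: str, k: int = 5) -> set:
    """Return the set of k-character shingles of a normalized text."""
    t = " ".join(text.lower().split())  # collapse whitespace, lowercase
    return {t[i:i + k] for i in range(max(len(t) - k + 1, 1))}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A & B| / |A | B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

SIM_THRESHOLD = 0.8  # illustrative cutoff for "near-duplicate"

def is_near_duplicate(candidate: str, original: str) -> bool:
    return jaccard(shingles(candidate), shingles(original)) >= SIM_THRESHOLD

original  = "Just shipped our new feature, so proud of the team!"
copied    = "Just shipped our new feature, so proud of the team"
unrelated = "Coffee first, then we talk about deadlines."

print(is_near_duplicate(copied, original))     # True
print(is_near_duplicate(unrelated, original))  # False
```

Note the threshold itself is a policy choice: too low and legitimate quote-posts get flagged, too high and lightly paraphrased theft slips through, which connects directly to the drawbacks asked about in question 2.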
2) What are drawbacks of your methodology?
Assume you propose a detection methodology (rules, ML, similarity search, etc.). Explain key limitations and failure modes.
3) What harms can stolen posts cause?
Describe potential user, creator, and platform harms. Include at least one harm that affects metrics or model feedback loops.
4) How do you evaluate the new algorithm’s effectiveness?
Design an evaluation plan (online experiment or quasi-experiment) to measure whether the algorithm reduces stolen posts without hurting the product.
Your plan should include:
- A clear primary metric, plus diagnostic and guardrail metrics
- How you will handle confounding (e.g., seasonality, creator mix shifts, enforcement effects)
- How to validate that the metric truly reflects "stolen post" reduction (label quality / delayed labels)
- What decision rule you would use to launch/iterate
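As one concrete example of the decision-rule bullet, the launch criterion could be a two-proportion z-test comparing the rate of posts later labeled "stolen" in treatment vs. control. Everything below is an illustrative assumption: the counts are made up, and real labels would arrive with delay, so the test would run on a fixed post-experiment labeling window.

```python
# Hypothetical sketch of a launch decision rule: two-proportion z-test
# on the stolen-post rate in treatment vs. control arms.

from math import sqrt

def two_prop_z(stolen_c: int, n_c: int, stolen_t: int, n_t: int) -> float:
    """z-statistic for H0: stolen-post rates are equal in both arms."""
    p_c, p_t = stolen_c / n_c, stolen_t / n_t
    p_pool = (stolen_c + stolen_t) / (n_c + n_t)       # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    return (p_t - p_c) / se

# Made-up counts: 1.2% stolen-post rate in control, 0.9% in treatment.
z = two_prop_z(stolen_c=1200, n_c=100_000, stolen_t=900, n_t=100_000)

# Launch only if the reduction is significant AND guardrail metrics
# (posting volume, creator retention, appeal rates) are flat.
launch = z < -1.96
print(round(z, 2), launch)
```

A fuller plan would pre-register the minimum detectable effect, check guardrails before acting on the primary metric, and prefer a quasi-experimental design (e.g., difference-in-differences over a staged rollout) if user-level randomization leaks across the repost graph.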