You are a Data Scientist at a Twitter-like app. The platform suspects an increase in “stolen posts” (users reposting/plagiarizing others’ content without attribution), and a new algorithm has been built to reduce stolen posts.
Answer the following product/DS questions.
Given only the basic post table schema (post id/author/time/type/content/parent), list the additional data you would request to reliably determine whether a post is “stolen.”
Assume you propose a detection methodology (rules, ML, similarity search, etc.). Explain key limitations and failure modes.
Describe potential user, creator, and platform harms. Include at least one harm that affects metrics or model feedback loops.
Design an evaluation plan (online experiment or quasi-experiment) to measure whether the algorithm reduces stolen posts without hurting the product.
Your plan should include: