Detect and evaluate "stolen" posts

Q: Detect and evaluate "stolen" posts

This is an Analytics & Experimentation interview question from Tools For Humanity for Data Scientist roles. View the full question and solution on PracHub.

Question

You are a Data Scientist at a Twitter-like app. The platform suspects an increase in “stolen posts” (users reposting/plagiarizing others’ content without attribution), and a new algorithm has been built to reduce stolen posts.

Answer the following product/DS questions.

1) What additional information do you need?

Given only the basic post table schema (post id/author/time/type/content/parent), list the additional data you would request to reliably determine whether a post is “stolen.”

2) What are drawbacks of your methodology?

Assume you propose a detection methodology (rules, ML, similarity search, etc.). Explain key limitations and failure modes.
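As one illustration of the "similarity search" option named above, here is a minimal sketch of near-duplicate detection using word shingles and Jaccard similarity. It is not a reference solution: the field names (post_id, author_id, created_at, content, parent_id), the 3-word shingle size, and the 0.6 threshold are all assumptions chosen for the example, and treating any post with a parent as attributed is a simplification.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of a lowercased, punctuation-stripped text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def flag_candidates(posts, threshold=0.6):
    """Flag later posts that closely match an earlier post by another author.

    posts: list of dicts with hypothetical keys
           post_id, author_id, created_at, content, parent_id.
    Returns a list of (suspect_post_id, earlier_post_id, similarity).
    """
    ordered = sorted(posts, key=lambda p: p["created_at"])
    signatures = [(p, shingles(p["content"])) for p in ordered]
    flagged = []
    for i, (later, later_sig) in enumerate(signatures):
        # Assumption: a post with a parent is an attributed repost or reply.
        if later["parent_id"] is not None:
            continue
        for earlier, earlier_sig in signatures[:i]:
            if later["author_id"] == earlier["author_id"]:
                continue  # reposting your own content is not theft
            sim = jaccard(later_sig, earlier_sig)
            if sim >= threshold:
                flagged.append((later["post_id"], earlier["post_id"], sim))
    return flagged
```

The sketch also makes several of the drawbacks concrete: it compares every pair of posts (quadratic cost, so a real system would likely need an approximate index such as MinHash with locality-sensitive hashing), it misses paraphrases, translations, and screenshots of text, it can flag common phrases or memes, and "earliest timestamp" does not prove authorship when content originated off-platform.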

3) What harms can stolen posts cause?

Describe potential user, creator, and platform harms. Include at least one harm that affects metrics or model feedback loops.

4) How do you evaluate the new algorithm’s effectiveness?

Design an evaluation plan (online experiment or quasi-experiment) to measure whether the algorithm reduces stolen posts without hurting the product.

Your plan should include:

A clear primary metric plus diagnostic and guardrail metrics
How you will handle confounding (e.g., seasonality, creator mix shifts, enforcement effects)
How to validate that the metric truly reflects “stolen post” reduction (label quality / delayed labels)
What decision rule you would use to launch/iterate
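The primary-metric comparison and decision rule in the plan above could be sketched as follows, assuming a randomized experiment in which a random sample of posts from each arm is human-labeled as stolen or not. The two-proportion z-test, the function names, the 5% significance level, and the 1% guardrail tolerance are illustrative assumptions, not a prescribed method; the sketch also ignores clustering by user, which a real analysis would need to address.

```python
import math


def two_proportion_ztest(x_c, n_c, x_t, n_t):
    """Compare stolen-post rates between control and treatment.

    x_*: labeled-stolen counts, n_*: labeled sample sizes.
    Returns (rate difference treatment - control, z statistic, two-sided p-value).
    Assumes the pooled rate is strictly between 0 and 1.
    """
    p_c, p_t = x_c / n_c, x_t / n_t
    pooled = (x_c + x_t) / (n_c + n_t)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    z = (p_t - p_c) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return p_t - p_c, z, p_value


def launch_decision(primary, guardrail_rel_change, alpha=0.05, max_guardrail_drop=0.01):
    """Hypothetical decision rule: launch only if the stolen-post rate falls
    significantly and the guardrail metric (e.g. original posts per creator)
    does not drop by more than the tolerated relative amount."""
    diff, _, p_value = primary
    if guardrail_rel_change < -max_guardrail_drop:
        return "iterate"
    if diff < 0 and p_value < alpha:
        return "launch"
    return "iterate"


# Made-up counts for illustration only: 120/2000 stolen in control, 80/2000 in treatment.
result = two_proportion_ztest(120, 2000, 80, 2000)
decision = launch_decision(result, guardrail_rel_change=-0.002)
```

Using human-labeled samples rather than the algorithm's own flags for the primary metric avoids grading the model with its own output; because labels arrive with a delay, the readout window would need to extend past the labeling lag.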
