You are a Data Scientist at a Twitter-like app. The platform suspects an increase in "stolen posts" (users reposting or plagiarizing others' content without attribution), and a new algorithm has been built to reduce them.
Answer the following product/DS questions.
1) What additional information do you need?
Given only the basic post table schema (post id/author/time/type/content/parent), list the additional data you would request to reliably determine whether a post is “stolen.”
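Whatever additional data you request, detection ultimately reduces to comparing a post's content against earlier posts. A minimal sketch of one common approach (character-shingle Jaccard similarity) is below; all names here (`shingles`, `SIM_THRESHOLD`, the sample texts) are illustrative assumptions, not part of the prompt, and a production system would use scalable approximations such as MinHash/LSH rather than exact set comparison.

```python
# Hypothetical sketch: flag a candidate post as a near-duplicate of an
# earlier post using Jaccard similarity over k-character shingles.

def shingles(text: str, k: int = 5) -> set:
    """Return the set of k-character shingles of a normalized text."""
    t = " ".join(text.lower().split())  # collapse whitespace, lowercase
    return {t[i:i + k] for i in range(max(len(t) - k + 1, 1))}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A & B| / |A | B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

SIM_THRESHOLD = 0.8  # illustrative cutoff for "near-duplicate"

def is_near_duplicate(candidate: str, original: str) -> bool:
    return jaccard(shingles(candidate), shingles(original)) >= SIM_THRESHOLD

original  = "Just shipped our new feature, so proud of the team!"
copied    = "Just shipped our new feature, so proud of the team"
unrelated = "Coffee first, then we talk about deadlines."

print(is_near_duplicate(copied, original))     # True
print(is_near_duplicate(unrelated, original))  # False
```

Note the threshold itself is a policy choice: too low and legitimate quote-posts get flagged, too high and lightly paraphrased theft slips through, which connects directly to the drawbacks asked about in question 2.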
2) What are drawbacks of your methodology?
Assume you propose a detection methodology (rules, ML, similarity search, etc.). Explain key limitations and failure modes.
3) What harms can stolen posts cause?
Describe potential user, creator, and platform harms. Include at least one harm that affects metrics or model feedback loops.
4) How do you evaluate the new algorithm’s effectiveness?
Design an evaluation plan (online experiment or quasi-experiment) to measure whether the algorithm reduces stolen posts without hurting the product.
Your plan should include:
- A clear primary metric, plus diagnostic and guardrail metrics
- How you will handle confounding (e.g., seasonality, creator mix shifts, enforcement effects)
- How to validate that the metric truly reflects "stolen post" reduction (label quality / delayed labels)
- What decision rule you would use to launch/iterate
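As one concrete example of the decision-rule bullet, the launch criterion could be a two-proportion z-test comparing the rate of posts later labeled "stolen" in treatment vs. control. Everything below is an illustrative assumption: the counts are made up, and real labels would arrive with delay, so the test would run on a fixed post-experiment labeling window.

```python
# Hypothetical sketch of a launch decision rule: two-proportion z-test
# on the stolen-post rate in treatment vs. control arms.

from math import sqrt

def two_prop_z(stolen_c: int, n_c: int, stolen_t: int, n_t: int) -> float:
    """z-statistic for H0: stolen-post rates are equal in both arms."""
    p_c, p_t = stolen_c / n_c, stolen_t / n_t
    p_pool = (stolen_c + stolen_t) / (n_c + n_t)       # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    return (p_t - p_c) / se

# Made-up counts: 1.2% stolen-post rate in control, 0.9% in treatment.
z = two_prop_z(stolen_c=1200, n_c=100_000, stolen_t=900, n_t=100_000)

# Launch only if the reduction is significant AND guardrail metrics
# (posting volume, creator retention, appeal rates) are flat.
launch = z < -1.96
print(round(z, 2), launch)
```

A fuller plan would pre-register the minimum detectable effect, check guardrails before acting on the primary metric, and prefer a quasi-experimental design (e.g., difference-in-differences over a staged rollout) if user-level randomization leaks across the repost graph.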