Measuring Fake-News Prevalence Under Reviewer Constraints
Context
Policy teams need an overnight view of fake-news prevalence on the platform, but only a small number of human reviewers are available. In parallel, leadership wants a long-term, statistically sound measurement program and a plan to improve detection models. Assume you can use an existing ML model to pre-score content as likely fake or not, and you can access impression counts to estimate user exposure.
Report and reason about both:
- Content prevalence: percentage of posts that are fake.
- Exposure prevalence: percentage of user impressions on fake content (see the sketch below for how the two can diverge).
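To make the distinction between the two metrics concrete, here is a minimal Python sketch with invented post records (every number below is made up for illustration): half of the posts are fake, but because those posts happen to draw most of the impressions, exposure prevalence comes out far higher than content prevalence.

```python
# Toy illustration of content prevalence vs. exposure prevalence.
# All records are invented; in practice `is_fake` would come from human
# review or a calibrated model, and `impressions` from logging.
posts = [
    {"post_id": 1, "is_fake": True,  "impressions": 50_000},
    {"post_id": 2, "is_fake": False, "impressions": 1_000},
    {"post_id": 3, "is_fake": False, "impressions": 2_500},
    {"post_id": 4, "is_fake": True,  "impressions": 200},
]

# Content prevalence: share of posts that are fake.
content_prevalence = sum(p["is_fake"] for p in posts) / len(posts)

# Exposure prevalence: share of impressions that landed on fake posts.
total_impressions = sum(p["impressions"] for p in posts)
fake_impressions = sum(p["impressions"] for p in posts if p["is_fake"])
exposure_prevalence = fake_impressions / total_impressions

print(f"content prevalence:  {content_prevalence:.1%}")   # 50.0%
print(f"exposure prevalence: {exposure_prevalence:.1%}")  # ~93.5%
```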
Questions
- Rapid overnight measurement: With limited reviewers, how would you measure fake-news impact within a single day?
  - Hint: Use ML pre-labels plus targeted human sampling, as in the sketch below.
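One way to act on this hint overnight is to use the model score purely for stratification: bucket posts by score, spend the scarce reviewer budget across the buckets, and weight each bucket's human-labelled fake rate by the bucket's share of all posts. The sketch below assumes each post carries a `score` field in [0, 1] from the existing model; `human_review` is a hypothetical stand-in for the real reviewer workflow.

```python
import random

def human_review(post):
    # Placeholder for the real reviewer decision (True = fake). In production
    # this would be a labeling task sent to the review queue, not a function call.
    return bool(post.get("reviewer_label", False))

def overnight_prevalence_estimate(posts, review_budget, n_buckets=5, seed=0):
    """Stratify posts by model score, spread the limited review budget across
    the score buckets, and combine per-bucket fake rates into a platform-level
    estimate. The model score is used only for stratification, so model errors
    affect the variance of the estimate, not its bias."""
    rng = random.Random(seed)

    # 1. Bucket posts by model score (equal-width buckets for simplicity).
    buckets = [[] for _ in range(n_buckets)]
    for p in posts:
        idx = min(int(p["score"] * n_buckets), n_buckets - 1)
        buckets[idx].append(p)

    # 2. Even split of the reviewer budget; Neyman allocation (more labels
    #    where the fake rate is most uncertain) would be a refinement.
    per_bucket = max(1, review_budget // n_buckets)

    estimate, total_posts = 0.0, len(posts)
    for bucket in buckets:
        if not bucket:
            continue
        sample = rng.sample(bucket, min(len(bucket), per_bucket))
        fake_rate = sum(human_review(p) for p in sample) / len(sample)
        # 3. Weight each bucket's observed fake rate by the bucket's share
        #    of all posts to get a platform-level estimate.
        estimate += fake_rate * len(bucket) / total_posts
    return estimate
```

Because only the human labels enter the estimate, a badly calibrated model widens the confidence interval (some buckets turn out noisier than expected) but does not bias the headline number.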
- Extrapolation from a sample: A random sample of 1,000 posts shows 10% fake. How would you extrapolate to and report platform-level prevalence?
  - Hint: Include confidence intervals and, if the sampling is not simple random, a weighted projection; a sketch follows.
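For the reporting step, a minimal sketch: a Wilson score interval around the 10% sample estimate (somewhat more reliable than the plain normal approximation near 0 or 1), plus a Hájek-style inverse-probability-weighted projection for the case where the sample was not drawn with equal probabilities. The function names and inclusion probabilities are illustrative.

```python
import math

def proportion_ci(successes, n, z=1.96):
    """95% Wilson score interval for a proportion from a simple random sample."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

def weighted_prevalence(labels, inclusion_probs):
    """Weighted projection when the sample is not simple random: weight each
    sampled post's label (1 = fake) by the inverse of its inclusion probability."""
    weights = [1.0 / p for p in inclusion_probs]
    return sum(w * y for w, y in zip(weights, labels)) / sum(weights)

# The question's numbers: 100 fake posts in a random sample of 1,000.
low, high = proportion_ci(100, 1000)
print(f"estimated prevalence: 10.0%, 95% CI: [{low:.1%}, {high:.1%}]")
# -> roughly [8.3%, 12.0%]
```

At p̂ = 10% and n = 1,000 the interval is roughly 8–12%, so the report should say "about 10% of posts, plausibly between 8% and 12%" rather than a bare point estimate.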
- Robust long-term program: With ample resources, design a rigorous approach to quantifying fake-news prevalence.
  - Hint: Stratified sampling, user-exposure weighting, and propagation/cascade analysis; see the sketch below.
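A sketch of how the stratified, exposure-weighted estimates could be assembled, under the assumption that each stratum (for example, surface by country) already has a randomly reviewed sample and that per-stratum post and impression totals are known from logging. The field names (`stratum`, `is_fake`, `impressions`) and the `population_totals` structure are illustrative. Propagation/cascade analysis would layer on top of this by tracking reshare trees of flagged content over time rather than individual posts.

```python
from collections import defaultdict

def stratified_prevalence(reviewed, population_totals):
    """Combine human labels from a stratified sample into platform estimates.

    reviewed: list of dicts with 'stratum', 'is_fake', 'impressions' for each
        human-reviewed post.
    population_totals: {stratum: {'posts': N, 'impressions': M}} giving the
        full-population size of each stratum.
    Returns (content_prevalence, exposure_prevalence).
    """
    by_stratum = defaultdict(list)
    for r in reviewed:
        by_stratum[r["stratum"]].append(r)

    total_posts = sum(t["posts"] for t in population_totals.values())
    total_impr = sum(t["impressions"] for t in population_totals.values())

    content_est, exposure_est = 0.0, 0.0
    for stratum, rows in by_stratum.items():
        # Within-stratum fake rate from the human labels.
        fake_rate = sum(r["is_fake"] for r in rows) / len(rows)
        # Ratio estimate of the within-stratum share of impressions on fakes.
        sampled_impr = sum(r["impressions"] for r in rows)
        fake_impr_rate = (
            sum(r["impressions"] for r in rows if r["is_fake"]) / sampled_impr
            if sampled_impr else 0.0
        )
        totals = population_totals[stratum]
        # Weight each stratum by its share of posts (content prevalence)
        # and by its share of impressions (exposure prevalence).
        content_est += fake_rate * totals["posts"] / total_posts
        exposure_est += fake_impr_rate * totals["impressions"] / total_impr
    return content_est, exposure_est
```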
- Model iteration: Your detection model misses fake content. How would you iterate on and improve it?
  - Hint: Hard-negative mining, active learning, and ensemble models; a sketch follows.
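A minimal sketch of one uncertainty-sampling round, assuming a scikit-learn-style classifier with `fit` and `predict_proba`; `request_labels` is a hypothetical stub for the reviewer queue. Hard-negative mining slots in at the same point: fakes the model scored low but that reviewers or user reports later caught get added to the labeled set alongside the uncertain batch. An ensemble can be layered on by averaging the scores of several such models before computing uncertainty.

```python
import numpy as np

def request_labels(X):
    # Hypothetical stub for the human-review queue; returns 0/1 labels for the
    # given rows so the sketch is self-contained.
    return np.zeros(len(X), dtype=int)

def active_learning_round(model, X_labeled, y_labeled, X_pool, budget=200):
    """One round of uncertainty sampling: score the unlabeled pool, send the
    posts the model is least certain about to human review, fold the new
    labels in, and refit."""
    probs = model.predict_proba(X_pool)[:, 1]

    # Posts near the decision boundary are the hard examples; missed fakes
    # surfaced by reviewers or user reports can be appended to this batch
    # as well (hard-negative / hard-example mining).
    uncertainty = -np.abs(probs - 0.5)
    picked = np.argsort(uncertainty)[-budget:]

    new_labels = request_labels(X_pool[picked])
    X_labeled = np.vstack([X_labeled, X_pool[picked]])
    y_labeled = np.concatenate([y_labeled, new_labels])
    X_pool = np.delete(X_pool, picked, axis=0)

    model.fit(X_labeled, y_labeled)
    return model, X_labeled, y_labeled, X_pool
```

Running this loop repeatedly concentrates scarce reviewer labels on the examples the current model finds hardest, which is where the missed fake content tends to live.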