Estimate population singletons from a 10% log

Q: Estimate population singletons from a 10% log

This is a Statistics & Math interview question from Google for Data Scientist roles.

Question

A daily search log has one row per query string. You draw a 10% simple random sample of rows without replacement. Define a “unique query” (singleton) as a query appearing exactly once in the full day’s log.

a) Explain why estimating the number of singletons by counting singletons in the 10% sample and multiplying by 10 is biased; determine the bias direction and give intuition.

b) Derive a better estimator using a frequency‑of‑frequencies model: relate sampled counts f_k to population counts F_k under binomial thinning, and propose a Poisson/negative‑binomial mixture or Good–Turing/Chao‑type estimator for F_1.

c) Outline how you would compute standard errors (delta method, bootstrap) and diagnose model misspecification under heavy‑tailed query frequencies.

d) Describe a simulation plan to compare estimators across realistic traffic distributions.
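As a starting point for parts (a) and (d), the bias direction can be checked numerically. If each row is kept independently with probability p (binomial thinning, used here as an approximation to sampling without replacement), a query with k occurrences in the full log shows up exactly once in the sample with probability k·p·(1−p)^(k−1). The naive estimator f_1/p therefore has expectation Σ_k F_k·k·(1−p)^(k−1) = F_1 + 1.8·F_2 + 2.43·F_3 + … for p = 0.1, which is at least F_1: repeated queries that happen to be sampled once are miscounted as singletons, so the naive estimate is biased upward. The sketch below is illustrative only. The Zipf exponent, the frequency cap, and the names `singleton_count` and `simulate` are assumptions made for the example, not part of the original question.

```python
import numpy as np


def singleton_count(rows, n_queries):
    """Count query ids that occur exactly once in `rows`."""
    counts = np.bincount(rows, minlength=n_queries)
    return int(np.sum(counts == 1))


def simulate(n_queries=20000, p=0.1, n_reps=50, seed=0):
    """Compare the naive f_1 / p estimator with the true F_1.

    Returns (true F_1, mean naive estimate over replications,
    expected naive estimate under binomial thinning).
    """
    rng = np.random.default_rng(seed)

    # Assumed heavy-tailed population: Zipf(2.0) frequencies, capped at 1000
    # so that the synthetic log stays small. Both numbers are arbitrary.
    freqs = np.minimum(rng.zipf(2.0, size=n_queries), 1000)

    # Full log: one row per query occurrence, stored as the query id.
    log = np.repeat(np.arange(n_queries), freqs)
    true_f1 = int(np.sum(freqs == 1))

    n_sample = int(round(p * log.size))
    naive = []
    for _ in range(n_reps):
        # Simple random sample of rows without replacement.
        sample = rng.choice(log, size=n_sample, replace=False)
        naive.append(singleton_count(sample, n_queries) / p)

    # E[f_1] / p under thinning: sum over queries of k * (1 - p)^(k - 1).
    expected_naive = float(np.sum(freqs * (1.0 - p) ** (freqs - 1)))
    return true_f1, float(np.mean(naive)), expected_naive


if __name__ == "__main__":
    true_f1, naive_mean, expected_naive = simulate()
    print("true F_1:              ", true_f1)
    print("mean naive estimate:   ", round(naive_mean))
    print("thinning expectation:  ", round(expected_naive))
```

Under these assumptions the mean naive estimate should track the thinning expectation and sit well above the true F_1. The same harness could be extended for part (d) by swapping in other frequency distributions and other estimators of F_1.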
