This question evaluates mastery of statistical estimation and sampling theory, including recognition of bias from subsampling, frequency‑of‑frequencies modeling for rare events, uncertainty quantification under heavy‑tailed counts, and design of simulation studies.
A daily search log has one row per query event, and each row records the query string. You draw a 10% simple random sample of rows without replacement. Define a “unique query” (singleton) as a query string that appears exactly once in the full day’s log.

a) Explain why estimating the number of singletons by counting the singletons in the 10% sample and multiplying by 10 is biased; determine the direction of the bias and give the intuition.
b) Derive a better estimator using a frequency‑of‑frequencies model: relate the sampled counts f_k (the number of queries appearing exactly k times in the sample) to the population counts F_k (the number appearing exactly k times in the full log) under binomial thinning, and propose a Poisson or negative‑binomial mixture estimator, or a Good–Turing or Chao‑type estimator, for F_1.
c) Outline how you would compute standard errors (delta method, bootstrap) and diagnose model misspecification under heavy‑tailed query frequencies.
d) Describe a simulation plan for comparing the estimators across realistic traffic distributions.
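
A minimal simulation sketch for parts a) and d), not a reference solution. It assumes a hypothetical traffic model (a capped discrete Pareto count per distinct query) and approximates row sampling without replacement by binomial thinning, under which a query with population count k appears exactly once in the sample with probability k·p·(1−p)^(k−1). The function names `simulate_day`, `freq_of_freq`, and `run`, and the parameters `n_distinct`, `alpha`, and `max_count`, are illustrative choices rather than anything specified in the question.

```python
import random
from collections import Counter


def freq_of_freq(rows):
    """Map k -> number of distinct queries that appear exactly k times in rows."""
    return Counter(Counter(rows).values())


def simulate_day(n_distinct=20000, alpha=1.2, max_count=1000, seed=0):
    """Build a synthetic log: one row (a query id) per query event.

    Assumption: per-query counts follow a discrete Pareto law,
    k = floor(U ** (-1 / alpha)), capped at max_count.
    """
    rng = random.Random(seed)
    rows = []
    for qid in range(n_distinct):
        u = 1.0 - rng.random()  # lies in (0, 1], so the power below is finite
        k = min(max_count, int(u ** (-1.0 / alpha)))
        rows.extend([qid] * k)
    return rows


def run(p=0.10, seed=0):
    """Compare the naive scale-up f_1 / p with the true singleton count F_1."""
    rows = simulate_day(seed=seed)
    rng = random.Random(seed + 1)
    # Simple random sample of rows without replacement.
    sample = rng.sample(rows, round(p * len(rows)))

    F = freq_of_freq(rows)    # population frequency of frequencies
    f = freq_of_freq(sample)  # sampled frequency of frequencies

    naive = f[1] / p
    # Binomial-thinning approximation:
    # E[f_1] ~= sum_k F_k * k * p * (1 - p)**(k - 1), so dividing by p gives
    # the value the naive estimator targets instead of F_1.
    expected_naive = sum(F_k * k * (1 - p) ** (k - 1) for k, F_k in F.items())
    return {"true_F1": F[1], "naive": naive, "expected_naive": expected_naive}


result = run()
print(result)
```

Under these assumptions the naive estimate f_1/p should land close to Σ_k F_k·k·(1−p)^(k−1), which is at least F_1 because every term with k ≥ 2 is nonnegative. The sketch prints the three quantities so the size of the gap can be inspected for other values of `alpha` or `p`.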