Model overdispersed counts; estimate treatment lift
Company: TikTok
Role: Data Scientist
Category: Statistics & Math
Difficulty: Medium
Interview Round: Onsite
Weekly posts per creator are overdispersed and zero‑inflated. In a creator‑level randomized test of a nudge:
- Control: n_c=40,000 creators, total posts=72,000 (mean=1.8)
- Treatment: n_t=40,000 creators, total posts=75,600 (mean=1.89)
- Historical control variance per creator: s_c^2 ≈ 6.5, well above the control mean of 1.8 (suggesting overdispersion).
Answer the following:
1) Choose an appropriate model (e.g., Negative Binomial with log link). Using Var(Y) = μ + μ^2/k, estimate k from the control statistics and compute the estimated log rate ratio, its standard error, and a 95% CI for the treatment lift.
2) If you instead used a Poisson model, quantify how much it would underestimate the SE relative to the NB model, and discuss when that would inflate Type I error.
3) Outline a cluster‑robust approach if randomization had been by geo (state/clusters), and a nonparametric bootstrap you’d trust here. Be explicit about the resampling unit and how you’d construct the CI for the rate ratio.
4) Given meaningful heterogeneity by creator tenure, propose a pre‑specified analysis (e.g., stratified NB or interaction terms) and explain how you’d correct for multiple comparisons across 10 geos (e.g., BH‑FDR).
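For parts 1 and 2, the arithmetic can be sketched directly from the summary statistics. This is a minimal sketch, not a fitted model: it assumes a method-of-moments estimate of k from the control arm, the same k in both arms, and a delta-method standard error on the log scale. The variable names are illustrative.

```python
import math

# Figures from the problem statement.
n_c, n_t = 40_000, 40_000
mu_c, mu_t = 72_000 / n_c, 75_600 / n_t  # 1.80 and 1.89 posts per creator
var_c = 6.5                              # historical control variance

# Method of moments: Var(Y) = mu + mu^2 / k  =>  k = mu^2 / (Var - mu).
k = mu_c**2 / (var_c - mu_c)

# Log rate ratio and its delta-method SE.
# Assumption: both arms share the dispersion k estimated from control.
log_rr = math.log(mu_t / mu_c)
se_nb = math.sqrt((1 / mu_c + 1 / k) / n_c + (1 / mu_t + 1 / k) / n_t)

z = 1.959964  # two-sided 95% normal quantile
ci_rr = (math.exp(log_rr - z * se_nb), math.exp(log_rr + z * se_nb))

# Poisson drops the 1/k term, so its SE is too small when counts are overdispersed.
se_pois = math.sqrt(1 / (mu_c * n_c) + 1 / (mu_t * n_t))
se_ratio = se_pois / se_nb

# Actual two-sided false-positive rate if the Poisson SE is used with a
# nominal 5% cutoff while the NB variance is the true one.
type1 = math.erfc(z * se_ratio / math.sqrt(2))

print(f"k = {k:.3f}, log RR = {log_rr:.4f}, NB SE = {se_nb:.5f}")
print(f"95% CI for RR = ({ci_rr[0]:.3f}, {ci_rr[1]:.3f})")
print(f"Poisson SE = {se_pois:.5f}, Poisson/NB SE ratio = {se_ratio:.2f}")
print(f"Actual Type I error at nominal 5% = {type1:.2f}")
```

Under these assumptions k ≈ 0.69, the log rate ratio is ≈ 0.049 with SE ≈ 0.010, and the 95% CI for the rate ratio is roughly (1.03, 1.07), i.e. a lift of about 3% to 7%. The Poisson SE is about 52% of the NB SE, so a test at a nominal 5% level would reject a true null roughly 31% of the time.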
Quick Answer: This question evaluates modeling and inference for overdispersed, zero‑inflated count data, including estimation of treatment lift (rate ratios), dispersion assessment, standard error quantification, cluster-robust inference, bootstrap resampling, and multiple-comparison correction.
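For parts 3 and 4, one possible sketch is a percentile bootstrap that resamples whole units within each arm, plus a Benjamini-Hochberg step-up over the per-geo p-values. The function names `bootstrap_rr_ci` and `benjamini_hochberg` are hypothetical, the creator-level data are simulated (the question gives only totals), and the ten p-values are made up for illustration. The resampling unit is whatever was randomized: a creator here, or a geo cluster if randomization had been by geo.

```python
import numpy as np

def bootstrap_rr_ci(posts_c, size_c, posts_t, size_t, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the rate ratio (treatment / control).

    Each array element is one resampling unit: a creator (size 1) under
    creator-level randomization, or a geo cluster (size = number of creators
    in it) under geo-level randomization. Units are resampled with
    replacement within each arm.
    """
    rng = np.random.default_rng(seed)
    posts_c, size_c = np.asarray(posts_c, float), np.asarray(size_c, float)
    posts_t, size_t = np.asarray(posts_t, float), np.asarray(size_t, float)
    log_rr = np.empty(n_boot)
    for b in range(n_boot):
        ic = rng.integers(0, posts_c.size, posts_c.size)
        it = rng.integers(0, posts_t.size, posts_t.size)
        rate_c = posts_c[ic].sum() / size_c[ic].sum()
        rate_t = posts_t[it].sum() / size_t[it].sum()
        log_rr[b] = np.log(rate_t / rate_c)
    # Take percentiles on the log scale, then map back to the ratio scale.
    lo, hi = np.quantile(log_rr, [alpha / 2, 1 - alpha / 2])
    return float(np.exp(lo)), float(np.exp(hi))

def benjamini_hochberg(pvals, q=0.05):
    """Boolean mask of hypotheses rejected by the BH step-up rule at FDR level q."""
    p = np.asarray(pvals, float)
    m = p.size
    order = np.argsort(p)
    passed = p[order] <= q * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        last = np.nonzero(passed)[0].max()  # largest rank meeting its threshold
        reject[order[: last + 1]] = True
    return reject

# Simulated creator-level counts (assumption: NB with k = 0.689, the value
# implied by the control mean 1.8 and variance 6.5; 4,000 creators per arm).
rng = np.random.default_rng(42)
k, n = 0.689, 4000
y_c = rng.negative_binomial(k, k / (k + 1.80), size=n)
y_t = rng.negative_binomial(k, k / (k + 1.89), size=n)
ones = np.ones(n)
lo, hi = bootstrap_rr_ci(y_c, ones, y_t, ones)
print(f"bootstrap 95% CI for RR: ({lo:.3f}, {hi:.3f})")

# Hypothetical per-geo p-values for the 10 geos.
geo_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(geo_p)
print("geos rejected at FDR 5%:", int(rejected.sum()))
```

With these made-up p-values, BH at q = 0.05 rejects only the two smallest, even though five fall below an unadjusted 0.05. For geo randomization, the same bootstrap function applies with one row per cluster, though with few clusters a percentile interval may be unreliable and is best treated as a rough check.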