This question evaluates a candidate's competency in statistical modeling of count data, resampling and bootstrap inference, summary-statistic interpretation, and numeric aggregation/stability considerations within the Statistics & Math domain for a data scientist role.

Consider the metric comments_per_DAU (number of comments a daily active user makes in a day).
a) Shape: Describe and justify the expected distribution of comments_per_DAU across users on a given day (e.g., zero-inflation, skew/heavy tail). Is the variable discrete or continuous? What are reasonable parametric families to consider (e.g., Poisson vs Negative Binomial), and why might Poisson be inadequate?
b) Bootstrapping: You repeatedly resample n=10,000 users with replacement from that day’s user list and compute the sample mean, repeating this 100,000 times. Describe the bootstrap distribution’s shape and center. Under what conditions will it be approximately normal, and when might it remain skewed? What is the relationship between its standard deviation and the population variance σ²?
c) Scaling n: If you increase n from 10,000 to 20,000, how (quantitatively) does the width of the bootstrap distribution of the mean change? State the factor and the intuition.
d) Summary stats: For this metric, compare mean, median, mode, and p95. Which is most stable, which is most decision-relevant, and why might the mode be 0? How do you interpret and compute p95 for a discrete count variable (e.g., tie handling, integer vs real thresholds)?
e) Data types and aggregation: The per-user value is an integer, but the mean across users is a real number. Explain pitfalls from storing as integer vs float at different aggregation levels (e.g., truncation, rounding bias, overflow) and how you’d ensure numeric stability when computing large-day aggregates.
f) Estimation: Suppose the per-user counts are overdispersed (Var > Mean). Write the approximate standard error of the sample mean and discuss when you’d prefer robust estimators (trimmed mean, Winsorization) or variance reduction techniques (CUPED with a prior-day covariate).
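
To make the setup in parts b), c), and f) concrete, here is a minimal simulation sketch rather than a reference answer. Everything about the population is an assumption made for illustration: the 30% commenter rate, the negative binomial parameters, and the population size `N_USERS` are hypothetical. The number of bootstrap repetitions is cut from 100,000 to 2,000 so the sketch runs quickly. It compares the spread of the bootstrap means with σ/√n and checks how that spread changes when n doubles.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population (an assumption, not given in the question):
# about 30% of DAU are potential commenters, and their counts follow a
# negative binomial with a heavy right tail. Everyone else posts 0 comments.
N_USERS = 100_000
is_commenter = rng.random(N_USERS) < 0.3
counts = np.where(is_commenter, rng.negative_binomial(0.5, 0.1, size=N_USERS), 0)


def bootstrap_means(values, n, reps, rng):
    """Return the means of `reps` resamples of size `n`, drawn with replacement."""
    return np.array(
        [rng.choice(values, size=n, replace=True).mean() for _ in range(reps)]
    )


REPS = 2_000  # far fewer than the 100,000 in part b), to keep the sketch fast
means_10k = bootstrap_means(counts, 10_000, REPS, rng)
means_20k = bootstrap_means(counts, 20_000, REPS, rng)

# Analytic standard error of the mean for n = 10,000: sigma / sqrt(n),
# where sigma is the SD of the simulated day's per-user counts.
sigma = counts.std()
se_analytic_10k = sigma / np.sqrt(10_000)

# Ratio of bootstrap spreads when n doubles; theory suggests about sqrt(2).
width_ratio = means_10k.std(ddof=1) / means_20k.std(ddof=1)

print("mean:", counts.mean(), "variance:", counts.var())
print("bootstrap SD (n=10k):", means_10k.std(ddof=1), "analytic SE:", se_analytic_10k)
print("width ratio 10k/20k:", width_ratio, "sqrt(2):", np.sqrt(2))
```

Under these assumptions the bootstrap means should center on the mean of the simulated day, and their standard deviation should track σ/√n, so doubling n narrows the distribution by a factor of about 1/√2. A heavier tail or a much smaller n would be the cases to probe for lingering skew.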