Count Data Modeling
Asked of: Data Scientist
Last updated

What's being tested
Ability to choose and justify appropriate regression models for nonnegative integer outcomes: diagnosing dispersion/zeros, selecting Poisson/NB/zero-inflated/hurdle, and interpreting rate parameters and offsets.
Core knowledge
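- Poisson GLM: log link; assumes Var(Y | x) = E(Y | x) = μ (equidispersion); coefficients are log rate ratios, so exp(β) is the multiplicative change in the expected count per unit increase in a covariate.
- Negative Binomial (NB): adds a dispersion parameter α ≥ 0 to model overdispersion; in the common NB2 form Var = μ + αμ^2, which reduces to the Poisson variance as α → 0.
- Quasi-Poisson: keeps the Poisson mean structure and point estimates but scales the variance to φμ, which inflates SEs by √φ; it has no full likelihood, so AIC and likelihood-ratio comparisons do not apply.
- Zero-inflated vs hurdle: ZIP/ZINB mix a binary process that generates structural zeros with a count process that can also produce zeros; hurdle models use a binary model for zero vs positive and a zero-truncated count model for the positives, so every zero comes from a single process.
- Diagnostics: compare the mean and variance of the counts, check residual deviance/df and Pearson chi-square/df (values well above 1 suggest overdispersion), and compare models with likelihood-ratio tests (nested), AIC, or Vuong tests (non-nested).
- Offsets: include log(exposure) as an offset (a term whose coefficient is fixed at 1) to model rates, e.g. events per unit time or per user.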
- Correlated counts: use mixed-effects Poisson/NB or GEE for clustered/panel data.
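The variance assumptions, the offset, and the Pearson dispersion statistic from the list above can be made concrete with a small numeric sketch. This uses only the standard library; the function names are hypothetical helpers chosen for illustration, not part of any package.

```python
import math


def poisson_var(mu):
    # Poisson: variance equals the mean (equidispersion).
    return mu


def quasi_poisson_var(mu, phi):
    # Quasi-Poisson: variance is the mean scaled by a dispersion factor phi.
    return phi * mu


def nb2_var(mu, alpha):
    # NB2: variance grows quadratically in the mean; alpha = 0 recovers Poisson.
    return mu + alpha * mu ** 2


def expected_count(beta0, beta1, x, exposure):
    # log(mu) = beta0 + beta1 * x + log(exposure).
    # The offset log(exposure) enters with its coefficient fixed at 1,
    # so the expected count scales proportionally with exposure.
    return math.exp(beta0 + beta1 * x + math.log(exposure))


def rate_ratio(beta1):
    # exp(beta1) is the multiplicative change in the rate per unit of x.
    return math.exp(beta1)


def pearson_dispersion(y, mu, n_params):
    # Pearson chi-square under the Poisson variance function, divided by
    # residual degrees of freedom. Values well above 1 suggest overdispersion.
    chi2 = sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu))
    return chi2 / (len(y) - n_params)
```

For instance, `pearson_dispersion([0, 0, 0, 8], [2, 2, 2, 2], n_params=1)` evaluates to 8.0, far above the value of about 1 expected under a well-specified Poisson model.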
Worked example — "Modeling overdispersed count data with many zeros"
First, frame the goal: whether the model is for prediction or causal inference dictates how much complexity and interpretability you need. Run exploratory checks: compute the mean and variance of the counts, plot a histogram, and note the proportion of zeros. Fit a Poisson GLM with the relevant covariates and a log(exposure) offset, then examine residual deviance/df and Pearson chi-square/df; values well above 1 indicate overdispersion. If overdispersion is present, fit an NB model. If the NB model still predicts fewer zeros than observed, compare it against ZINB or hurdle alternatives using AIC and a Vuong test, and keep the simpler model unless the zero-generating mechanism calls for two-part modeling.
A common pitfall
The tempting shortcut is to log-transform the counts and run OLS. Because log(0) is undefined, this forces an ad hoc fix such as log(y + 1), and it models E[log y] rather than log E[y], so the coefficients are biased as estimates of rate effects and the heteroskedasticity of count data (variance growing with the mean) is ignored. Another frequent error is automatically choosing a zero-inflated model whenever zeros are common: a high share of zeros is often explained by covariates and overdispersion alone, so an unnecessary two-part model adds complexity and identification problems.
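A small simulation can illustrate the first pitfall. The sketch below assumes NumPy only and uses made-up parameters (true slope 0.8, low mean so zeros are frequent). It compares the slope from OLS on log(y + 1) with the slope from a Poisson regression fitted by hand-rolled Newton iterations. The function names are hypothetical, and the Newton loop is a teaching device, not a substitute for a tested GLM routine.

```python
import numpy as np


def simulate(n=5000, b0=-0.5, b1=0.8, seed=0):
    # Poisson counts with log(mu) = b0 + b1 * x; many zeros because mu is small.
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = rng.poisson(np.exp(b0 + b1 * x))
    return x, y


def log_ols_slope(x, y):
    # The shortcut: OLS of log(y + 1) on x.
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, np.log(y + 1.0), rcond=None)
    return coef[1]


def poisson_slope(x, y, n_iter=25):
    # Poisson regression with log link, fitted by Newton's method.
    X = np.column_stack([np.ones_like(x), x])
    beta = np.array([np.log(y.mean()), 0.0])  # start at the intercept-only fit
    for _ in range(n_iter):
        mu = np.exp(X @ beta)
        grad = X.T @ (y - mu)             # score vector
        hess = X.T @ (X * mu[:, None])    # Fisher information
        beta = beta + np.linalg.solve(hess, grad)
    return beta[1]


if __name__ == "__main__":
    x, y = simulate()
    print("true slope:        0.8")
    print(f"log(y + 1) OLS:    {log_ols_slope(x, y):.3f}")
    print(f"Poisson regression: {poisson_slope(x, y):.3f}")
```

Under these assumptions the log(y + 1) slope comes out well below 0.8, while the Poisson fit lands close to it, which is the attenuation the paragraph above warns about.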
Further reading
- Cameron, A.C. & Trivedi, P.K., "Regression Analysis of Count Data" — comprehensive theory and examples.
- Hilbe, J.M., "Negative Binomial Regression" — practical guidance and diagnostics.