Modeling 7-day Retention with LPM and Logistic Regression
Context
You have user-level data with a binary outcome retained_7d (1 if the user is active 7 days after signup; 0 otherwise). Covariates include:
- treated: 1 if the user was exposed to a new preloading feature; 0 otherwise
- watch_time_day1: minutes watched on day 1 (continuous)
- new_user: 1 if the user is in the new-user segment; 0 otherwise
- Fixed effects (FE) for country and signup_date
The model includes an interaction term treated × new_user.
Answer the following:
(a) Model specification and interpretation
- Write the Linear Probability Model (LPM) and the logistic regression specification (with fixed effects and the interaction).
- Interpret the coefficients on treated and on treated × new_user for both models.
- For the logistic model, translate a coefficient into an odds ratio and then into an approximate marginal effect at the mean.
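As a worked illustration of the last bullet, the sketch below converts a logit coefficient into an odds ratio and then into an approximate marginal effect at the mean. The coefficient (0.20) and the mean retention rate (0.40) are hypothetical values chosen for the arithmetic, not estimates from any data, and the function names are illustrative only.

```python
import math

def odds_ratio(beta):
    """Odds ratio implied by a logit coefficient: exp(beta)."""
    return math.exp(beta)

def marginal_effect_at_mean(beta, p_bar):
    """Derivative-based approximation: beta * p_bar * (1 - p_bar)."""
    return beta * p_bar * (1.0 - p_bar)

def discrete_change(beta, p_bar):
    """Exact change in probability when a binary regressor switches 0 -> 1,
    starting from a baseline probability p_bar."""
    base_logit = math.log(p_bar / (1.0 - p_bar))
    return 1.0 / (1.0 + math.exp(-(base_logit + beta))) - p_bar

beta_treated = 0.20   # hypothetical logit coefficient on treated
p_bar = 0.40          # hypothetical mean 7-day retention rate

or_treated = odds_ratio(beta_treated)
me_approx = marginal_effect_at_mean(beta_treated, p_bar)
me_exact = discrete_change(beta_treated, p_bar)

print(f"odds ratio: {or_treated:.4f}")
print(f"approx. marginal effect at mean: {me_approx:.4f}")
print(f"discrete 0->1 change at baseline: {me_exact:.4f}")
```

Under these assumed values the derivative approximation and the discrete change are close (roughly 4.8 to 4.9 percentage points). With the interaction in the model, the coefficient on treated applies to users with new_user = 0; for new users the relevant log-odds shift is the sum of the treated and treated × new_user coefficients.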
(b) Assumptions, diagnostics, and standard errors
Discuss assumptions and diagnostics you would check:
- Heteroskedasticity
- Separation (logit)
- Multicollinearity (e.g., VIF)
- Misspecification (e.g., link test, functional form)
- Calibration (e.g., reliability plots)
What standard errors would you report and why (HC-robust vs clustered)? Justify your cluster choice.
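To make the HC-robust versus clustered contrast concrete, here is a minimal sketch of both sandwich variances for the simplest possible case: a no-intercept regression on a single regressor. The data, the choice of country as the cluster, and the function name are all hypothetical, and no small-sample correction is applied; in practice a regression package would handle the general multi-regressor case.

```python
import math
from collections import defaultdict

def slope_and_ses(x, y, clusters):
    """No-intercept OLS of y on one regressor x.
    Returns (beta, HC0 standard error, cluster-robust standard error)."""
    sxx = sum(xi * xi for xi in x)
    beta = sum(xi * yi for xi, yi in zip(x, y)) / sxx
    # Score contribution of each observation: x_i * residual_i
    scores = [xi * (yi - beta * xi) for xi, yi in zip(x, y)]
    # HC0: treat every observation's score as independent
    hc0_var = sum(s * s for s in scores) / sxx ** 2
    # Cluster-robust: sum scores within each cluster before squaring
    by_cluster = defaultdict(float)
    for g, s in zip(clusters, scores):
        by_cluster[g] += s
    cluster_var = sum(v * v for v in by_cluster.values()) / sxx ** 2
    return beta, math.sqrt(hc0_var), math.sqrt(cluster_var)

# Hypothetical toy data: regressing on a constant, so beta is the mean retention rate
x = [1.0] * 6
y = [1, 1, 1, 0, 0, 0]
clusters = ["US", "US", "US", "BR", "BR", "BR"]  # assumed cluster variable

beta, se_hc0, se_cluster = slope_and_ses(x, y, clusters)
print(f"beta={beta:.3f}  HC0 SE={se_hc0:.4f}  cluster SE={se_cluster:.4f}")
```

In this toy example outcomes are perfectly correlated within cluster, so the clustered standard error is noticeably larger than the HC0 one. Two clusters are far too few for reliable cluster-robust inference; the example only shows the mechanics.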
(c) Endogeneity of watch_time_day1
Suppose watch_time_day1 is endogenous (e.g., driven by unobserved preference). Propose two remedies and their assumptions:
- Control function/IV approach
- Panel fixed effects using within-user variation
What instruments or proxies could be plausible here?
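One way to write down the control function remedy is the two-step sketch below. The instrument $z_i$ is purely hypothetical (the question leaves the choice open), $X_i$ collects the other covariates, and $\alpha_c$, $\delta_t$ denote the country and signup_date fixed effects; the coefficient symbols are notation introduced here for illustration.

```latex
% Step 1: first stage, regress the endogenous variable on the instrument and controls
\text{watch\_time\_day1}_i = \pi_0 + \pi_1 z_i + X_i'\pi_2 + \alpha_c + \delta_t + v_i

% Step 2: include the first-stage residual \hat{v}_i in the outcome model
\Pr(\text{retained\_7d}_i = 1) = \Lambda\big(\beta_0 + \beta_1\,\text{treated}_i
  + \beta_2\,\text{new\_user}_i + \beta_3\,\text{treated}_i \times \text{new\_user}_i
  + \beta_4\,\text{watch\_time\_day1}_i + \rho\,\hat{v}_i + \alpha_c + \delta_t\big)
```

The approach requires that $z_i$ be relevant ($\pi_1 \neq 0$) and excluded from the outcome equation except through watch_time_day1. A test of $\rho = 0$ serves as a check for endogeneity, and because $\hat{v}_i$ is a generated regressor, second-step standard errors typically need a correction such as a bootstrap over both steps.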
(d) Count outcomes for daily videos watched
You also have count data for daily videos watched. When would Poisson or negative binomial be preferred over OLS? How would you check overdispersion and interpret the exponentiated coefficients?
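A quick first check for overdispersion is to compare the variance of the counts with their mean, since a Poisson model with no covariates implies the two are equal. The sketch below does that and also shows how an exponentiated coefficient reads as an incidence rate ratio. The counts and the coefficient value are hypothetical, and with covariates one would instead look at the Pearson chi-square statistic divided by residual degrees of freedom from the fitted model.

```python
import math
from statistics import mean, variance

def dispersion_ratio(counts):
    """Sample variance divided by sample mean.
    Near 1 is consistent with Poisson; well above 1 suggests overdispersion."""
    return variance(counts) / mean(counts)

def incidence_rate_ratio(beta):
    """exp(beta): multiplicative change in the expected count per one-unit change."""
    return math.exp(beta)

# Hypothetical daily videos watched for eight users
counts = [0, 1, 1, 2, 3, 5, 8, 12]
ratio = dispersion_ratio(counts)

beta_treated = 0.10  # hypothetical Poisson/negative binomial coefficient on treated
irr = incidence_rate_ratio(beta_treated)
pct_change = 100.0 * (irr - 1.0)

print(f"variance/mean ratio: {ratio:.2f}")
print(f"IRR: {irr:.4f}  (about {pct_change:.1f}% more videos per day)")
```

Here the variance is more than four times the mean, which would point toward a negative binomial model or Poisson with robust standard errors rather than plain Poisson inference.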