Diagnose and interpret regression assumptions

Q: Diagnose and interpret regression assumptions

This question evaluates proficiency in regression diagnostics and model selection for count outcomes. It covers OLS assumption checks, back-transformation and coefficient interpretation under a log transformation, heteroskedasticity testing and robust standard errors, multicollinearity (VIF), autocorrelation, and the choice between OLS and Poisson or Negative Binomial GLMs. It falls under Statistics & Math for Data Scientist roles and tests both conceptual understanding and practical application of statistical modeling. Interviewers commonly ask such questions to assess a candidate's ability to validate model assumptions, interpret transformed and categorical effects, and justify modeling choices from diagnostic evidence.

Question

OLS for Signups with Diagnostics and Alternatives

You are given a cleaned dataset with the following columns:

signups: non-negative integer count target
spend: numeric
clicks: integer
cpc: numeric (cost per click)
region: categorical

Task: Using Python and statsmodels, draw a 100,000-row sample without replacement and fit an OLS model to predict signups using spend, clicks, cpc, and region dummies. Then:

If you use log1p(signups) as the dependent variable, show how to back-transform predictions to the original scale and interpret the spend coefficient.
Check model assumptions using:
- Residuals vs fitted plot
- Q–Q plot
- Breusch–Pagan test for heteroskedasticity
- Durbin–Watson test for autocorrelation
- VIFs to assess multicollinearity
If heteroskedasticity is present, refit with HC3 robust standard errors and comment on how p-values and confidence intervals change.
Report adjusted R², the 95% CI for the spend coefficient, and interpret the region dummy coefficients relative to the baseline.
Explain when a Poisson or Negative Binomial GLM would be preferable for signups and how to test for overdispersion.

Provide minimal code necessary to reproduce these diagnostics.