Explain Linear Regression Assumptions
Company: Databricks
Role: Data Scientist
Category: Statistics & Math
Difficulty: Hard
Interview Round: Technical Screen
Suppose you are using ordinary least squares (OLS) linear regression to model a continuous business outcome, such as weekly user spend, from several features including prior activity, marketing exposure, device type, and region.
Explain the core assumptions behind linear regression and discuss which assumptions matter for:
- unbiased coefficient estimates,
- valid confidence intervals and hypothesis tests, and
- strong predictive performance.
Specifically address the following:
1. What assumptions are typically made in the model `y = X beta + epsilon`? (A formal statement is sketched after the Quick Answer.)
2. Do the predictors `X` need to be normally distributed?
3. Does the target variable `y` need to be normally distributed?
4. Do the residuals need to be normally distributed, and when does that matter?
5. How would you diagnose problems such as nonlinearity, heteroskedasticity, multicollinearity, autocorrelation, outliers, and omitted-variable bias? (See the diagnostic sketch after the Quick Answer.)
6. If these assumptions are violated, what practical remedies would you consider, such as transformations, interaction terms, splines, robust standard errors, weighted least squares, regularization, generalized linear models, or nonlinear models? (See the remedies sketch after the Quick Answer.)
7. How do the assumptions differ when the goal is causal interpretation versus pure prediction?
Quick Answer: OLS assumes the model is linear in its parameters, the errors have zero mean conditional on the predictors (exogeneity), the errors are homoskedastic and uncorrelated, and the predictors are not perfectly collinear. Linearity and exogeneity are what deliver unbiased coefficients; homoskedasticity, no autocorrelation, and (in small samples) normally distributed errors are what make confidence intervals and t- and F-tests valid; prediction mainly requires that the fitted relationship generalize out of sample. Neither the predictors nor the raw target needs to be normally distributed; only the residuals' distribution matters for inference, and even that fades in large samples by the central limit theorem. Violations are diagnosed with residual plots, Breusch-Pagan tests, variance inflation factors, Durbin-Watson statistics, and influence measures, and addressed with transformations, splines, robust standard errors, weighted least squares, regularization, or generalized linear models. Causal interpretation leans hardest on exogeneity (no omitted variables); pure prediction tolerates biased individual coefficients as long as out-of-sample error stays low.
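For item 1, a sketch of the standard textbook statement of the OLS model and the classical (Gauss-Markov) assumptions; the notation below is conventional rather than taken from the question:

```latex
% OLS model: n observations, p predictors (intercept folded into X)
\[
  y = X\beta + \varepsilon,
  \qquad y \in \mathbb{R}^{n},\;
  X \in \mathbb{R}^{n \times p},\;
  \beta \in \mathbb{R}^{p}.
\]
\begin{itemize}
  \item Linearity: $\mathbb{E}[y \mid X] = X\beta$ (linear in the parameters, not necessarily in raw features).
  \item Exogeneity: $\mathbb{E}[\varepsilon \mid X] = 0$; violated by omitted variables correlated with $X$.
  \item Homoskedasticity: $\operatorname{Var}(\varepsilon_i \mid X) = \sigma^{2}$ for all $i$.
  \item No autocorrelation: $\operatorname{Cov}(\varepsilon_i, \varepsilon_j \mid X) = 0$ for $i \neq j$.
  \item No perfect multicollinearity: $X$ has full column rank.
  \item Normality, $\varepsilon \mid X \sim \mathcal{N}(0, \sigma^{2} I)$: needed only for exact small-sample inference.
\end{itemize}
```

The first two assumptions give unbiasedness; the first five give the Gauss-Markov (BLUE) result; normality is extra and only buys exact finite-sample t- and F-distributions.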
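For item 5, a minimal diagnostic sketch in Python using statsmodels. The synthetic data-generating step, seed, and coefficient values are illustrative assumptions, not part of the question; note that omitted-variable bias cannot be tested from the residuals alone and requires domain reasoning:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.stattools import durbin_watson

# Synthetic data with deliberately heteroskedastic noise (illustrative only).
rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 3))
noise = rng.normal(size=n) * np.exp(X[:, 0])   # error variance depends on X
y = 2.0 + X @ np.array([1.5, -0.5, 0.8]) + noise

Xc = sm.add_constant(X)
fit = sm.OLS(y, Xc).fit()

# Nonlinearity: plot fit.resid against fit.fittedvalues; curvature suggests
# a missing transform, interaction, or spline term.

# Heteroskedasticity: Breusch-Pagan (small p-value => variance depends on X).
_, bp_pvalue, _, _ = het_breuschpagan(fit.resid, Xc)

# Multicollinearity: VIFs on the non-constant columns (rough flag: VIF > 5-10).
vifs = [variance_inflation_factor(Xc, i) for i in range(1, Xc.shape[1])]

# Autocorrelation: Durbin-Watson (~2 means none; mainly for time-ordered data).
dw = durbin_watson(fit.resid)

# Outliers / influence: Cook's distance flags points that move the fit.
cooks_d, _ = fit.get_influence().cooks_distance

print(f"Breusch-Pagan p={bp_pvalue:.4f}  VIFs={np.round(vifs, 2)}  "
      f"DW={dw:.2f}  max Cook's D={cooks_d.max():.3f}")
```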
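For item 6, a sketch of two of the listed remedies applied to heteroskedasticity, again on assumed synthetic data: heteroskedasticity-consistent (HC3) standard errors, which keep the OLS point estimates but fix the inference, and weighted least squares, which applies when the variance structure is roughly known:

```python
import numpy as np
import statsmodels.api as sm

# Synthetic data whose error standard deviation grows with x (illustrative only).
rng = np.random.default_rng(1)
n = 500
x = rng.uniform(1.0, 10.0, size=n)
y = 3.0 + 2.0 * x + rng.normal(size=n) * 0.5 * x
X = sm.add_constant(x)

# Remedy 1: OLS point estimates with robust (HC3) standard errors.
robust_fit = sm.OLS(y, X).fit(cov_type="HC3")

# Remedy 2: weighted least squares. Here Var(eps_i) is proportional to x_i^2,
# so weight by 1/x_i^2; WLS re-estimates the coefficients and is more
# efficient when the assumed weights match the true variance structure.
wls_fit = sm.WLS(y, X, weights=1.0 / x**2).fit()

print("OLS + HC3 standard errors:", np.round(robust_fit.bse, 4))
print("WLS standard errors:      ", np.round(wls_fit.bse, 4))
```

Robust standard errors are the lower-risk default when the variance model is unknown; WLS buys efficiency only if the assumed weights are close to the true variance structure.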