Critique a Linear Regression Workflow
Company: Apple
Role: Data Scientist
Category: Statistics & Math
Difficulty: easy
Interview Round: Technical Screen
You are reviewing another data scientist's approach to modeling website dwell time for users who arrive from Google Search.
- Response variable: `Y`, the number of seconds a user stays on the website after clicking through from Google Search.
- Candidate predictors: four variables, `X1` through `X4` (their exact definitions are not specified and should be clarified).
The other data scientist used the following process:
1. They observed that `Y` appears approximately normally distributed and concluded that ordinary least squares regression is appropriate.
2. They fit all possible combinations of the predictors, including squared terms and pairwise (two-way) interactions.
3. They selected the model with the best fit as the final model.
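For a sense of scale in step 2, the candidate terms can be counted directly. The sketch below assumes that "all possible combinations" means every subset of the 4 main effects, 4 squared terms, and 6 pairwise interactions, with no hierarchy constraint (an interaction may appear without its main effects). The prompt does not spell this out, so the count is illustrative only.

```python
from itertools import combinations

predictors = ["X1", "X2", "X3", "X4"]

# Candidate terms under the assumed reading of step 2.
main_effects = list(predictors)
squared_terms = [f"{p}^2" for p in predictors]
interactions = [f"{a}:{b}" for a, b in combinations(predictors, 2)]

candidate_terms = main_effects + squared_terms + interactions
n_terms = len(candidate_terms)

# Every subset of terms is one candidate model (the empty subset is intercept-only).
n_models = 2 ** n_terms

print(f"{n_terms} candidate terms, {n_models} candidate models")
```

Under these assumptions there are 14 terms and 16,384 candidate models, which is why picking the single best in-sample fit deserves scrutiny.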
Critique this workflow. What clarifying questions would you ask before accepting the analysis, and what would you recommend instead?
In your answer, discuss:
- whether the goal is prediction, inference, or causal estimation,
- which assumptions actually matter for OLS and for valid statistical inference,
- why the marginal normality of `Y` is not one of the Gauss-Markov assumptions,
- how dwell-time data can violate standard linear-model assumptions,
- the risks of exhaustive interaction search and model-selection bias,
- and how you would redesign the modeling and validation process, including model diagnostics, evaluation metrics, cross-validation, and possible alternatives such as transformation, GLMs, regularization, or robust standard errors.
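The model-selection point in the list above can be illustrated with a small simulation. This is a simplified sketch on synthetic data, not the reviewed analysis: the outcome is pure noise, each "model" is a one-predictor regression on an unrelated noise variable, and the sample sizes and number of candidates are arbitrary assumptions. The best in-sample fit is then scored on fresh data.

```python
import random

def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y ~ a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def r_squared(x, y, intercept, slope):
    """R^2 of a fixed line on (x, y); can be negative on data it was not fit to."""
    mean_y = sum(y) / len(y)
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - sse / sst

rng = random.Random(0)
n_train, n_test, n_candidates = 50, 5000, 200  # arbitrary illustrative sizes

# The outcome is pure noise: no candidate predictor is truly related to it.
y_train = [rng.gauss(0, 1) for _ in range(n_train)]
y_test = [rng.gauss(0, 1) for _ in range(n_test)]

best = None
for _ in range(n_candidates):
    x_train = [rng.gauss(0, 1) for _ in range(n_train)]
    x_test = [rng.gauss(0, 1) for _ in range(n_test)]
    intercept, slope = fit_simple_ols(x_train, y_train)
    in_sample = r_squared(x_train, y_train, intercept, slope)
    # Keep whichever candidate looks best on the training data.
    if best is None or in_sample > best[0]:
        holdout = r_squared(x_test, y_test, intercept, slope)
        best = (in_sample, holdout)

best_in_sample_r2, holdout_r2 = best
print(f"best in-sample R^2: {best_in_sample_r2:.3f}")
print(f"holdout R^2 of that same model: {holdout_r2:.3f}")
```

The selected model's in-sample R^2 is positive purely because it was the maximum over many tries, and it should not carry over to the holdout set. The same logic applies to searching thousands of term subsets, which is the case for evaluating the selected model on data that played no role in selection (for example, nested cross-validation or a final holdout).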
Quick Answer: This question evaluates a candidate's understanding of linear regression assumptions, model-selection pitfalls, diagnostics and validation methods, and the distinction between predictive, inferential, and causal goals when modeling skewed outcome data such as website dwell time.
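The point about regression assumptions can be made concrete with a hypothetical example in which the marginal distribution of `Y` is clearly non-normal while the errors are exactly normal. The binary predictor, effect size, and sample size below are invented for illustration, and excess kurtosis is used only as a crude one-number summary (a Q-Q plot of residuals is the more usual diagnostic).

```python
import random

def excess_kurtosis(values):
    """Sample excess kurtosis; approximately 0 for normally distributed data."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3.0

rng = random.Random(1)
n = 2000

# Hypothetical setup: a binary predictor with a large effect and N(0, 1) errors.
x = [rng.randint(0, 1) for _ in range(n)]
y = [10.0 * xi + rng.gauss(0, 1) for xi in x]

# Simple least-squares fit of y on x.
mean_x = sum(x) / n
mean_y = sum(y) / n
slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
intercept = mean_y - slope * mean_x
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

y_kurt = excess_kurtosis(y)
resid_kurt = excess_kurtosis(residuals)
print(f"excess kurtosis of Y (marginal): {y_kurt:.2f}")
print(f"excess kurtosis of residuals:    {resid_kurt:.2f}")
```

Here `Y` is bimodal, so a histogram of `Y` alone would look nothing like a normal curve, yet the linear model is correctly specified. The reverse also holds: a roughly normal-looking `Y` says nothing about linearity, exogeneity, or constant error variance, which are what the Gauss-Markov conditions concern. Normality of the errors, not of `Y`, matters only for exact finite-sample t and F inference.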