How would you critique this regression?
Company: Apple
Role: Data Scientist
Category: Statistics & Math
Difficulty: easy
Interview Round: Technical Screen
##### Question
You are reviewing a modeling workflow built by another data scientist and asked to critique it.
**Business context**
A website receives traffic from Google Search. The response variable is:
- **Y** = the number of seconds a user stays on the website after clicking through from Google Search, measured during that session (i.e. session dwell time).
There are **4 candidate predictor variables**, `X1`–`X4`. Their exact definitions are *not* provided (they could be a mix of numeric and categorical features), so part of the task is to explain what you would clarify before approving the analysis.
The other data scientist used the following workflow to build a **linear regression** model:
1. They observed that **Y appears approximately normally distributed**, and concluded that **ordinary least squares (OLS)** was therefore appropriate.
2. They fit **every possible subset of terms** built from the 4 predictors, including **squared (quadratic) terms** and **all pairwise interaction terms**.
3. They chose the model with the **best in-sample fit** as the final model.
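To give a sense of scale for step 2, the size of the search space can be counted directly. This is a minimal sketch that assumes all four predictors are numeric (squared terms are not meaningful for categorical features) and that any subset of terms is allowed, with no hierarchy constraint; the term labels are illustrative only.

```python
from itertools import combinations

predictors = ["X1", "X2", "X3", "X4"]

# Candidate terms implied by the workflow (assuming numeric predictors):
# main effects, squared terms, and pairwise interactions.
main_effects = list(predictors)
squares = [f"{p}^2" for p in predictors]
interactions = [f"{a}:{b}" for a, b in combinations(predictors, 2)]
terms = main_effects + squares + interactions

n_terms = len(terms)          # 4 + 4 + 6 = 14 candidate terms
n_models = 2 ** n_terms - 1   # every non-empty subset of those terms

print(n_terms, n_models)      # 14 16383
```

With thousands of candidate models compared on the same data, the best in-sample fit is likely to be partly a product of chance, and in-sample R² can never decrease when terms are added to a nested model.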
**Question**
Critique this workflow. What clarifying questions would you ask before accepting the analysis, and what would you recommend instead? In your answer, address:
1. Whether the goal is **prediction, inference, or causal estimation**, and how that changes the right choices.
2. Which assumptions **actually** matter for OLS and for valid statistical inference, and why the **marginal normality of Y is not one of the Gauss–Markov assumptions**.
3. How **dwell-time data** can violate standard linear-model assumptions (skew, zeros, censoring, outliers, dependence).
4. The risks of the **exhaustive subset + interaction search** and the resulting **model-selection / overfitting bias**, including why "best in-sample fit" is the wrong selection criterion.
5. The diagnostics you would check instead (functional form, heteroskedasticity, multicollinearity/VIF, influence, clustering, leakage).
6. How you would **redesign the modeling and validation process**: baseline model, proper train/validation/test split or cross-validation, evaluation metrics, and possible alternatives such as target transformation, GLMs, regularization, robust/clustered standard errors, or tree-based models.
Note that the sample size is not stated.
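For reference on point 2, the standard Gauss–Markov conditions can be written as below. The notation is an assumption introduced here for illustration: $x_{ij}$ is the value of term $j$ for session $i$, $\varepsilon_i$ is the error term, and $p$ is the number of columns in the design matrix $X$.

```latex
\begin{align*}
y_i &= \beta_0 + \sum_{j} \beta_j x_{ij} + \varepsilon_i
  && \text{(linearity in parameters)} \\
\mathbb{E}[\varepsilon_i \mid X] &= 0
  && \text{(exogeneity)} \\
\operatorname{Var}(\varepsilon_i \mid X) &= \sigma^2
  && \text{(homoskedasticity)} \\
\operatorname{Cov}(\varepsilon_i, \varepsilon_k \mid X) &= 0, \quad i \neq k
  && \text{(uncorrelated errors)} \\
\operatorname{rank}(X) &= p
  && \text{(no perfect collinearity)}
\end{align*}
```

None of these conditions refers to the marginal distribution of $Y$. Normality enters only as an optional extra assumption on the errors, $\varepsilon_i \mid X \sim \mathcal{N}(0, \sigma^2)$, which gives exact finite-sample t- and F-tests; with large samples, inference typically relies on asymptotic results instead.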
Quick Answer: An Apple Data Scientist technical-screen statistics question that asks you to critique another analyst's linear-regression workflow for modeling website dwell time. It tests OLS assumptions (and why marginal normality of Y is not one of them), model-selection and overfitting bias from exhaustive interaction search, diagnostics and validation, and the distinction between prediction, inference, and causal goals on skewed outcome data.
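One way to illustrate the validation point is to compare in-sample error with cross-validated error for a small and a large model. The sketch below uses synthetic stand-in data (an assumption: only `X1` truly affects `Y`, and all predictors are numeric); the helper names `design`, `fit`, `mse`, and `cv_mse` are hypothetical and not part of the original workflow.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in data (assumption): only X1 truly drives Y.
X = rng.normal(size=(n, 4))
y = 30.0 + 5.0 * X[:, 0] + rng.normal(scale=10.0, size=n)

def design(X, full):
    """Intercept + main effects; optionally add squares and pairwise interactions."""
    cols = [np.ones(len(X))] + [X[:, j] for j in range(4)]
    if full:
        cols += [X[:, j] ** 2 for j in range(4)]
        cols += [X[:, i] * X[:, j] for i in range(4) for j in range(i + 1, 4)]
    return np.column_stack(cols)

def fit(A, y):
    """Ordinary least squares coefficients via a least-squares solve."""
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def mse(A, y, beta):
    return float(np.mean((y - A @ beta) ** 2))

def cv_mse(X, y, full, k=5):
    """Mean squared error averaged over k held-out folds."""
    idx = np.random.default_rng(1).permutation(len(y))
    errors = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)  # all rows not in the held-out fold
        beta = fit(design(X[train], full), y[train])
        errors.append(mse(design(X[fold], full), y[fold], beta))
    return float(np.mean(errors))

results = {}
for name, full in [("main effects", False), ("full second-order", True)]:
    A = design(X, full)
    results[name] = {
        "in_sample_mse": mse(A, y, fit(A, y)),
        "cv_mse": cv_mse(X, y, full),
    }
    print(name, results[name])
```

The larger model always has in-sample error at least as low as the smaller nested one, which is why in-sample fit cannot distinguish real structure from noise; the cross-validated error is the fairer comparison. In a real analysis a final untouched test set would still be held out, and if several sessions belong to the same user, the folds would need to be split by user rather than by session.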