How would you critique this regression?
Company: Apple
Role: Data Scientist
Category: Statistics & Math
Difficulty: easy
Interview Round: Technical Screen
You are reviewing a modeling workflow from another data scientist.
**Business context**
A website receives traffic from Google Search. Define the response variable as:
- **Y** = time spent on the website, in seconds, during the session that begins when a user arrives from Google Search.
There are **4 candidate predictor variables** available. Their exact definitions are not provided, so part of the task is to explain what clarifying questions you would ask before approving the analysis.
The other data scientist used the following workflow to build a **linear regression** model:
1. They observed that **Y appears approximately normally distributed**, so they decided that **ordinary least squares (OLS)** was appropriate.
2. They fit a model for **every possible subset of the 4 predictors**, and each model also included **all second-order (pairwise) interaction terms** among the predictors in it.
3. They chose the model with the **best in-sample fit** as the final model.
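To make step 3 concrete, here is a minimal sketch of why picking the best in-sample fit tends to select the largest model. It uses synthetic data in which only one of four hypothetical numeric predictors matters; the data-generating process, the sample size, and the helper names (`design`, `r2_in_sample`, `cv_mse`) are illustrative assumptions, not part of the original workflow.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 4))  # four hypothetical numeric predictors
# Assumed truth for illustration: only predictor 0 matters, noise is large.
y = 60 + 10 * X[:, 0] + rng.normal(scale=30, size=n)

def design(X, cols):
    """Intercept, the chosen main effects, and all pairwise interactions among them."""
    parts = [np.ones(len(X))]
    parts += [X[:, j] for j in cols]
    parts += [X[:, a] * X[:, b] for a, b in itertools.combinations(cols, 2)]
    return np.column_stack(parts)

def fit(D, y):
    """Ordinary least squares coefficients for design matrix D."""
    beta, *_ = np.linalg.lstsq(D, y, rcond=None)
    return beta

def r2_in_sample(cols):
    D = design(X, cols)
    resid = y - D @ fit(D, y)
    return 1 - resid @ resid / np.sum((y - y.mean()) ** 2)

# Fixed folds so every candidate model is scored on the same splits.
folds = np.array_split(rng.permutation(n), 5)

def cv_mse(cols):
    """Mean squared error on held-out folds."""
    errs = []
    for test in folds:
        train = np.setdiff1d(np.arange(n), test)
        beta = fit(design(X[train], cols), y[train])
        pred = design(X[test], cols) @ beta
        errs.append(np.mean((y[test] - pred) ** 2))
    return float(np.mean(errs))

subsets = [c for r in range(1, 5) for c in itertools.combinations(range(4), r)]
best_in_sample = max(subsets, key=r2_in_sample)
best_cv = min(subsets, key=cv_mse)
print("best by in-sample R^2:", best_in_sample)
print("best by 5-fold CV MSE:", best_cv)
```

Because every smaller model here is nested inside the full one, in-sample R² cannot decrease as terms are added, so the in-sample criterion picks all four predictors regardless of the truth. The cross-validated choice depends on the data and may differ.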
**Question**
How would you evaluate this workflow? In your answer:
- Identify which parts of the reasoning are flawed, incomplete, or potentially misleading.
- State what assumptions actually matter for OLS and for statistical inference.
- Explain what additional questions you would ask about the data, the product setting, and the modeling goal.
- Recommend a better modeling and validation approach.
- Discuss issues such as overfitting, interaction selection, multicollinearity, heteroskedasticity, outliers, leakage, dependence across observations, and whether dwell time is even well modeled by plain linear regression.
- If the goal is prediction, explain how you would evaluate model quality. If the goal is inference, explain how your recommendations would differ.
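Several of the issues listed above can be checked directly. The following is a rough sketch, on synthetic data, of three of them: modeling log dwell time instead of raw seconds, comparing classical standard errors with heteroskedasticity-robust (HC1) ones, and computing variance inflation factors (VIFs) for collinear predictors. The variables `x0`, `x1`, `x2` and the data-generating process are assumptions made for illustration, and the log transform is only one of several options for a skewed, positive response.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x0 = rng.normal(size=n)
x1 = 0.9 * x0 + np.sqrt(1 - 0.9**2) * rng.normal(size=n)  # strongly correlated with x0
x2 = rng.normal(size=n)                                   # independent of the others
# Hypothetical right-skewed dwell time whose spread grows with |x0|.
y = np.exp(3.5 + 0.4 * x0 + (0.5 + 0.3 * np.abs(x0)) * rng.normal(size=n))

X = np.column_stack([np.ones(n), x0, x1, x2])
log_y = np.log(y)  # valid here because the simulated y is strictly positive

beta, *_ = np.linalg.lstsq(X, log_y, rcond=None)
resid = log_y - X @ beta
p = X.shape[1]
XtX_inv = np.linalg.inv(X.T @ X)

# Classical standard errors assume constant error variance.
sigma2 = resid @ resid / (n - p)
se_classical = np.sqrt(np.diag(sigma2 * XtX_inv))

# HC1 sandwich estimator: remains valid when the error variance depends on X.
meat = X.T @ (X * resid[:, None] ** 2)
cov_hc1 = n / (n - p) * XtX_inv @ meat @ XtX_inv
se_robust = np.sqrt(np.diag(cov_hc1))

def vif(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing it on the other columns."""
    others = np.delete(X, j, axis=1)  # still contains the intercept column
    b, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
    r = X[:, j] - others @ b
    r2 = 1 - r @ r / np.sum((X[:, j] - X[:, j].mean()) ** 2)
    return 1 / (1 - r2)

vifs = [vif(X, j) for j in range(1, p)]
print("classical SE:", np.round(se_classical, 3))
print("robust SE:   ", np.round(se_robust, 3))
print("VIFs:        ", np.round(vifs, 2))
```

Note that none of these checks looks at the marginal distribution of Y. What matters for OLS inference is the behavior of the errors conditional on the predictors, which is why the workflow's first step is not a sufficient justification.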
Assume the sample size is unknown and that the 4 predictors may be a mix of numeric and categorical features.
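The mix of feature types matters because categorical predictors inflate the number of coefficients in a model with all pairwise interactions. As a small illustration, assume full-rank dummy coding, where a numeric predictor contributes 1 column and a k-level categorical predictor contributes k - 1 columns. The predictor names and level counts below are hypothetical.

```python
from itertools import combinations

# Hypothetical encoding widths: two numeric predictors, one 5-level and
# one 10-level categorical predictor (k levels -> k - 1 dummy columns).
columns_per_predictor = {"x1": 1, "x2": 1, "x3": 4, "x4": 9}

def n_parameters(widths):
    """Intercept + main-effect columns + pairwise interaction columns."""
    main = sum(widths)
    interactions = sum(a * b for a, b in combinations(widths, 2))
    return 1 + main + interactions

total = n_parameters(list(columns_per_predictor.values()))
print(total)  # 79
```

With four numeric predictors the same formula gives 11 coefficients, so whether the full interaction model is reasonable depends heavily on both the encoding and the sample size, neither of which is stated.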
Quick Answer: This question evaluates a data scientist's grasp of linear regression theory and practice, including OLS assumptions, model selection and validation, interaction effects, multicollinearity, heteroskedasticity, outliers, leakage, dependence across observations, and the distinction between prediction and causal inference.