How would you critique this regression?

Q: How would you critique this regression?

This question evaluates a data scientist's grasp of linear regression theory and practice, including OLS assumptions, model selection and validation, interaction effects, multicollinearity, heteroskedasticity, outliers, leakage, dependence across observations, and the distinction between prediction and causal inference.

Q: How do I approach Statistics & Math interview questions?

Statistics & Math questions require a solid understanding of core concepts, plus practice applying them. PracHub provides solutions with explanations to help you prepare for Statistics & Math interviews.

Q: What difficulty level is this interview question?

This is an easy Statistics & Math question, commonly asked during Technical Screen rounds at Apple.

Q: What role is this question designed for?

This question is commonly asked of Data Scientist candidates at Apple during technical interviews.

Question

You are reviewing a modeling workflow from another data scientist.

Business context: A website receives traffic from Google Search. Define the response variable as:

Y = time spent on the website after a user arrives from Google Search, measured in seconds during that session.

There are 4 candidate predictor variables available. Their exact definitions are not provided, so part of the task is to explain what clarifying questions you would ask before approving the analysis.

The other data scientist used the following workflow to build a linear regression model:

They observed that Y appears approximately normally distributed, so they decided that ordinary least squares (OLS) was appropriate.
They fit all possible combinations of the 4 predictors, and also included all second-order interaction terms.
They chose the model with the best in-sample fit as the final model.
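To get a feel for how large the search described in this workflow can be, the sketch below counts candidate models under two possible readings of "all possible combinations". The predictor names `x1` to `x4` are hypothetical (the question does not define the predictors), and both readings are assumptions about what the other data scientist actually did.

```python
from itertools import combinations

# Hypothetical names; the question does not define the 4 predictors.
predictors = ["x1", "x2", "x3", "x4"]

# All pairwise (second-order) interaction terms among the 4 predictors.
interactions = [f"{a}:{b}" for a, b in combinations(predictors, 2)]


def nonempty_subsets(items):
    """Yield every non-empty subset of items as a tuple."""
    for k in range(1, len(items) + 1):
        yield from combinations(items, k)


# Reading A (assumption): only the set of main effects varies.
main_effect_models = list(nonempty_subsets(predictors))

# Reading B (assumption): every main effect and every interaction
# is free to enter or leave the model independently.
all_terms = predictors + interactions
free_form_models = sum(1 for _ in nonempty_subsets(all_terms))

print(len(interactions))        # 6
print(len(main_effect_models))  # 15
print(free_form_models)         # 1023
```

Under either reading, picking a single winner by in-sample fit means comparing many models on the same data, which is one reason the selection step deserves scrutiny.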

Question: How would you evaluate this workflow? In your answer:

Identify which parts of the reasoning are flawed, incomplete, or potentially misleading.
State what assumptions actually matter for OLS and for statistical inference.
Explain what additional questions you would ask about the data, the product setting, and the modeling goal.
Recommend a better modeling and validation approach.
Discuss issues such as overfitting, interaction selection, multicollinearity, heteroskedasticity, outliers, leakage, dependence across observations, and whether dwell time is even well modeled by plain linear regression.
If the goal is prediction, explain how you would evaluate model quality. If the goal is inference, explain how your recommendations would differ.

You may assume the sample size is not known, and that the 4 predictors could be a mix of numeric and categorical features.
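One point a critique would likely raise is that in-sample fit cannot penalize complexity: for nested OLS models with an intercept, training R² never decreases when a term is added, even if that term is pure noise. The sketch below illustrates this on simulated data using only the Python standard library. Everything about the simulation (one real predictor, eight noise columns, the coefficients, the sample sizes) is an assumption made for illustration and is not taken from the question.

```python
import random


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_ols(X, y):
    """Least-squares coefficients via the normal equations (X'X) b = X'y."""
    p = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve(XtX, Xty)


def r_squared(X, y, beta):
    """R^2 of the fitted coefficients on (X, y), relative to the mean of y."""
    mean_y = sum(y) / len(y)
    sse = sum((yi - sum(b * v for b, v in zip(beta, row))) ** 2
              for row, yi in zip(X, y))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - sse / sst


def make_data(n, n_noise=8):
    """Simulated dwell time: one real predictor plus pure-noise columns."""
    X, y = [], []
    for _ in range(n):
        signal = random.gauss(0, 1)
        noise_cols = [random.gauss(0, 1) for _ in range(n_noise)]
        X.append([1.0, signal] + noise_cols)  # leading 1.0 is the intercept
        y.append(60 + 20 * signal + random.gauss(0, 30))
    return X, y


def first_cols(X, p):
    """Keep only the first p columns of the design matrix."""
    return [row[:p] for row in X]


random.seed(0)
X_train, y_train = make_data(30)
X_test, y_test = make_data(2000)

train_r2, test_r2 = [], []
for p in range(2, 11):  # intercept + signal, then one more noise column each time
    beta = fit_ols(first_cols(X_train, p), y_train)
    train_r2.append(r_squared(first_cols(X_train, p), y_train, beta))
    test_r2.append(r_squared(first_cols(X_test, p), y_test, beta))

for k, (tr, te) in enumerate(zip(train_r2, test_r2)):
    print(f"noise columns={k}  train R2={tr:.3f}  held-out R2={te:.3f}")
```

The training R² column can only go up as noise columns are added, while the held-out column has no such guarantee and often deteriorates. For a prediction goal, this is the usual argument for choosing among candidate models with held-out data or cross-validation rather than in-sample fit.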
