PracHub

How would you critique this regression?

Last updated: Jun 15, 2026

Quick Overview

An Apple Data Scientist technical-screen statistics question that asks you to critique another analyst's linear-regression workflow for modeling website dwell time. It tests OLS assumptions (and why marginal normality of Y is not one of them), model-selection and overfitting bias from exhaustive interaction search, diagnostics and validation, and the distinction between prediction, inference, and causal goals on skewed outcome data.

Company: Apple

Role: Data Scientist

Category: Statistics & Math

Difficulty: easy

Interview Round: Technical Screen

Jan 8, 2026, 12:00 AM
Question

You are reviewing a modeling workflow built by another data scientist and asked to critique it.

Business context

A website receives traffic from Google Search. The response variable is:

  • Y = the number of seconds a user stays on the website after clicking through from Google Search, measured during that session (i.e. session dwell time).

There are 4 candidate predictor variables, X1–X4. Their exact definitions are not provided (they could be a mix of numeric and categorical features), so part of the task is to explain what you would clarify before approving the analysis.

The other data scientist used the following workflow to build a linear regression model:

  1. They observed that Y appears approximately normally distributed, and concluded that ordinary least squares (OLS) was therefore appropriate.
  2. They fit all possible combinations of the 4 predictors, including squared (quadratic) terms and all pairwise second-order interaction terms.
  3. They chose the model with the best in-sample fit as the final model.
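
To make the scale of step 2 and the weakness of step 3 concrete, here is a minimal Python sketch. It assumes that "all possible combinations" means every subset of the 14 candidate terms (4 linear, 4 squared, 6 pairwise interactions), which would give 2^14 = 16,384 models; the original workflow may have counted differently. The sample size `n = 60`, the pure-noise data, and the helper names `expand_terms`, `fit_ols`, and `r_squared` are hypothetical choices made only for illustration.

```python
from itertools import combinations

import numpy as np


def expand_terms(X):
    """Build the candidate columns: linear, squared, and pairwise interaction terms."""
    p = X.shape[1]
    linear = [X[:, j] for j in range(p)]
    squared = [X[:, j] ** 2 for j in range(p)]
    inter = [X[:, i] * X[:, j] for i, j in combinations(range(p), 2)]
    return np.column_stack(linear + squared + inter)


def fit_ols(T, y):
    """Least-squares fit with an intercept; returns the coefficient vector."""
    A = np.column_stack([np.ones(len(y)), T])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta


def r_squared(T, y, beta):
    """R^2 of the fitted coefficients on (T, y), relative to the mean of y."""
    A = np.column_stack([np.ones(len(y)), T])
    resid = y - A @ beta
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)


rng = np.random.default_rng(0)
n = 60  # hypothetical sample size; the question does not state one

# Pure noise by construction: y is unrelated to every predictor.
X_train, X_test = rng.normal(size=(n, 4)), rng.normal(size=(n, 4))
y_train, y_test = rng.normal(size=n), rng.normal(size=n)

T_train, T_test = expand_terms(X_train), expand_terms(X_test)
n_terms = T_train.shape[1]   # 4 + 4 + 6 candidate terms
n_models = 2 ** n_terms      # every subset of those terms

# Fit the largest model and score it in sample and on held-out data.
beta_full = fit_ols(T_train, y_train)
r2_full_train = r_squared(T_train, y_train, beta_full)
r2_full_test = r_squared(T_test, y_test, beta_full)

# Score a random sample of smaller subsets in sample.
subset_r2 = []
for _ in range(200):
    mask = rng.random(n_terms) < 0.5
    beta = fit_ols(T_train[:, mask], y_train)
    subset_r2.append(r_squared(T_train[:, mask], y_train, beta))

print(f"candidate terms: {n_terms}, candidate models: {n_models}")
print(f"full model in-sample R^2: {r2_full_train:.3f}")
print(f"best sampled subset in-sample R^2: {max(subset_r2):.3f}")
print(f"full model held-out R^2: {r2_full_test:.3f}")
```

Adding a term can never increase the in-sample residual sum of squares, so under this selection rule the largest model always wins (up to ties), even when Y is noise. The held-out R² printed alongside is the fairer number to compare against.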

Question

Critique this workflow. What clarifying questions would you ask before accepting the analysis, and what would you recommend instead? In your answer, address:

  1. Whether the goal is prediction, inference, or causal estimation, and how that changes the right choices.
  2. Which assumptions actually matter for OLS and for valid statistical inference, and why the marginal normality of Y is not one of the Gauss–Markov assumptions.
  3. How dwell-time data can violate standard linear-model assumptions (skew, zeros, censoring, outliers, dependence).
  4. The risks of the exhaustive subset + interaction search and the resulting model-selection / overfitting bias, including why "best in-sample fit" is the wrong selection criterion.
  5. The diagnostics you would check instead (functional form, heteroskedasticity, multicollinearity/VIF, influence, clustering, leakage).
  6. How you would redesign the modeling and validation process — baseline model, proper train/validation/test or cross-validation, evaluation metrics, and possible alternatives such as target transformation, GLMs, regularization, robust/clustered standard errors, or tree-based models.

Note that the sample size is not stated.
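
As one way to see why point 2 separates the marginal distribution of Y from the error distribution, the following Python sketch simulates a case where the errors are exactly normal but Y itself is clearly bimodal. Everything here is an illustrative assumption: the binary predictor `x`, the coefficients 10 and 50, the error standard deviation of 5, and the helper `excess_kurtosis` are hypothetical and are not taken from the dwell-time data.

```python
import numpy as np


def excess_kurtosis(v):
    """Sample excess kurtosis; roughly 0 for normally distributed data."""
    z = (v - v.mean()) / v.std()
    return np.mean(z ** 4) - 3.0


rng = np.random.default_rng(1)
n = 5000

# Hypothetical binary predictor with a large effect on the outcome.
x = np.arange(n) % 2
eps = rng.normal(scale=5.0, size=n)   # errors are normal by construction
y = 10.0 + 50.0 * x + eps             # marginal Y is a two-component mixture

# Ordinary least squares with an intercept.
A = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ beta

kurt_y = excess_kurtosis(y)
kurt_resid = excess_kurtosis(resid)

print(f"estimated slope: {beta[1]:.2f}")
print(f"excess kurtosis of Y: {kurt_y:.2f}")
print(f"excess kurtosis of residuals: {kurt_resid:.2f}")
```

In this setup OLS recovers the slope well even though a histogram of Y would look far from normal. The reverse case matters for the workflow under review: a roughly bell-shaped histogram of Y says little about whether the errors, conditional on the predictors, have constant variance or are independent.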
