PracHub

Diagnose and fix linear regression assumption breaks

Last updated: Mar 29, 2026

Quick Overview

This question evaluates a data scientist's competency in linear regression diagnostics and remedial modeling. It covers the core OLS assumptions (linearity, no perfect multicollinearity, exogeneity, homoskedasticity, error independence, and normality), diagnostics and remedies for heteroskedasticity and severe multicollinearity, and method selection among ridge/LASSO and GLMs. It is commonly asked in the Machine Learning and statistical modeling domain because it probes how well a candidate detects assumption violations, interprets their impact on standard errors, confidence intervals, and hypothesis tests, and balances conceptual understanding with practical model refitting.

Company: Citibank

Role: Data Scientist

Category: Machine Learning

Difficulty: medium

Interview Round: Take-home Project

List the standard OLS assumptions needed for unbiased, efficient, and consistent linear regression estimates (linearity/correct specification, no perfect multicollinearity, exogeneity E[ε|X]=0, homoskedasticity, no autocorrelation/independence, and normality only for exact finite-sample inference). For each assumption, name one concrete diagnostic (e.g., residual plots, VIF, White/Breusch–Pagan, Durbin–Watson, RESET) and one remedy (e.g., transformations, robust/clustered SEs, regularization, GLMs). Given n=10,000 with Var(ε|X) ∝ x1^2 and corr(x2,x3)=0.98, outline your exact steps to validate, refit, and compare models (changes to standard errors, CIs, and hypothesis tests). When would you switch from OLS to ridge/LASSO or a GLM, and why?

Oct 13, 2025, 9:49 PM

OLS Assumptions, Diagnostics, Remedies, and Refitting Under Heteroskedasticity and Multicollinearity

You are fitting a linear regression with Ordinary Least Squares (OLS) on a large cross-sectional dataset (n = 10,000). Answer the following:

1) Core OLS Assumptions

List the standard OLS assumptions required for unbiased, efficient, and consistent estimates:

  • Linearity / correct specification
  • No perfect multicollinearity
  • Exogeneity: E[ε | X] = 0
  • Homoskedasticity (constant variance)
  • No autocorrelation / independence of errors
  • Normality of errors (only needed for exact finite-sample t/F inference)

For each assumption, provide:

  • One concrete diagnostic
  • One concrete remedy

Examples of diagnostics: residual plots, VIF, White/Breusch–Pagan, Durbin–Watson, RESET.

Examples of remedies: transformations, robust/clustered SEs, regularization, GLMs.
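Three of the diagnostics named above (VIF, Breusch–Pagan, Durbin–Watson) can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the helper names (`ols`, `r_squared`, `vif`, `breusch_pagan_1df`, `durbin_watson`), the simulated data, and the choice of x1² as the single variance driver in the Breusch–Pagan auxiliary regression are all assumptions made for the example.

```python
import math
import numpy as np

def ols(X, y):
    """OLS coefficients and residuals; X must already contain an intercept column."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, y - X @ beta

def r_squared(X, y):
    _, resid = ols(X, y)
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

def vif(X, j):
    """Variance inflation factor of column j: 1 / (1 - R^2 of column j on the other columns)."""
    others = np.delete(X, j, axis=1)
    return 1.0 / (1.0 - r_squared(others, X[:, j]))

def breusch_pagan_1df(resid, z):
    """LM = n * R^2 from regressing squared residuals on one variance driver z.
    With a single driver, LM is chi-square(1) under homoskedasticity."""
    Z = np.column_stack([np.ones_like(z), z])
    lm = len(resid) * r_squared(Z, resid ** 2)
    p_value = math.erfc(math.sqrt(lm / 2.0))  # upper tail of chi-square(1)
    return lm, p_value

def durbin_watson(resid):
    """Values near 2 suggest no first-order autocorrelation."""
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

# Simulated data (assumed coefficients) with Var(eps | X) proportional to x1^2
# and corr(x2, x3) of about 0.98.
rng = np.random.default_rng(0)
n = 10_000
x1 = rng.uniform(1.0, 5.0, n)
x2 = rng.normal(size=n)
x3 = 0.98 * x2 + math.sqrt(1 - 0.98 ** 2) * rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 1.0 * x2 + 1.0 * x3 + x1 * rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2, x3])
beta, resid = ols(X, y)

vif_x2 = vif(X, 2)                               # large when x2 is nearly collinear with x3
bp_lm, bp_p = breusch_pagan_1df(resid, x1 ** 2)  # small p-value flags heteroskedasticity
dw = durbin_watson(resid)
print(f"VIF(x2)={vif_x2:.1f}  BP LM={bp_lm:.0f} (p={bp_p:.3g})  DW={dw:.2f}")
```

In practice one would more likely call the equivalent routines in statsmodels (`variance_inflation_factor`, `het_breuschpagan`, `durbin_watson`); the hand-rolled versions are here only to make the underlying arithmetic visible.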

2) Scenario: Heteroskedasticity and Multicollinearity

Given:

  • n = 10,000
  • Var(ε | X) ∝ x1² (heteroskedasticity driven by x1)
  • corr(x2, x3) = 0.98 (severe multicollinearity)

Outline exact steps to:

  • Validate assumptions
  • Refit models with appropriate fixes
  • Compare models
  • Describe expected changes to standard errors, confidence intervals, and hypothesis tests
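One way to make the refit-and-compare step concrete is to put classical OLS standard errors, heteroskedasticity-robust (HC1 sandwich) standard errors, and weighted least squares with weights 1/x1² side by side. The sketch below assumes simulated data that matches the stated scenario; the true coefficients, the uniform range of x1, and the variable names are illustrative assumptions, and the WLS weights are correct only because the variance form is known here.

```python
import numpy as np

# Simulated data matching the scenario (coefficients are assumed for illustration).
rng = np.random.default_rng(1)
n = 10_000
x1 = rng.uniform(1.0, 5.0, n)
x2 = rng.normal(size=n)
x3 = 0.98 * x2 + np.sqrt(1 - 0.98 ** 2) * rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 1.0 * x2 + 1.0 * x3 + x1 * rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2, x3])
k = X.shape[1]

# 1) Plain OLS with classical (homoskedastic) standard errors.
XtX_inv = np.linalg.inv(X.T @ X)
beta_ols = XtX_inv @ X.T @ y
resid = y - X @ beta_ols
sigma2 = (resid @ resid) / (n - k)
se_classical = np.sqrt(np.diag(sigma2 * XtX_inv))

# 2) Same coefficients, HC1 sandwich standard errors:
#    (X'X)^-1 [sum of e_i^2 x_i x_i'] (X'X)^-1, scaled by n / (n - k).
meat = (X * resid[:, None] ** 2).T @ X
se_robust = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv) * n / (n - k))

# 3) WLS with weights 1/x1^2, done by dividing every row by x1
#    so the transformed errors have constant variance.
Xw, yw = X / x1[:, None], y / x1
XtXw_inv = np.linalg.inv(Xw.T @ Xw)
beta_wls = XtXw_inv @ Xw.T @ yw
resid_w = yw - Xw @ beta_wls
se_wls = np.sqrt(np.diag((resid_w @ resid_w) / (n - k) * XtXw_inv))

# 95% confidence interval for the x1 coefficient under each approach.
for name, b, se in [("OLS classical", beta_ols, se_classical),
                    ("OLS + HC1", beta_ols, se_robust),
                    ("WLS 1/x1^2", beta_wls, se_wls)]:
    lo, hi = b[1] - 1.96 * se[1], b[1] + 1.96 * se[1]
    print(f"{name:14s} b1={b[1]:.3f}  se={se[1]:.4f}  CI=({lo:.3f}, {hi:.3f})")
```

Robust standard errors leave the point estimates unchanged and only correct the inference, while WLS changes both the estimates and their precision. Neither fix addresses the x2/x3 collinearity: the individual standard errors on those two coefficients stay inflated, even though their sum is estimated well.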

3) Method Choice

When would you switch from OLS to ridge/LASSO or to a GLM, and why?
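For the ridge side of this question, a small Monte Carlo sketch can illustrate the bias-variance trade that motivates the switch when predictors are nearly collinear. Everything here is an assumption for illustration: the closed-form helper `ridge_fit`, the penalty `lam = 50.0` (in practice chosen by cross-validation), the sample size per replication, and the true coefficients of 1.0.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge on centered data: (X'X + lam * I)^-1 X'y. lam = 0 reduces to OLS."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    p = Xc.shape[1]
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)

rng = np.random.default_rng(2)
n, reps, lam = 500, 200, 50.0  # lam is an arbitrary illustrative penalty
ols_b2, ridge_b2 = [], []

for _ in range(reps):
    x2 = rng.normal(size=n)
    x3 = 0.98 * x2 + np.sqrt(1 - 0.98 ** 2) * rng.normal(size=n)  # corr(x2, x3) about 0.98
    y = 1.0 * x2 + 1.0 * x3 + rng.normal(size=n)
    X = np.column_stack([x2, x3])
    ols_b2.append(ridge_fit(X, y, 0.0)[0])
    ridge_b2.append(ridge_fit(X, y, lam)[0])

sd_ols, sd_ridge = np.std(ols_b2), np.std(ridge_b2)
bias_ridge = np.mean(ridge_b2) - 1.0  # true coefficient on x2 is 1.0 in this simulation

print(f"sd of b2 across samples: OLS={sd_ols:.3f}, ridge={sd_ridge:.3f}")
print(f"ridge bias on b2: {bias_ridge:+.3f}")
```

The ridge estimate of the x2 coefficient varies far less from sample to sample than the OLS estimate, at the cost of a small shrinkage bias. That trade favors ridge when the goal is prediction or stable coefficients, and favors OLS when unbiased estimates and conventional confidence intervals are required. A GLM is a separate kind of switch, driven by the distribution of the response (for example binary or count outcomes) rather than by collinearity, and is not covered by this sketch.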
