Diagnose and fix multicollinearity in income regression
Company: Databricks
Role: Data Scientist
Category: Statistics & Math
Difficulty: hard
Interview Round: Technical Screen
You want to estimate the relationship between **gender** and **income** using a regression model, while controlling for other covariates such as:
- `age`
- `education` (e.g., degree level or years of schooling)
- `years_since_degree`
Questions:
1. What is multicollinearity, and is it likely to be present among these predictors? Why?
2. How would you diagnose multicollinearity in practice (specific checks/metrics)?
3. If multicollinearity exists, what are practical ways to address it while still answering the question about gender and income? Discuss tradeoffs (interpretability vs prediction).
Quick Answer: This question evaluates understanding of multicollinearity and regression diagnostics within the Statistics & Math domain, targeting applied regression modeling skills at an intermediate data scientist level.