Model Robustness, Diagnostics, Random Forests, and Large-Scale Regression
Context
You are building and evaluating a supervised model to predict residential house prices in a city. Address the following topics about linear models, Random Forests, feature engineering, and large-scale training.
Tasks
- Linear regression diagnostics
  - How do you detect and handle outliers and influential points?
  - Explain Cook's distance and high-leverage points. How are they computed and interpreted?
- Random Forests
  - How can you prune trees (or otherwise control complexity) in Random Forests?
  - How do you compute and interpret variable importance?
- City house-price prediction framework
  - Design a modeling framework to predict a city's house prices. Which factors/features would you include and why?
- Large-scale linear regression
  - When the number of predictors is large and data do not fit in memory, how can you compute or update the β coefficients in mini-batches without loading all data at once?
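
One way to approach the mini-batch question is incremental least squares: accumulate the sufficient statistics X^T X and X^T y batch by batch, then solve the normal equations once at the end. The sketch below is illustrative only. The names `fit_ols_in_batches` and `make_batches`, the synthetic data, and the optional `ridge` term are assumptions made for the example, and it presumes the p x p matrix X^T X fits in memory even though the rows do not.

```python
import numpy as np

def fit_ols_in_batches(batches, n_features, ridge=0.0):
    """Accumulate X^T X and X^T y over mini-batches, then solve for beta."""
    xtx = np.zeros((n_features, n_features))
    xty = np.zeros(n_features)
    for X, y in batches:
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        xtx += X.T @ X   # p x p, independent of the number of rows seen
        xty += X.T @ y   # length p
    # ridge=0.0 gives ordinary least squares; a small positive value
    # (an assumption, not part of plain OLS) stabilizes a near-singular solve.
    return np.linalg.solve(xtx + ridge * np.eye(n_features), xty)

def make_batches(n_batches=20, batch_size=50, seed=0):
    """Hypothetical data source: yields (X, y) chunks as if read from disk."""
    rng = np.random.default_rng(seed)
    true_beta = np.array([2.0, -1.0, 0.5])
    for _ in range(n_batches):
        X = rng.normal(size=(batch_size, 3))
        y = X @ true_beta + rng.normal(scale=0.01, size=batch_size)
        yield X, y

beta_hat = fit_ols_in_batches(make_batches(), n_features=3)
```

Memory use here grows with p squared rather than with the number of rows. When p is too large even for that, stochastic gradient descent on the squared loss is the usual alternative, at the cost of having to tune a learning rate.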
Hints
- Use leverage–residual plots, robust loss, OOB permutation importance, and incremental least squares or SGD.
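
As a starting point for the diagnostics question, leverage and Cook's distance can be computed directly from the design matrix. With hat matrix H = X (X^T X)^{-1} X^T, the leverage of observation i is h_ii, and Cook's distance is D_i = e_i^2 / (p s^2) * h_ii / (1 - h_ii)^2, where e_i is the residual, p is the number of columns of X (intercept included), and s^2 is the residual variance estimate. The sketch below is a minimal illustration under stated assumptions: the function name `leverage_and_cooks`, the synthetic area/price data, and the planted outlier are all made up for the example, and X is assumed to have full column rank.

```python
import numpy as np

def leverage_and_cooks(X, y):
    """Return (leverage h_ii, Cook's distance D_i) for an OLS fit of y on X."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    # Thin QR: H = Q Q^T, so h_ii is the squared norm of row i of Q.
    Q, R = np.linalg.qr(X)
    leverage = np.sum(Q**2, axis=1)
    beta = np.linalg.solve(R, Q.T @ y)
    resid = y - X @ beta
    s2 = resid @ resid / (n - p)  # residual variance estimate
    cooks = resid**2 / (p * s2) * leverage / (1.0 - leverage) ** 2
    return leverage, cooks

# Synthetic example (assumed data): price grows linearly with floor area.
rng = np.random.default_rng(0)
n = 50
area = rng.uniform(50, 150, size=n)
price = 1000 * area + rng.normal(scale=5000, size=n)
# Plant one unusually large house priced far below the trend.
area[0], price[0] = 400.0, 100_000.0

X = np.column_stack([np.ones(n), area])  # intercept + area
leverage, cooks = leverage_and_cooks(X, price)
most_influential = int(np.argmax(cooks))
```

Common rules of thumb flag points with leverage above 2p/n or Cook's distance above 4/n for closer inspection. These are heuristics rather than tests, so a flagged point calls for a look at the record itself before deciding whether to correct it, downweight it with a robust loss, or keep it.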