ML Case: Missing Lowest-Income Bracket in California Housing Data
Context
You're building a supervised model (regression) to predict California housing prices using a dataset similar to the classic California Housing data. One key covariate is household income. The training data contains no observations from the lowest-income bracket (< $25k), but the deployed model must perform well across all income ranges, including this unseen bracket at inference time.
Assume the deployment/test distribution will include the full income range, including < $25k. You may optionally have access to unlabeled production covariates (features only) that include the missing bracket.
Task
Design a modeling approach that achieves robust performance across all income ranges, with special attention to the unseen lowest-income bracket. Your answer should cover:
-
Diagnostics: How you’d confirm and quantify the shift and missing support.
-
Modeling strategy: Architectures/algorithms that extrapolate sensibly and incorporate domain knowledge.
-
Distribution shift handling: Methods such as importance weighting, domain adaptation/transfer learning, and data augmentation (if appropriate).
-
Feature scaling and preprocessing choices that help stability.
-
Validation: How you will evaluate performance for the unseen bracket before production, stress tests, and uncertainty estimates.
-
Deployment and incremental retraining plan once data from the missing bracket starts arriving.
You may reference techniques like domain similarity, incremental retraining, covariate shift correction, transfer learning, and feature scaling.