High Collinearity in Binary Classification: VIF, SHAP, and Interpretation Strategy
You are modeling a binary outcome Y. Two numeric features A and B are highly correlated: corr(A, B) = 0.98. Other features exist but are only weakly correlated with A and B. You fit:
- (i) a logistic regression (GLM), and
- (ii) a gradient-boosted tree model (GBDT).
Answer the following:
- Variance Inflation Factor (VIF)
  - Compute/estimate the VIF for A and B given corr(A, B) = 0.98 (assume other features add little to R²).
  - Interpret typical VIF/tolerance thresholds that indicate problematic multicollinearity and what that means for logistic regression coefficients.
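As a starting point for the VIF item, the arithmetic can be sketched as follows. The sketch assumes the standard definition VIF = 1 / (1 - R²), with R² taken as corr(A, B)² because the other features are stated to add little. The function name `vif_from_r2` is illustrative and not part of any library.

```python
def vif_from_r2(r_squared: float) -> float:
    """VIF = 1 / (1 - R^2), where R^2 comes from regressing one feature on the others."""
    return 1.0 / (1.0 - r_squared)


corr_ab = 0.98  # given in the problem statement

# Assumption: B is the only meaningful predictor of A (and vice versa),
# so the auxiliary-regression R^2 is approximately corr(A, B)^2.
r2 = corr_ab ** 2
tolerance = 1.0 - r2          # tolerance is the reciprocal of VIF
vif = vif_from_r2(r2)         # the same value applies to A and to B by symmetry
se_inflation = vif ** 0.5     # rough multiplier on the coefficient standard error

print(f"R^2 = {r2:.4f}, tolerance = {tolerance:.4f}, VIF = {vif:.2f}")
print(f"approximate standard-error inflation = {se_inflation:.2f}x")
```

Under these assumptions the VIF is about 25 and the tolerance about 0.04, well past the commonly quoted rule-of-thumb cutoffs of VIF above 5 or 10 (tolerance below 0.2 or 0.1). The standard-error multiplier is exact for ordinary least squares and only approximate for logistic regression, where the information matrix is weighted by the fitted probabilities.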
- SHAP with near-duplicate features
  - Explain how SHAP values behave when A and B are near-duplicates under interventional vs conditional SHAP formulations.
  - Why can attributions be unstable or split unpredictably across A and B (for both GLM and GBDT)?
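One way to see the instability for the GLM case is a minimal sketch that uses the closed form for interventional SHAP on a linear model, phi_i = coef_i * (x_i - mean_i), evaluated on the log-odds scale. To keep the arithmetic exact, it treats A and B as exact duplicates rather than 0.98-correlated. The two coefficient vectors, the intercept, and the input row are made-up values standing in for two refits on different resamples.

```python
def linear_shap(coefs, x, means):
    """Interventional SHAP for a linear model on the log-odds scale:
    phi_i = coef_i * (x_i - mean_i)."""
    return [b * (xi - m) for b, xi, m in zip(coefs, x, means)]


def log_odds(intercept, coefs, x):
    """Linear predictor of a logistic regression."""
    return intercept + sum(b * xi for b, xi in zip(coefs, x))


# Hypothetical row and background means; A and B are exact duplicates here.
x = [2.0, 2.0]          # values of [A, B]
means = [0.5, 0.5]      # background means of [A, B]
intercept = -1.0

# Two hypothetical fits that distribute the same total weight differently.
coefs_fit_1 = [1.0, 0.0]
coefs_fit_2 = [0.25, 0.75]

phi_1 = linear_shap(coefs_fit_1, x, means)
phi_2 = linear_shap(coefs_fit_2, x, means)

print("log-odds:", log_odds(intercept, coefs_fit_1, x), log_odds(intercept, coefs_fit_2, x))
print("per-feature SHAP, fit 1:", phi_1)
print("per-feature SHAP, fit 2:", phi_2)
print("group totals (A + B):", sum(phi_1), sum(phi_2))
```

Both fits give the same prediction and the same combined attribution for the pair, but the per-feature split differs, which is the pattern to expect when the data cannot pin down how weight is shared between A and B. The sketch covers only the interventional formulation. Conditional formulations depend on an estimate of the distribution of B given A and are not reproduced here.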
- Defensible interpretation workflow
  - Propose a workflow to interpret such models: e.g., feature clustering and grouped SHAP, permutation importance conditional on the other feature, and re-fitting after removing one of the pair.
  - Describe the diagnostics you expect to see if A and B are redundant.
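The clustering and grouped-SHAP steps of such a workflow could look roughly like the sketch below. It groups features whose absolute pairwise correlation exceeds a threshold (single linkage via union-find) and then sums per-feature SHAP values within each group. The feature C, its correlations, the 0.9 threshold, and the SHAP values are all made up for illustration; only corr(A, B) = 0.98 comes from the problem. Summing per-feature values is a convenient approximation and is not in general identical to computing Shapley values with each group treated as a single player.

```python
def correlation_groups(names, corr, threshold=0.9):
    """Group features whose |pairwise correlation| exceeds `threshold` (single linkage)."""
    parent = {n: n for n in names}

    def find(n):
        # Follow parent links to the group representative, compressing the path.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for j in range(i + 1, len(names)):
            if abs(corr[i][j]) > threshold:
                parent[find(names[j])] = find(a)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return list(groups.values())


def grouped_shap(shap_row, groups):
    """Sum per-feature SHAP values within each group for one prediction."""
    return {"+".join(g): sum(shap_row[f] for f in g) for g in groups}


names = ["A", "B", "C"]
corr = [
    [1.00, 0.98, 0.10],   # corr(A, B) = 0.98 is given; the C entries are hypothetical
    [0.98, 1.00, 0.12],
    [0.10, 0.12, 1.00],
]
shap_row = {"A": 1.0, "B": 0.5, "C": -0.25}  # hypothetical per-feature SHAP values

groups = correlation_groups(names, corr)
print(groups)
print(grouped_shap(shap_row, groups))
```

Reporting the A+B total alongside the individual values makes it easier to check whether the group attribution stays stable across refits even when the individual split does not.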
- Modeling recommendations and validation
  - Recommend modeling changes (e.g., elastic net for GLM; feature grouping or regularization choices for trees) to handle collinearity.
  - Describe how you would validate that both interpretability and predictive performance remain acceptable.
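For the elastic net recommendation, it may help to write out the penalized objective. The form below follows the common glmnet-style parameterization, which is an assumption here since the question does not fix one: $\lambda \ge 0$ sets the overall penalty strength and $\alpha \in [0, 1]$ mixes the lasso and ridge terms.

```latex
\hat{\beta} = \arg\min_{\beta_0,\,\beta}\;
  -\frac{1}{n}\sum_{i=1}^{n}\Bigl[\,y_i \log p_i + (1 - y_i)\log(1 - p_i)\Bigr]
  \;+\; \lambda\Bigl(\alpha\,\lVert\beta\rVert_1 + \tfrac{1-\alpha}{2}\,\lVert\beta\rVert_2^2\Bigr),
\qquad
p_i = \frac{1}{1 + e^{-(\beta_0 + x_i^{\top}\beta)}}
```

The ridge term is what matters for the A/B pair: with $\alpha < 1$ it tends to pull the coefficients of highly correlated, standardized features toward each other, whereas a pure lasso penalty ($\alpha = 1$) tends to keep one of the pair and drop the other, with the choice sensitive to resampling.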