Random Forest — Rigor and Practical Choices
Context: You are building a binary classifier with a Random Forest. The dataset has 100,000 rows, 100 features, and a 5% positive rate. Answer the following:
Sources of Randomness
- Enumerate the sources of randomness in Random Forests (e.g., bootstrap sampling, feature subsampling at each split, random tie-breaking, randomized split points). For each, explain its typical effect on bias and variance.
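A minimal sketch of these randomness sources in action, assuming scikit-learn and a small synthetic dataset standing in for the 100,000-row one (the `weights=[0.95]` setting approximates the 5% positive rate). Two forests that differ only in their seed draw different bootstraps and per-split feature subsets, so they learn measurably different models:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for the real data: ~5% positives via weights=[0.95].
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.95], random_state=0)

# Identical hyperparameters; only the seed differs. The seed drives the
# bootstrap draws, the feature subset sampled at each split, and tie-breaking.
rf_a = RandomForestClassifier(n_estimators=50, random_state=1).fit(X, y)
rf_b = RandomForestClassifier(n_estimators=50, random_state=2).fit(X, y)

# The fitted ensembles differ, e.g. in their feature importances.
diff = np.abs(rf_a.feature_importances_ - rf_b.feature_importances_).max()
print(f"max importance difference between seeds: {diff:.4f}")
```

The gap shrinks as `n_estimators` grows, which is the variance-reduction effect the question asks about.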
Hyperparameters for the Given Dataset
- Propose reasonable values for n_estimators, max_depth, and max_features for the dataset above.
- Explain how max_features controls the correlation among trees and where it sits on the bias–variance trade-off.
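One reasonable baseline configuration for this dataset, sketched with scikit-learn; the specific values (and the extra `min_samples_leaf` and `n_jobs` settings) are illustrative assumptions, not tuned answers:

```python
from sklearn.ensemble import RandomForestClassifier

# Hedged starting point for ~100k rows, 100 features, 5% positives.
rf = RandomForestClassifier(
    n_estimators=500,      # enough trees for OOB and importance estimates to stabilize
    max_depth=None,        # grow trees fully; averaging absorbs single-tree variance
    max_features="sqrt",   # ~10 of 100 features per split -> decorrelated trees
    min_samples_leaf=5,    # mild regularization given the rare positive class
    n_jobs=-1,
    random_state=0,
)
```

Lowering `max_features` decorrelates the trees (lower ensemble variance at the cost of slightly higher bias per tree); raising it does the opposite, which is the trade-off the second bullet asks you to justify.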
OOB Error vs 5-Fold Cross-Validation
- Compare out-of-bag (OOB) error with 5-fold cross-validation (CV).
- When can the two estimates disagree, and why?
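A sketch of computing both estimates side by side, again on small synthetic data (assumed stand-in, not the real dataset). OOB scores each tree on the roughly 37% of rows left out of its bootstrap, at no extra training cost; 5-fold CV refits the entire forest five times:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.95], random_state=0)

# OOB: free validation signal from the bootstrap leftovers of each tree.
rf = RandomForestClassifier(n_estimators=200, oob_score=True,
                            random_state=0, n_jobs=-1).fit(X, y)
print(f"OOB accuracy:   {rf.oob_score_:.3f}")

# 5-fold CV: five full refits, each on 80% of the rows.
cv = cross_val_score(
    RandomForestClassifier(n_estimators=200, random_state=0, n_jobs=-1),
    X, y, cv=5)
print(f"5-fold CV mean: {cv.mean():.3f}")
```

The two typically agree; they can diverge when `n_estimators` is small (each OOB prediction then aggregates few trees, so the OOB vote is noisy) or when the CV folds are not stratified the way the 5% positive rate requires.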
Feature Importance Bias
- Explain why impurity-based importances are biased toward continuous or high-cardinality features.
- Propose a corrected approach (e.g., permutation importance with stratified shuffles and repeated runs) and justify your design.
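A sketch of one such corrected approach, assuming scikit-learn's `permutation_importance`: importances are measured on a stratified held-out split (approximating the stratification the question mentions) and averaged over repeated shuffles via `n_repeats`, so a feature only scores well if shuffling it actually degrades held-out AUC:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           weights=[0.95], random_state=0)
# Stratified split so the 5% positive rate is preserved in the test fold.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0, n_jobs=-1)
rf.fit(X_tr, y_tr)

# Shuffle one column at a time on held-out data and record the AUC drop,
# averaged over n_repeats independent shuffles.
result = permutation_importance(rf, X_te, y_te, scoring="roc_auc",
                                n_repeats=10, random_state=0, n_jobs=-1)
top = np.argsort(result.importances_mean)[::-1][:5]
print("top features:", top, result.importances_mean[top].round(3))
```

Unlike impurity-based scores, this cannot reward a feature merely for offering many candidate split points, because the metric is evaluated on data the trees never split on.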
Class Imbalance Strategies
- Outline strategies for class imbalance (e.g., class_weight, threshold moving, balanced subsampling).
- Discuss the consequences of each strategy for probability calibration and decision thresholds.
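A sketch combining two of these strategies with scikit-learn (the synthetic data and the F1 criterion for threshold selection are illustrative assumptions): `class_weight="balanced_subsample"` reweights classes within each bootstrap, and threshold moving then picks a cutoff on the predicted probabilities instead of the default 0.5:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Strategy 1: reweight the rare class inside each tree's impurity computation,
# recomputed per bootstrap sample ("balanced_subsample").
rf = RandomForestClassifier(n_estimators=200,
                            class_weight="balanced_subsample",
                            random_state=0, n_jobs=-1).fit(X_tr, y_tr)

# Strategy 2: threshold moving on held-out probabilities. Note that class
# reweighting distorts predict_proba, so these scores are no longer
# calibrated risks; recalibrate before using them as probabilities.
proba = rf.predict_proba(X_te)[:, 1]
thresholds = np.linspace(0.05, 0.95, 19)
scores = [f1_score(y_te, proba >= t, zero_division=0) for t in thresholds]
best = thresholds[int(np.argmax(scores))]
print(f"best F1 threshold: {best:.2f}")
```

The calibration point in the comment is exactly the consequence the last bullet asks about: reweighting or resampling shifts the probability scale, so the operating threshold and any downstream risk estimates must be re-derived on untouched validation data.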