Random Forests, Bagging vs Boosting, and Practical Model Validation
You are building a supervised learning model on tabular data. Explain and compare ensemble methods, evaluation, and validation choices for Random Forests and related approaches.
A. Random Forest Aggregation and Feature Subsampling
- How does a Random Forest classifier aggregate predictions from bootstrapped decision trees? Describe bootstrapping and the aggregation rule for classification vs regression.
- How does feature subsampling at each split reduce correlation between trees, and why does that matter for variance reduction? (A minimal sketch of both mechanisms follows below.)
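To make bootstrapping, per-split feature subsampling, and the aggregation rule concrete, here is a minimal sketch using scikit-learn decision trees on synthetic data; it is illustrative only, not a substitute for RandomForestClassifier (for regression, the majority vote would be replaced by an average of tree predictions).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
rng = np.random.default_rng(0)

trees = []
for _ in range(25):
    # Bootstrapping: draw n row indices with replacement for each tree.
    idx = rng.integers(0, len(X), size=len(X))
    # Feature subsampling: max_features="sqrt" considers a random subset of
    # features at every split, which decorrelates the trees.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=0)
    trees.append(tree.fit(X[idx], y[idx]))

# Aggregation for classification: majority vote across the trees.
votes = np.stack([t.predict(X) for t in trees])      # shape (n_trees, n_samples)
majority = (votes.mean(axis=0) >= 0.5).astype(int)
print("accuracy of the vote on the training rows:", (majority == y).mean())
```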
B. Bagging vs Boosting
- Conceptually contrast bagging (e.g., Random Forests) with boosting (e.g., XGBoost/LightGBM).
- Compare them in bias–variance terms and discuss typical overfitting/robustness behavior. (A side-by-side sketch follows below.)
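A minimal side-by-side sketch, assuming scikit-learn and using HistGradientBoostingClassifier as a stand-in for XGBoost/LightGBM; the synthetic data and settings are illustrative, not tuned.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=30, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Bagging: deep trees grown independently on bootstraps, then averaged (variance reduction).
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
# Boosting: shallow trees fitted sequentially, each correcting the previous ones (bias reduction).
gb = HistGradientBoostingClassifier(max_iter=300, random_state=0).fit(X_tr, y_tr)

for name, model in [("random forest", rf), ("gradient boosting", gb)]:
    print(name, "ROC-AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
```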
C. Key Hyperparameters and Their Effects
Discuss the following hyperparameters and how they affect bias, variance, computation, and class imbalance handling (a short sketch follows the list):
- n_estimators
- max_depth
- max_features (a.k.a. mtry)
- min_samples_leaf
- class_weight
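A minimal sketch showing where these hyperparameters appear in scikit-learn's RandomForestClassifier; the values are illustrative starting points, not recommendations.

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=500,         # more trees: lower variance, more compute, no added overfitting risk
    max_depth=None,           # deeper trees: lower bias, higher per-tree variance
    max_features="sqrt",      # mtry: fewer candidate features per split decorrelates trees
    min_samples_leaf=5,       # larger leaves: smoother predictions, more bias, less variance
    class_weight="balanced",  # reweights classes inversely to frequency for imbalance
    n_jobs=-1,
    random_state=0,
)
```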
D. Out-of-Bag (OOB) Error Estimation
- What is OOB error and how is it computed?
- When is OOB reliable, and what are its limitations, especially with heavy class imbalance (e.g., a 1% positive rate) or time-series data? (A short sketch follows below.)
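A minimal OOB sketch, assuming scikit-learn: each tree is evaluated on the roughly one third of rows left out of its bootstrap sample, giving a nearly free generalization estimate without a separate validation set. The imbalanced synthetic data is only there to illustrate the caveat in the second question.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with roughly a 1% positive rate to illustrate the imbalance caveat.
X, y = make_classification(n_samples=2000, weights=[0.99, 0.01], random_state=0)

rf = RandomForestClassifier(n_estimators=500, oob_score=True, n_jobs=-1, random_state=0)
rf.fit(X, y)

# oob_score_ is accuracy by default; with 99% negatives it can look excellent even
# if the model never finds a positive. For time series, OOB rows are not "future"
# rows, so the estimate can be optimistic there as well.
print("OOB accuracy:", rf.oob_score_)
print("OOB class-probability matrix shape:", rf.oob_decision_function_.shape)
```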
E. Evaluation: Classification vs Regression
- Recommend metrics for classification (e.g., ROC-AUC, PR-AUC, log loss, accuracy) and explain when accuracy is misleading.
- Recommend metrics for regression (e.g., RMSE, MAE, R²) and explain when R² is misleading. (A metrics sketch follows below.)
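A minimal sketch of computing the metrics named above with scikit-learn; the tiny arrays are hypothetical placeholders standing in for real predictions.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, average_precision_score, log_loss,
                             mean_absolute_error, mean_squared_error,
                             r2_score, roc_auc_score)

# Classification: prefer ranking/probabilistic metrics; accuracy shown for contrast.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 0, 0, 0])
y_prob = np.array([0.1, 0.2, 0.05, 0.3, 0.8, 0.4, 0.6, 0.2, 0.1, 0.15])
print("ROC-AUC :", roc_auc_score(y_true, y_prob))
print("PR-AUC  :", average_precision_score(y_true, y_prob))  # baseline equals the positive rate
print("log loss:", log_loss(y_true, y_prob))
print("accuracy:", accuracy_score(y_true, (y_prob >= 0.5).astype(int)))  # misleading under imbalance

# Regression: error magnitude (RMSE, MAE) plus explained variance (R²).
y_r = np.array([3.0, 5.0, 2.5, 7.0])
y_hat = np.array([2.8, 5.4, 2.9, 6.5])
print("RMSE:", mean_squared_error(y_r, y_hat) ** 0.5)
print("MAE :", mean_absolute_error(y_r, y_hat))
print("R²  :", r2_score(y_r, y_hat))  # misleading when the target has little variance
```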
F. Handling 1% Positive-Rate Imbalance
Describe practical steps for the following (a combined sketch follows the list):
- Threshold selection (including cost-sensitive thresholds or top-k selection)
- Cost-sensitive learning (e.g., class_weight)
- Calibrated probabilities
- Evaluation with PR curves (and interpretation of the baseline)
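A combined sketch of these steps under a roughly 1% positive rate, assuming scikit-learn; the misclassification costs (missed positive = 50, false alarm = 1) and the top-k budget of 100 are hypothetical values used only to illustrate the mechanics.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10000, weights=[0.99, 0.01], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, test_size=0.3, random_state=0)

# Cost-sensitive learning via class_weight, then isotonic calibration on CV folds
# so the predicted probabilities are usable for thresholding.
rf = RandomForestClassifier(n_estimators=200, class_weight="balanced_subsample",
                            n_jobs=-1, random_state=0)
clf = CalibratedClassifierCV(rf, method="isotonic", cv=3).fit(X_tr, y_tr)
p = clf.predict_proba(X_te)[:, 1]

# PR evaluation: the no-skill baseline is the positive rate (~1%), not 0.5.
print("PR-AUC:", average_precision_score(y_te, p), "| baseline:", y_te.mean())

# Cost-sensitive threshold with hypothetical costs: flag whenever the expected
# cost of ignoring a case exceeds the cost of acting on it.
cost_fn, cost_fp = 50.0, 1.0
threshold = cost_fp / (cost_fp + cost_fn)
print("threshold:", threshold, "| flagged:", int((p >= threshold).sum()))

# Top-k alternative: when capacity allows only k investigations, take the k highest scores.
top_k = p.argsort()[::-1][:100]
print("positives among top 100:", int(y_te[top_k].sum()))
```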
G. Feature Importance and Pitfalls
Explain the following (a permutation-importance sketch follows the list):
- Impurity-based importance and its biases
- Permutation importance (including OOB or validation-based)
- Grouped/conditional permutations for correlated features
- Leakage pitfalls and how to avoid them
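A minimal sketch contrasting impurity-based and permutation importance on a held-out split, assuming scikit-learn and synthetic data; grouped/conditional permutations for correlated features would shuffle blocks of related columns together rather than one column at a time, which this sketch does not show.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=15, n_informative=5, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)
rf = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=0).fit(X_tr, y_tr)

# Impurity-based importance: computed from training-time splits and biased toward
# high-cardinality / continuous features.
print("top features by impurity:", np.argsort(rf.feature_importances_)[::-1][:5])

# Permutation importance on held-out data: the drop in a chosen score when a
# feature's values are shuffled. A feature that dominates both rankings far too
# strongly is a common symptom of leakage and worth auditing.
result = permutation_importance(rf, X_va, y_va, n_repeats=10,
                                scoring="roc_auc", random_state=0, n_jobs=-1)
print("top features by permutation:", np.argsort(result.importances_mean)[::-1][:5])
```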
H. When Random Forests Underperform vs Gradient Boosting
- Why and when might Random Forests underperform compared to XGBoost/LightGBM?
- Provide a scenario where Random Forests are preferable.
I. Concrete Validation Plan for a Tabular Dataset
Provide a step-by-step, reproducible plan to validate a Random Forest (a code skeleton follows the list), including:
- Train/validation/test split strategy (i.i.d. vs time-based)
- Cross-validation setup
- Early-stopping proxies for Random Forests
- Threshold tuning, probability calibration, and final evaluation
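A skeleton of such a plan as code, assuming scikit-learn and an i.i.d. dataset; for time-ordered data, swap StratifiedKFold for TimeSeriesSplit and make the final test split the most recent period. The sizes, fold counts, and metric are illustrative defaults, not prescriptions.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)

# 1) Hold out a final test set that is touched exactly once, at the very end.
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

# 2) Stratified cross-validation on the development set preserves the positive rate per fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
rf = RandomForestClassifier(n_estimators=500, min_samples_leaf=5,
                            class_weight="balanced", n_jobs=-1, random_state=0)
scores = cross_val_score(rf, X_dev, y_dev, cv=cv, scoring="average_precision")
print("CV PR-AUC:", scores.mean(), "+/-", scores.std())

# 3) Early-stopping proxy: monitor OOB score (oob_score=True) as n_estimators grows to
#    pick a tree count; forests do not overfit from adding trees, they only cost more.
#    (Not run here to keep the sketch short.)

# 4) Calibrate probabilities inside the CV folds, refit on all development data,
#    tune the decision threshold on held-out predictions, then score the test set once.
final = CalibratedClassifierCV(rf, method="isotonic", cv=cv).fit(X_dev, y_dev)
p_test = final.predict_proba(X_test)[:, 1]
print("final test PR-AUC:", average_precision_score(y_test, p_test))
```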