Explain random forests, bagging, and evaluation
Company: Amazon
Role: Data Scientist
Category: Machine Learning
Difficulty: hard
Interview Round: Onsite
Explain how a Random Forest classifier aggregates bootstrapped decision trees and how feature subsampling reduces correlation. Contrast bagging with boosting conceptually and in bias–variance terms. Cover:
- Key hyperparameters (n_estimators, max_depth, max_features, min_samples_leaf, class_weight) and their effects.
- Out-of-bag (OOB) error estimation: what it is, when it’s reliable, and limitations with heavy class imbalance or time-series.
- Classification vs regression evaluation: choose metrics for each (e.g., ROC-AUC, PR-AUC, log loss, RMSE) and when accuracy or R^2 is misleading.
- Handling 1% positive-rate imbalance: threshold selection, cost-sensitive learning, calibrated probabilities, and evaluation on PR curves.
- Feature importance: impurity-based biases, permutation importance, grouped permutations, and leakage pitfalls.
- When RF underperforms vs gradient boosting (e.g., XGBoost/LightGBM) and why; give a scenario where RF is preferable.
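A strong answer can ground the points above in code. The sketch below (a minimal illustration with synthetic data; all hyperparameter values are arbitrary) shows OOB error estimation, cost-sensitive class weighting for an imbalanced target, PR-AUC as the evaluation metric, and permutation importance computed on held-out data to avoid impurity-based bias:

```python
# Illustrative sketch: Random Forest with OOB score, class weighting,
# PR-AUC evaluation, and permutation importance on synthetic imbalanced data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

# Synthetic tabular data with ~5% positives to mimic class imbalance.
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=4,
    weights=[0.95, 0.05], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=0,
)

rf = RandomForestClassifier(
    n_estimators=300,          # more trees -> lower ensemble variance
    max_features="sqrt",       # feature subsampling decorrelates trees
    min_samples_leaf=2,        # regularizes individual trees
    class_weight="balanced",   # cost-sensitive weighting for rare positives
    oob_score=True,            # out-of-bag estimate, no extra holdout needed
    random_state=0,
    n_jobs=-1,
).fit(X_tr, y_tr)

print("OOB accuracy:", round(rf.oob_score_, 3))

# With 1-5% positives, PR-AUC (average precision) is far more informative
# than raw accuracy, which a majority-class predictor can already maximize.
proba = rf.predict_proba(X_te)[:, 1]
print("PR-AUC:", round(average_precision_score(y_te, proba), 3))

# Permutation importance on held-out data avoids the impurity-based bias
# toward high-cardinality or high-variance features.
perm = permutation_importance(rf, X_te, y_te, n_repeats=5, random_state=0)
top = np.argsort(perm.importances_mean)[::-1][:3]
print("Top features by permutation importance:", top.tolist())
```

Note that the OOB accuracy here is still a plain accuracy under imbalance, which is exactly the limitation the question asks about; the PR-AUC on the held-out split is the number to trust.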
Provide a concrete plan to validate a Random Forest on a tabular dataset, including cross-validation, early stopping proxies, and a reproducible train/validation/test split.
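One way such a validation plan might look in code (a hedged sketch, not a prescribed answer; split ratios, seeds, and hyperparameters are illustrative): a fixed-seed stratified train/validation/test split, stratified K-fold cross-validation on the training set, and an OOB-error curve over `n_estimators` as the early-stopping proxy, since Random Forests have no native early stopping:

```python
# Illustrative validation plan: reproducible splits, stratified CV, and an
# OOB curve as an early-stopping proxy for n_estimators.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (
    StratifiedKFold, cross_val_score, train_test_split,
)

X, y = make_classification(n_samples=1500, n_features=12, random_state=42)

# 60/20/20 reproducible split: hold out the test set first,
# then carve a validation set from the remainder.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=42)

# Stratified K-fold CV on the training set for hyperparameter comparison.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    RandomForestClassifier(n_estimators=200, random_state=42),
    X_train, y_train, cv=cv, scoring="roc_auc")
print("CV ROC-AUC: %.3f +/- %.3f" % (scores.mean(), scores.std()))

# Early-stopping proxy: grow the forest in stages and watch OOB error;
# stop adding trees once the curve flattens.
oob_curve = {}
for n in (50, 100, 200, 400):
    rf = RandomForestClassifier(
        n_estimators=n, oob_score=True, random_state=42, n_jobs=-1,
    ).fit(X_train, y_train)
    oob_curve[n] = 1.0 - rf.oob_score_
print("OOB error by n_estimators:",
      {k: round(v, 3) for k, v in oob_curve.items()})

# Touch the test set exactly once, after all model selection is done.
final = RandomForestClassifier(n_estimators=200, random_state=42)
final.fit(X_train, y_train)
print("Test accuracy:", round(final.score(X_test, y_test), 3))
```

For time-series data the splits above would need to be replaced with time-ordered splits (e.g. `TimeSeriesSplit`), since both random CV folds and OOB samples can leak future information into training.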
Quick Answer: This question evaluates understanding of ensemble learning and model evaluation: Random Forest aggregation, feature subsampling, bagging versus boosting, hyperparameter effects, out-of-bag (OOB) error, class-imbalance handling, feature-importance interpretation, and validation strategies for supervised tabular data. It is commonly asked to assess a candidate's reasoning about bias–variance trade-offs, robustness, and evaluation choices for production-ready models; the domain is ensemble methods and model evaluation, and the required level combines conceptual understanding with practical application.