Scenario
Product-facing data science interview on choosing and configuring tree-based ensemble models for tabular prediction in a production setting.
Question
Compare Random Forests (RF) with Gradient Boosted Decision Trees (GBDT), such as XGBoost.
- What are the key differences in how they learn and generalize (bias–variance, overfitting control, interpretability, training/inference parallelism)?
- In production, when would you prefer one over the other?
- Do tree-based models require feature scaling or normalization? Explain the theoretical reason and any practical exceptions. (A quick demonstration follows this list.)
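A minimal sketch of the scale-invariance argument, assuming scikit-learn and synthetic data: tree splits are threshold comparisons on one feature at a time, so any strictly monotonic per-feature rescaling (here, standardization) preserves the induced partition and hence the predictions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_scaled = StandardScaler().fit_transform(X)  # affine, hence monotonic, per-feature map

# Same hyperparameters and random_state for both fits; only the feature scale differs.
tree_raw = DecisionTreeClassifier(random_state=0).fit(X, y)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(X_scaled, y)

# Thresholds differ numerically, but the partition (and predictions) should match,
# up to rare floating-point ties between equally good splits.
agreement = (tree_raw.predict(X) == tree_scaled.predict(X_scaled)).mean()
print(f"Prediction agreement, raw vs. scaled: {agreement:.4f}")  # expect 1.0
```

A practical exception worth mentioning in the answer: scaling can still matter around a tree model, e.g. for regularized linear components in a stacked ensemble or for distance-based preprocessing, even though the trees themselves do not need it.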
Hints
- Bias–variance trade-off, robustness to noise
- Overfitting control: bagging vs. sequential boosting and regularization
- Interpretability options and stability
- Parallelism: independent trees vs. sequential boosting, GPU/CPU considerations (see the RF vs. GBDT sketch after this list)
- Split criteria and invariance to feature scale
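A minimal sketch contrasting the two training regimes, assuming scikit-learn (XGBoost exposes analogous knobs): RF averages independently grown trees, so training is embarrassingly parallel and overfitting is controlled by bagging and decorrelation; GBDT fits shallow trees sequentially to correct previous errors, so boosting rounds cannot be parallelized across stages and overfitting is controlled by the learning rate and early stopping.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# RF: independent, typically deep trees; n_jobs=-1 builds them in parallel.
rf = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=0)
rf.fit(X_tr, y_tr)

# GBDT: each boosting round depends on the previous one, so rounds are
# inherently sequential; learning_rate plus early stopping regularize.
gbdt = HistGradientBoostingClassifier(
    learning_rate=0.1,
    max_iter=500,
    early_stopping=True,
    validation_fraction=0.1,
    random_state=0,
)
gbdt.fit(X_tr, y_tr)

print(f"RF test accuracy:   {rf.score(X_te, y_te):.3f}")
print(f"GBDT test accuracy: {gbdt.score(X_te, y_te):.3f}")
print(f"GBDT stopped after {gbdt.n_iter_} boosting rounds")
```

In an answer, this maps directly to the production trade-off: RF is a robust, low-tuning baseline that trains fast on many cores, while a well-regularized GBDT usually wins on accuracy for tabular data at the cost of more careful tuning and sequential training.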