Technical ML Choice: Random Forest vs. Gradient-Boosted Trees for Large-Scale Binary Classification
Problem Setup
You need to choose between a Random Forest (RF) and a Gradient-Boosted Trees model (GBT; e.g., LightGBM/XGBoost) for a production binary classifier with the following characteristics:
- Data: 1,000,000 rows; 200 features (≈70% numeric, ≈30% categorical; some categorical features have high cardinality, > 1,000 levels)
- Missingness: ≈20% of values missing
- Class imbalance: ≈1:50 positive-to-negative ratio
- Label noise: moderate (≈5–10% flipped labels)
- Feature correlations: strong
- Constraints: strict online prediction latency ≤ 20 ms per example
- Resources: training budget of 60 minutes on 16 vCPUs with 64 GB RAM
- Restrictions: no deep learning
Answer all parts precisely:
- Select RF or GBT for production and justify the choice using bias–variance trade-offs, robustness to label noise and outliers, interaction-modeling capacity, and stability under correlated features. Specify the key risks of your choice.
- List concrete starting hyperparameters and ranges you would tune for both models (RF: n_estimators, max_depth, max_features, min_samples_leaf, class_weight; GBT: learning_rate, n_estimators, max_depth or num_leaves, subsample, colsample_bytree, min_child_samples, reg_alpha, reg_lambda, scale_pos_weight). Explain expected effects on bias/variance and latency.
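One way to make such starting points concrete is as search-space dictionaries. This is a sketch, not tuned values: parameter names follow scikit-learn's RandomForestClassifier and LightGBM's sklearn API, and the ranges are illustrative assumptions for data of roughly this size.

```python
# Hypothetical starting search spaces; comments note the expected
# bias/variance and latency direction for each knob.
rf_space = {
    "n_estimators": [200, 500],       # more trees: lower variance, higher latency
    "max_depth": [12, 16, None],      # deeper: lower bias, higher variance/latency
    "max_features": ["sqrt", 0.3],    # fewer features per split: decorrelates trees
    "min_samples_leaf": [5, 20, 50],  # larger leaves: higher bias, lower variance
    "class_weight": ["balanced", None],
}

gbt_space = {
    "learning_rate": [0.03, 0.1],     # lower rate: needs more trees, less overfit
    "n_estimators": [500, 2000],      # upper bound; early stopping picks the real count
    "num_leaves": [31, 127],          # main capacity knob in LightGBM
    "subsample": [0.7, 1.0],          # row subsampling: variance reduction
    "colsample_bytree": [0.5, 0.8],   # column subsampling under strong correlation
    "min_child_samples": [50, 200],   # larger values resist 5-10% label flips
    "reg_alpha": [0.0, 1.0],          # L1 leaf regularization
    "reg_lambda": [0.0, 5.0],         # L2 leaf regularization
    "scale_pos_weight": [50],         # ~ n_negative / n_positive for 1:50 imbalance
}
```

Latency is driven chiefly by n_estimators × depth (RF) and n_estimators × num_leaves (GBT), so the coarse sweep should log per-example inference time alongside validation score.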
- Describe how you will encode categorical features (e.g., target encoding with an out-of-fold scheme, one-hot, hashing, or native categorical handling) while preventing leakage and preserving latency; include your plan for high-cardinality features.
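A minimal sketch of the out-of-fold target-encoding idea mentioned above, assuming pandas inputs; the function name, smoothing constant, and fold count are illustrative choices, not a prescribed implementation.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(cat: pd.Series, y: pd.Series, n_splits: int = 5,
                      smoothing: float = 20.0, seed: int = 0) -> pd.Series:
    """Leakage-safe target encoding: each row's value is computed only from
    the *other* folds, with smoothing toward the global mean so rare
    high-cardinality levels cannot memorize their own labels."""
    global_mean = y.mean()
    encoded = pd.Series(np.full(len(cat), global_mean, dtype=float), index=cat.index)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(cat):
        # Per-level target mean and count from the fitting folds only.
        stats = y.iloc[fit_idx].groupby(cat.iloc[fit_idx]).agg(["mean", "count"])
        smooth = ((stats["mean"] * stats["count"] + global_mean * smoothing)
                  / (stats["count"] + smoothing))
        # Unseen levels in the encoded fold fall back to the global mean.
        encoded.iloc[enc_idx] = (cat.iloc[enc_idx].map(smooth)
                                 .fillna(global_mean).values)
    return encoded
```

At serving time the encoding must come from a single mapping fit on the full training set (the OOF scheme is for training/validation only), which keeps inference to one dictionary lookup per feature and fits easily in the 20 ms budget.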
- Explain your strategy for class imbalance (class weights vs. sampling vs. loss weighting) and how you will pick the primary metric (e.g., PR-AUC vs. ROC-AUC) and threshold. Include calibration plans (Platt vs. isotonic) and how to validate calibration.
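Two of the mechanics above can be sketched directly: computing the LightGBM/XGBoost-convention class weight, and picking an operating threshold from the precision-recall curve. The precision floor of 0.5 is an assumed product requirement used for illustration.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def scale_pos_weight(y) -> float:
    """LightGBM/XGBoost convention: n_negative / n_positive (~50 here)."""
    y = np.asarray(y)
    return float((y == 0).sum() / max((y == 1).sum(), 1))

def pick_threshold(y_true, y_prob, min_precision: float = 0.5) -> float:
    """Choose the lowest threshold whose precision clears the floor,
    which maximizes recall subject to that precision constraint."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
    ok = precision[:-1] >= min_precision  # thresholds has one fewer entry
    if not ok.any():
        return 0.5  # fallback: no threshold meets the floor
    return float(thresholds[ok].min())
```

For validating calibration, a held-out reliability curve (e.g., `sklearn.calibration.calibration_curve`) plus the Brier score before and after Platt/isotonic fitting is the standard check; isotonic generally needs the larger calibration set of the two.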
- Outline a 60-minute experiment plan: data split protocol (time-aware or stratified K-fold), feature preprocessing, tuning schedule (coarse-to-fine with early stopping for GBT, OOB-based sanity checks for RF), and guardrails to detect leakage. Provide a minute-by-minute or staged budget and a fallback path if training overruns.
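One possible staging of the 60-minute budget, written as data so the allocations can be checked; the split and stage names are assumptions a candidate might reasonably propose, not the required answer.

```python
# (stage, minutes) -- must sum to the 60-minute training budget.
budget_min = [
    ("load + audit: missingness map, leakage scan, split protocol", 5),
    ("preprocessing + out-of-fold categorical encoding", 8),
    ("GBT coarse sweep with early stopping, 3-fold CV", 20),
    ("GBT fine tune around best coarse config", 12),
    ("RF baseline with OOB-score sanity check", 8),
    ("calibration + threshold selection on holdout", 5),
    ("final refit on train+validation", 2),
]
total = sum(minutes for _, minutes in budget_min)

# Fallback if a stage overruns: drop the RF baseline and the fine-tune
# stage, and ship the best early-stopped coarse GBT config.
```

Keeping the schedule explicit makes the overrun fallback mechanical rather than ad hoc.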
- Identify scenarios where RF would likely outperform GBT and vice versa for this dataset. Include how missing value handling, monotonic constraints, correlated features, and distribution shift affect your decision.
- Specify how you will produce and validate feature importances (permutation vs. gain), partial dependence/ICE checks, and SHAP analyses, noting pitfalls under correlation and leakage. Finally, detail how you will meet the 20 ms latency budget at inference (e.g., tree depth limits, model compression, batching).
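The permutation-importance and latency checks asked for above can be sketched on a toy stand-in model; the dataset, model size, and scoring choice are illustrative assumptions, not the production configuration.

```python
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Toy stand-in for the production model; depth kept small as a latency lever.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
model = RandomForestClassifier(n_estimators=50, max_depth=8, n_jobs=-1,
                               random_state=0).fit(X, y)

# Permutation importance on held-out-style data avoids gain importance's
# bias toward high-cardinality features. Caveat: strongly correlated
# features share credit, so interpret them as groups, not individually.
result = permutation_importance(model, X[:500], y[:500],
                                scoring="average_precision",
                                n_repeats=3, random_state=0)

# Crude single-example latency probe against the 20 ms budget.
start = time.perf_counter()
model.predict_proba(X[:1])
latency_ms = (time.perf_counter() - start) * 1e3
```

In production the same probe would run on the exported artifact (e.g., after compiling trees or limiting depth), since Python-side overhead in this sketch is not representative of an optimized serving path.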