Compare Random Forests vs Gradient Boosting rigorously
Company: Amazon
Role: Data Scientist
Category: Machine Learning
Difficulty: hard
Interview Round: Technical Screen
You must choose between a Random Forest (RF) and a Gradient-Boosted Trees model (GBT; e.g., LightGBM/XGBoost) for a binary classification problem with the following characteristics:
- 1,000,000 rows; 200 features (70% numeric, 30% categorical, some with cardinality > 1,000)
- 20% missing values
- Class imbalance of 1:50
- Moderate label noise (estimated 5–10% flipped labels)
- Strong feature correlations
- Strict online prediction latency budget: 20 ms per example
- Training budget: 60 minutes on 16 vCPU, 64 GB RAM
- No deep learning allowed
Answer all parts precisely:
1) Select RF or GBT for production and justify using bias–variance trade-offs, robustness to label noise/outliers, interaction modeling capacity, and stability under correlated features. Specify the key risks of your choice.
2) List concrete starting hyperparameters and ranges you would tune for both models (RF: n_estimators, max_depth, max_features, min_samples_leaf, class_weight; GBT: learning_rate, n_estimators, max_depth or num_leaves, subsample, colsample_bytree, min_child_samples, reg_alpha, reg_lambda, scale_pos_weight). Explain expected effects on bias/variance and latency.
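As a concrete starting point for part 2, the search spaces might look like the following (illustrative values only, not a prescription; names follow scikit-learn's RandomForestClassifier and LightGBM's LGBMClassifier, and each entry is a `(starting value, search range)` pair):

```python
# Illustrative starting points and coarse search ranges for both models.
# Comments note the expected bias/variance/latency effect of each knob.
rf_space = {
    "n_estimators": (200, [100, 500]),         # more trees: lower variance, higher latency
    "max_depth": (None, [8, 16, None]),        # shallower: higher bias, faster inference
    "max_features": ("sqrt", ["sqrt", 0.3]),   # fewer features per split: decorrelates trees
    "min_samples_leaf": (50, [20, 200]),       # larger leaves: smoother fits, robust to label noise
    "class_weight": ("balanced", ["balanced", "balanced_subsample"]),
}
gbt_space = {
    "learning_rate": (0.05, [0.01, 0.1]),      # lower rate: less variance, needs more trees
    "n_estimators": (2000, "early-stopped"),   # generous ceiling; early stopping picks the count
    "num_leaves": (63, [31, 255]),             # main capacity knob; larger: lower bias
    "subsample": (0.8, [0.6, 1.0]),            # row sampling regularizes
    "colsample_bytree": (0.8, [0.5, 1.0]),     # feature sampling helps under correlation
    "min_child_samples": (100, [20, 500]),     # guards against fitting the 5-10% label noise
    "reg_alpha": (0.0, [0.0, 1.0]),            # L1 leaf regularization
    "reg_lambda": (1.0, [0.0, 10.0]),          # L2 leaf regularization
    "scale_pos_weight": (50, [10, 50]),        # ~= neg/pos ratio for the 1:50 imbalance
}
```

For latency, note that `num_leaves`/`max_depth` and the final tree count dominate inference cost in both models, so they should be tuned jointly against the 20 ms budget rather than for accuracy alone.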
3) Describe how you will encode categorical features (e.g., target encoding with out-of-fold scheme, one-hot, hashing, or native categorical handling) while preventing leakage and preserving latency; include your plan for high-cardinality features.
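The out-of-fold scheme in part 3 can be sketched as follows (a minimal pure-NumPy version for illustration; in production a library encoder would be used, but the leakage-prevention idea is the same):

```python
import numpy as np

def oof_target_encode(cats, y, n_folds=5, smoothing=20.0, seed=0):
    """Out-of-fold target encoding: each row's encoding is computed from the
    OTHER folds only, so a row's own label never leaks into its feature.
    Smoothing shrinks rare (high-cardinality) categories toward the global mean."""
    cats = np.asarray(cats)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    fold = rng.integers(0, n_folds, size=len(y))
    global_mean = y.mean()
    enc = np.full(len(y), global_mean)  # unseen categories fall back to the global mean
    for f in range(n_folds):
        train = fold != f
        # Per-category (sum, count) from the training folds only
        stats = {}
        for c, t in zip(cats[train], y[train]):
            s = stats.setdefault(c, [0.0, 0])
            s[0] += t
            s[1] += 1
        # Encode the held-out fold with smoothed category means
        for i in np.where(~train)[0]:
            s = stats.get(cats[i])
            if s is not None:
                enc[i] = (s[0] + smoothing * global_mean) / (s[1] + smoothing)
    return enc
```

At serving time the encoder is just a hash-map lookup fit on the full training set, so it adds negligible latency; the out-of-fold machinery is needed only during training.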
4) Explain your strategy for class imbalance (class weights vs. sampling vs. loss weighting) and how you will pick the primary metric (e.g., PR-AUC vs. ROC-AUC) and threshold. Include calibration plans (Platt vs. isotonic) and how to validate calibration.
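One imbalance-plus-calibration pipeline for part 4 can be sketched with scikit-learn (a toy-scale illustration, assuming class weights rather than resampling, PR-AUC as the primary metric, and Platt/sigmoid calibration):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, brier_score_loss
from sklearn.model_selection import train_test_split

# Imbalanced toy data (~1:50 positives, mirroring the problem statement)
X, y = make_classification(n_samples=10000, n_features=20,
                           weights=[0.98], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Class weights instead of resampling: no duplicated rows, simpler calibration
rf = RandomForestClassifier(n_estimators=50, min_samples_leaf=20,
                            class_weight="balanced", n_jobs=-1, random_state=0)
rf.fit(X_tr, y_tr)

# PR-AUC (average precision) is the primary metric under heavy imbalance;
# ROC-AUC can look deceptively good when negatives dominate
pr_auc = average_precision_score(y_te, rf.predict_proba(X_te)[:, 1])

# Sigmoid (Platt) calibration; isotonic only when positives are plentiful,
# since it overfits on few positive examples. Validate with the Brier score
# (and, in practice, reliability diagrams).
cal = CalibratedClassifierCV(rf, method="sigmoid", cv=3).fit(X_tr, y_tr)
brier = brier_score_loss(y_te, cal.predict_proba(X_te)[:, 1])
```

The decision threshold would then be chosen on the calibrated scores against a business cost curve, not fixed at 0.5.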
5) Outline a 60-minute experiment plan: data split protocol (time-aware or stratified K-fold), feature preprocessing, tuning schedule (coarse-to-fine with early stopping for GBT, OOB-based sanity checks for RF), and guardrails to detect leakage. Provide a minute-by-minute or staged budget and a fallback path if training overruns.
6) Identify scenarios where RF would likely outperform GBT and vice versa for this dataset. Include how missing value handling, monotonic constraints, correlated features, and distribution shift affect your decision.
7) Specify how you will produce and validate feature importances (permutation vs. gain), partial dependence/ICE checks, and SHAP analyses, noting pitfalls under correlation and leakage. Finally, detail how you will meet the 20 ms latency budget at inference (e.g., tree depth limits, model compression, batching).
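The permutation-vs-gain comparison in part 7 can be sketched with scikit-learn (a toy-scale illustration; with the real 200-feature, strongly correlated data one would permute correlated features in groups and cross-check against SHAP, since plain permutation understates each member of a correlated pair):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=10, n_informative=3,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=100, n_jobs=-1,
                            random_state=0).fit(X_tr, y_tr)

# Permutation importance on a HELD-OUT set avoids the bias of impurity
# ("gain") importance toward high-cardinality/continuous features, and a
# near-zero held-out importance for a feature with high gain is itself a
# leakage red flag
result = permutation_importance(rf, X_te, y_te,
                                scoring="average_precision",
                                n_repeats=5, random_state=0)
ranked = np.argsort(result.importances_mean)[::-1]  # most important first
```

For the latency half of part 7, the same held-out set doubles as a benchmarking set: time single-example `predict_proba` calls on the final depth-limited model to confirm the 20 ms budget before deployment.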
Quick Answer: This question evaluates whether a candidate can choose and configure tree-based models (Random Forest vs. Gradient-Boosted Trees) under production constraints: handling high-cardinality categorical features and missingness, mitigating class imbalance and label noise, producing reliable feature importances and calibrated probabilities, and designing an experiment and inference strategy that fits strict latency and resource budgets. It is commonly asked in the Machine Learning domain for Data Scientist roles to probe bias–variance trade-offs, robustness to correlated features and noise, hyperparameter and encoding decisions, and experiment design and evaluation, testing both conceptual understanding and practical judgment for production-ready systems.