Technical ML Choice: Random Forest vs. Gradient-Boosted Trees for Large-Scale Binary Classification
Problem Setup
You need to choose between a Random Forest (RF) and a Gradient-Boosted Trees model (GBT; e.g., LightGBM/XGBoost) for a production binary classifier with the following characteristics:
- Data: 1,000,000 rows; 200 features (≈70% numeric, ≈30% categorical; some categorical features have high cardinality, > 1,000 levels)
- Missingness: ≈20% of values missing
- Class imbalance: ≈1:50 positive-to-negative ratio
- Label noise: moderate (≈5–10% flipped labels)
- Feature correlations: strong
- Constraints: strict online prediction latency ≤ 20 ms per example
- Resources: training budget of 60 minutes on 16 vCPUs with 64 GB RAM
- Restrictions: no deep learning
Answer all parts precisely:
- Select RF or GBT for production and justify the choice using bias–variance trade-offs, robustness to label noise and outliers, interaction-modeling capacity, and stability under correlated features. Specify the key risks of your choice.
- List concrete starting hyperparameters and ranges you would tune for both models (RF: n_estimators, max_depth, max_features, min_samples_leaf, class_weight; GBT: learning_rate, n_estimators, max_depth or num_leaves, subsample, colsample_bytree, min_child_samples, reg_alpha, reg_lambda, scale_pos_weight). Explain expected effects on bias/variance and latency.
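One way to make such starting points concrete is as search-space dictionaries. This is a sketch, not tuned values: parameter names follow scikit-learn's RandomForestClassifier and LightGBM's sklearn API, and the ranges are illustrative assumptions for data of roughly this size.

```python
# Hypothetical starting search spaces; comments note the expected
# bias/variance and latency direction for each knob.
rf_space = {
    "n_estimators": [200, 500],       # more trees: lower variance, higher latency
    "max_depth": [12, 16, None],      # deeper: lower bias, higher variance/latency
    "max_features": ["sqrt", 0.3],    # fewer features per split: decorrelates trees
    "min_samples_leaf": [5, 20, 50],  # larger leaves: higher bias, lower variance
    "class_weight": ["balanced", None],
}

gbt_space = {
    "learning_rate": [0.03, 0.1],     # lower rate: needs more trees, less overfit
    "n_estimators": [500, 2000],      # upper bound; early stopping picks the real count
    "num_leaves": [31, 127],          # main capacity knob in LightGBM
    "subsample": [0.7, 1.0],          # row subsampling: variance reduction
    "colsample_bytree": [0.5, 0.8],   # column subsampling under strong correlation
    "min_child_samples": [50, 200],   # larger values resist 5-10% label flips
    "reg_alpha": [0.0, 1.0],          # L1 leaf regularization
    "reg_lambda": [0.0, 5.0],         # L2 leaf regularization
    "scale_pos_weight": [50],         # ~ n_negative / n_positive for 1:50 imbalance
}
```

Latency is driven chiefly by n_estimators × depth (RF) and n_estimators × num_leaves (GBT), so the coarse sweep should log per-example inference time alongside validation score.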
- Describe how you will encode categorical features (e.g., target encoding with an out-of-fold scheme, one-hot, hashing, or native categorical handling) while preventing leakage and preserving latency; include your plan for high-cardinality features.
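A minimal sketch of the out-of-fold target-encoding idea mentioned above, assuming pandas inputs; the function name, smoothing constant, and fold count are illustrative choices, not a prescribed implementation.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(cat: pd.Series, y: pd.Series, n_splits: int = 5,
                      smoothing: float = 20.0, seed: int = 0) -> pd.Series:
    """Leakage-safe target encoding: each row's value is computed only from
    the *other* folds, with smoothing toward the global mean so rare
    high-cardinality levels cannot memorize their own labels."""
    global_mean = y.mean()
    encoded = pd.Series(np.full(len(cat), global_mean, dtype=float), index=cat.index)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(cat):
        # Per-level target mean and count from the fitting folds only.
        stats = y.iloc[fit_idx].groupby(cat.iloc[fit_idx]).agg(["mean", "count"])
        smooth = ((stats["mean"] * stats["count"] + global_mean * smoothing)
                  / (stats["count"] + smoothing))
        # Unseen levels in the encoded fold fall back to the global mean.
        encoded.iloc[enc_idx] = (cat.iloc[enc_idx].map(smooth)
                                 .fillna(global_mean).values)
    return encoded
```

At serving time the encoding must come from a single mapping fit on the full training set (the OOF scheme is for training/validation only), which keeps inference to one dictionary lookup per feature and fits easily in the 20 ms budget.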
- Explain your strategy for class imbalance (class weights vs. sampling vs. loss weighting) and how you will pick the primary metric (e.g., PR-AUC vs. ROC-AUC) and threshold. Include calibration plans (Platt vs. isotonic) and how to validate calibration.
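Two of the mechanics above can be sketched directly: computing the LightGBM/XGBoost-convention class weight, and picking an operating threshold from the precision-recall curve. The precision floor of 0.5 is an assumed product requirement used for illustration.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def scale_pos_weight(y) -> float:
    """LightGBM/XGBoost convention: n_negative / n_positive (~50 here)."""
    y = np.asarray(y)
    return float((y == 0).sum() / max((y == 1).sum(), 1))

def pick_threshold(y_true, y_prob, min_precision: float = 0.5) -> float:
    """Choose the lowest threshold whose precision clears the floor,
    which maximizes recall subject to that precision constraint."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
    ok = precision[:-1] >= min_precision  # thresholds has one fewer entry
    if not ok.any():
        return 0.5  # fallback: no threshold meets the floor
    return float(thresholds[ok].min())
```

For validating calibration, a held-out reliability curve (e.g., `sklearn.calibration.calibration_curve`) plus the Brier score before and after Platt/isotonic fitting is the standard check; isotonic generally needs the larger calibration set of the two.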
- Outline a 60-minute experiment plan: data split protocol (time-aware or stratified K-fold), feature preprocessing, tuning schedule (coarse-to-fine with early stopping for GBT, OOB-based sanity checks for RF), and guardrails to detect leakage. Provide a minute-by-minute or staged budget and a fallback path if training overruns.
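One possible staging of the 60-minute budget, written as data so the allocations can be checked; the split and stage names are assumptions a candidate might reasonably propose, not the required answer.

```python
# (stage, minutes) -- must sum to the 60-minute training budget.
budget_min = [
    ("load + audit: missingness map, leakage scan, split protocol", 5),
    ("preprocessing + out-of-fold categorical encoding", 8),
    ("GBT coarse sweep with early stopping, 3-fold CV", 20),
    ("GBT fine tune around best coarse config", 12),
    ("RF baseline with OOB-score sanity check", 8),
    ("calibration + threshold selection on holdout", 5),
    ("final refit on train+validation", 2),
]
total = sum(minutes for _, minutes in budget_min)

# Fallback if a stage overruns: drop the RF baseline and the fine-tune
# stage, and ship the best early-stopped coarse GBT config.
```

Keeping the schedule explicit makes the overrun fallback mechanical rather than ad hoc.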
- Identify scenarios where RF would likely outperform GBT and vice versa for this dataset. Include how missing value handling, monotonic constraints, correlated features, and distribution shift affect your decision.
- Specify how you will produce and validate feature importances (permutation vs. gain), partial dependence/ICE checks, and SHAP analyses, noting pitfalls under correlation and leakage. Finally, detail how you will meet the 20 ms latency budget at inference (e.g., tree depth limits, model compression, batching).
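The permutation-importance and latency checks asked for above can be sketched on a toy stand-in model; the dataset, model size, and scoring choice are illustrative assumptions, not the production configuration.

```python
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Toy stand-in for the production model; depth kept small as a latency lever.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
model = RandomForestClassifier(n_estimators=50, max_depth=8, n_jobs=-1,
                               random_state=0).fit(X, y)

# Permutation importance on held-out-style data avoids gain importance's
# bias toward high-cardinality features. Caveat: strongly correlated
# features share credit, so interpret them as groups, not individually.
result = permutation_importance(model, X[:500], y[:500],
                                scoring="average_precision",
                                n_repeats=3, random_state=0)

# Crude single-example latency probe against the 20 ms budget.
start = time.perf_counter()
model.predict_proba(X[:1])
latency_ms = (time.perf_counter() - start) * 1e3
```

In production the same probe would run on the exported artifact (e.g., after compiling trees or limiting depth), since Python-side overhead in this sketch is not representative of an optimized serving path.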