Compare Random Forests vs Gradient Boosting rigorously

Last updated: Mar 29, 2026

Quick Overview

This question tests whether a candidate can choose and configure tree-based models (Random Forest vs. Gradient-Boosted Trees); handle high-cardinality categorical features and missing values; mitigate class imbalance and label noise; produce reliable feature importances and calibrated probabilities; and design an experiment and inference strategy that meets strict latency and resource constraints. It is commonly asked of Data Scientist candidates in the Machine Learning domain to probe bias–variance trade-offs, robustness to correlated features and noise, hyperparameter and encoding decisions, and experiment design and evaluation. It tests both conceptual understanding and practical judgment for production-ready systems.

Company: Amazon

Role: Data Scientist

Category: Machine Learning

Difficulty: hard

Interview Round: Technical Screen

Oct 13, 2025, 9:49 PM

Technical ML Choice: Random Forest vs. Gradient-Boosted Trees for Large-Scale Binary Classification

Problem Setup

You need to choose between a Random Forest (RF) and a Gradient-Boosted Trees model (GBT; e.g., LightGBM/XGBoost) for a production binary classifier with the following characteristics:

  • Data: 1,000,000 rows; 200 features (≈70% numeric, ≈30% categorical; some categorical features have more than 1,000 distinct values)
  • Missingness: ≈20% of values missing
  • Class imbalance: ≈1:50 positive-to-negative ratio
  • Label noise: moderate (≈5–10% flipped labels)
  • Feature correlations: strong
  • Constraints: strict online prediction latency ≤ 20 ms per example
  • Resources: training budget 60 minutes on 16 vCPU, 64 GB RAM
  • Restrictions: no deep learning
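
To make the scale of the setup concrete, the stated figures can be turned into a few back-of-envelope numbers. This is a minimal sketch: every input is one of the approximate values listed above, the variable names are illustrative, and the negative-to-positive count ratio is shown only as a common starting heuristic for a positive-class weight, not as a tuned value.

```python
# Back-of-envelope numbers derived from the problem setup.
# All inputs are the approximate figures stated in the question.
n_rows = 1_000_000
n_features = 200
neg_per_pos = 50        # class imbalance of roughly 1:50 (positive:negative)
missing_rate = 0.20     # roughly 20% of values missing

# Positive prevalence; a no-skill classifier's PR-AUC is about this value.
positive_rate = 1 / (1 + neg_per_pos)
n_pos = round(n_rows * positive_rate)
n_neg = n_rows - n_pos

# Common starting heuristic for a positive-class weight (assumption, not tuned).
scale_pos_weight = n_neg / n_pos

# Feature-type split and amount of missing data.
n_categorical = round(0.30 * n_features)
n_numeric = n_features - n_categorical
missing_cells = round(n_rows * n_features * missing_rate)

# Memory for a dense float32 matrix, to compare against the 64 GB budget.
dense_float32_gb = n_rows * n_features * 4 / 1e9

print(f"positives: {n_pos}, negatives: {n_neg}, prevalence: {positive_rate:.4f}")
print(f"scale_pos_weight starting point: {scale_pos_weight:.1f}")
print(f"numeric: {n_numeric}, categorical: {n_categorical}")
print(f"missing cells: {missing_cells}, dense float32 size: {dense_float32_gb} GB")
```

With only about 20,000 positives, any validation fold holds a few thousand positive examples at most, which is worth keeping in mind when reading metric differences between candidate models.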

Answer all parts precisely:

  1. Select RF or GBT for production and justify using bias–variance trade-offs, robustness to label noise/outliers, interaction modeling capacity, and stability under correlated features. Specify key risks of your choice.
  2. List concrete starting hyperparameters and ranges you would tune for both models (RF: n_estimators, max_depth, max_features, min_samples_leaf, class_weight; GBT: learning_rate, n_estimators, max_depth or num_leaves, subsample, colsample_bytree, min_child_samples, reg_alpha, reg_lambda, scale_pos_weight). Explain expected effects on bias/variance and latency.
  3. Describe how you will encode categorical features (e.g., target encoding with out-of-fold scheme, one-hot, hashing, or native categorical handling) while preventing leakage and preserving latency; include your plan for high-cardinality features.
  4. Explain your strategy for class imbalance (class weights vs. sampling vs. loss weighting) and how you will pick the primary metric (e.g., PR-AUC vs. ROC-AUC) and threshold. Include calibration plans (Platt vs. isotonic) and how to validate calibration.
  5. Outline a 60-minute experiment plan: data split protocol (time-aware or stratified K-fold), feature preprocessing, tuning schedule (coarse-to-fine with early stopping for GBT, OOB-based sanity checks for RF), and guardrails to detect leakage. Provide a minute-by-minute or staged budget and a fallback path if training overruns.
  6. Identify scenarios where RF would likely outperform GBT and vice versa for this dataset. Include how missing value handling, monotonic constraints, correlated features, and distribution shift affect your decision.
  7. Specify how you will produce and validate feature importances (permutation vs. gain), partial dependence/ICE checks, and SHAP analyses, noting pitfalls under correlation and leakage. Finally, detail how you will meet the 20 ms latency budget at inference (e.g., tree depth limits, model compression, batching).
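
Part 3 mentions target encoding with an out-of-fold scheme. As a rough illustration of what that means, the sketch below encodes one categorical column so that each row's value is computed only from labels in the other folds. It is an assumption-laden example, not a reference answer: the function name `oof_target_encode`, the index-based fold assignment, and the `smoothing` default are all hypothetical choices made for illustration, and a real pipeline would likely shuffle or split by time.

```python
from collections import defaultdict


def oof_target_encode(categories, labels, n_folds=5, smoothing=20.0):
    """Out-of-fold smoothed mean encoding for a single categorical column.

    Each row is encoded using statistics from the other folds only,
    so a row's own label never contributes to its encoded value.
    """
    n = len(categories)
    encoded = [0.0] * n
    for fold in range(n_folds):
        # Illustrative fold assignment by row index (assumption).
        train_idx = [i for i in range(n) if i % n_folds != fold]
        valid_idx = [i for i in range(n) if i % n_folds == fold]

        # Global positive rate on the training part of this fold.
        prior = sum(labels[i] for i in train_idx) / len(train_idx)

        sums = defaultdict(float)
        counts = defaultdict(int)
        for i in train_idx:
            sums[categories[i]] += labels[i]
            counts[categories[i]] += 1

        for i in valid_idx:
            c = categories[i]
            # Shrink toward the prior; rare categories stay close to it,
            # and categories unseen in the training part get the prior itself.
            encoded[i] = (sums.get(c, 0.0) + smoothing * prior) / (
                counts.get(c, 0) + smoothing
            )
    return encoded


# Tiny demonstration with made-up data.
demo_cats = ["a", "a", "b", "b", "a", "rare"]
demo_labels = [1, 0, 0, 0, 1, 1]
demo_enc = oof_target_encode(demo_cats, demo_labels, n_folds=2, smoothing=1.0)
print(demo_enc)
```

At serving time the encoding reduces to a dictionary lookup built from the full training set, with the prior as the fallback for unseen categories, so it adds little to the 20 ms latency budget. The smoothing term is what keeps categories with very few rows from memorizing noisy labels.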
