Explain and tune decision trees robustly
Company: Point72
Role: Data Scientist
Category: Machine Learning
Difficulty: hard
Interview Round: Take-home Project
You built a decision tree during an internship. Answer the following with crisp formulas and procedures:
1) Explain how a CART decision tree selects splits for classification vs. regression (impurity/variance criteria), including exact formulas for Gini, entropy, and MSE, and how surrogate splits work when features have missing values.
2) Give a defensible procedure to choose max_depth and min_samples_split: define a cross‑validation plan, early‑stopping/pruning (cost‑complexity α path), and the metric you would optimize under severe class imbalance (justify PR‑AUC vs. ROC‑AUC vs. F1). Include how you would pick α from the CCP path without leakage.
3) Overfitting checks: specify at least three diagnostics (e.g., cross‑validated gap vs. training, learning curves, permutation importance stability, calibration curves). What patterns flag overfitting for trees specifically?
4) With ~500k rows, ~300 features including high‑cardinality categoricals and sparse indicators, propose a preprocessing + modeling plan using a single decision tree: encoding choice, handling rare categories, monotonic constraints (if any), feature binning, and computational cost. Provide concrete hyperparameter ranges and expected training time order‑of‑magnitude.
5) If you could revisit the project, when would a random forest or a gradient‑boosted tree (e.g., XGBoost/LightGBM) outperform a single tree on this data? Name at least three data/target conditions and the trade‑offs (variance, interpretability, latency, OOB vs. CV, calibration). How would you compare models fairly (data splits, nested CV, fixed preprocessing, and identical evaluation protocol)?
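A strong answer to question 1 can state the criteria directly in code. The sketch below (function names are illustrative) implements the Gini impurity, Shannon entropy, and MSE formulas, plus CART's size-weighted score for a candidate split:

```python
import numpy as np

def gini(y):
    """Gini impurity: 1 - sum_k p_k^2 over class proportions p_k."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(y):
    """Shannon entropy: -sum_k p_k * log2(p_k)."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def mse(y):
    """Regression criterion: mean squared deviation from the node mean."""
    return np.mean((y - np.mean(y)) ** 2)

def weighted_split_impurity(y_left, y_right, criterion):
    """CART scores a split by the size-weighted impurity of the children;
    the chosen split maximizes the decrease from the parent's impurity."""
    n_l, n_r = len(y_left), len(y_right)
    n = n_l + n_r
    return (n_l / n) * criterion(y_left) + (n_r / n) * criterion(y_right)
```

For example, a 50/50 binary node has Gini 0.5 and entropy 1.0, and a split that perfectly separates the classes has weighted child impurity 0.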
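For question 2's leakage-free α selection, one defensible procedure is: compute the cost-complexity path on the training split only, cross-validate each candidate α inside that split with an imbalance-aware metric, then refit once. A minimal scikit-learn sketch on synthetic stand-in data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data stands in for the real project data.
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# 1) Compute the cost-complexity path on the TRAINING split only,
#    so the held-out test set never influences the choice of alpha.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
# Drop the last alpha (root-only tree), guard against tiny negative
# values from floating-point noise, and thin the path for speed.
alphas = np.clip(path.ccp_alphas[:-1], 0.0, None)[::10]

# 2) Cross-validate each candidate alpha inside the training split,
#    scoring with average precision (an estimate of PR-AUC) because
#    the positive class is rare.
cv_scores = [
    cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0),
                    X_tr, y_tr, cv=5, scoring="average_precision").mean()
    for a in alphas
]
best_alpha = alphas[int(np.argmax(cv_scores))]

# 3) Refit once on the full training split with the chosen alpha;
#    the test split is touched only for the final report.
final_tree = DecisionTreeClassifier(ccp_alpha=best_alpha,
                                    random_state=0).fit(X_tr, y_tr)
```

Because the path and the CV both see only `X_tr`, the test split plays no role in picking α, which is the leakage guarantee the question asks for.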
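One of question 3's diagnostics, the cross-validated train/CV gap, can be sketched as follows (synthetic data again): a gap that widens as depth grows while the CV score stalls or falls is the classic single-tree overfitting signature.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Train-vs-CV gap across depths: an unpruned tree drives training
# accuracy to 1.0 by memorizing noise, so its gap dominates.
gaps = {}
for depth in [2, 4, 8, None]:
    res = cross_validate(DecisionTreeClassifier(max_depth=depth, random_state=0),
                         X, y, cv=5, return_train_score=True)
    gaps[depth] = res["train_score"].mean() - res["test_score"].mean()
```

The same loop extends naturally to learning curves (vary `n_samples`) and to permutation-importance stability (repeat across folds and compare rankings).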
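For question 4's rare-category handling, a hypothetical pandas helper that buckets infrequent levels into a single sentinel before encoding, so the tree cannot split on noise-level categories:

```python
import pandas as pd

def collapse_rare(series: pd.Series, min_count: int = 50) -> pd.Series:
    """Replace categories seen fewer than min_count times with a
    single '__rare__' bucket (threshold and sentinel are illustrative)."""
    counts = series.value_counts()
    keep = set(counts[counts >= min_count].index)
    return series.where(series.isin(keep), "__rare__")
```

To avoid leakage, the `keep` set should be computed on the training folds only and then applied unchanged to validation and test data.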
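Question 5's fair-comparison protocol can be sketched as nested cross-validation with an identical scoring rule for every model: the inner loop tunes hyperparameters, the outer loop estimates generalization, and no model ever sees its outer test folds during tuning (scikit-learn, synthetic data, illustrative grids):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=15,
                           weights=[0.85, 0.15], random_state=0)

def nested_cv_score(estimator, param_grid):
    """Inner CV tunes hyperparameters; outer CV scores the tuned model.
    Using the same folds, metric, and grid discipline for every
    candidate is what makes the comparison fair."""
    inner = GridSearchCV(estimator, param_grid, cv=3,
                         scoring="average_precision")
    return cross_val_score(inner, X, y, cv=5,
                           scoring="average_precision").mean()

tree_score = nested_cv_score(DecisionTreeClassifier(random_state=0),
                             {"max_depth": [3, 6, None]})
forest_score = nested_cv_score(RandomForestClassifier(n_estimators=100,
                                                      random_state=0),
                               {"max_depth": [6, None]})
```

Any fixed preprocessing (e.g., the rare-category bucketing from question 4) would go inside a `Pipeline` passed as the estimator, so it is refit within each fold rather than on the full data.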
Quick Answer: This question evaluates a candidate's command of CART mechanics (split criteria and surrogate splits for missing values), principled hyperparameter tuning and cost-complexity pruning, overfitting diagnostics, scalable preprocessing for high-cardinality categoricals, and the conditions under which ensembles outperform a single tree.