Select and tune XGBoost hyperparameters
Company: OneMain Financial
Role: Data Scientist
Category: Machine Learning
Difficulty: hard
Interview Round: Technical Screen
You have a binary classification dataset with 1,000,000 rows, 100 features (20 numeric, 80 categorical one-hot encoded), and a positive class rate of 1%. Training must finish in ≤5 minutes on a single 16-core CPU with 32 GB RAM.
1) Propose initial XGBoost hyperparameters (eta/learning_rate, max_depth, min_child_weight, subsample, colsample_bytree, lambda, alpha, n_estimators, max_bin or tree_method) and justify each in terms of bias–variance, class imbalance, and compute constraints.
2) Describe an efficient tuning strategy (search space, early stopping, cross-validation scheme that prevents leakage from users appearing in multiple folds).
3) Explain exactly how XGBoost handles missing values during tree splitting and how that interacts with one-hot encoding vs target encoding.
4) Given severe minority-class scarcity, compare using scale_pos_weight vs weighted loss vs focal loss; when would each be preferable?
Quick Answer: This question evaluates skills in selecting and tuning XGBoost hyperparameters, managing severe class imbalance and sparse one‑hot encodings, handling missing values, and designing compute‑efficient training and grouped cross‑validation to prevent user‑level leakage.
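As a rough illustration of the kind of answer parts 1 and 2 invite, the sketch below builds one possible starting parameter dictionary and a group-aware fold assignment. The row count, positive rate, and core count come from the question. Every hyperparameter value is an assumed starting point rather than a tuned or recommended setting, and `assign_fold` is a hypothetical helper that presumes each row carries a string user identifier.

```python
import hashlib

# Dataset facts taken from the question.
N_ROWS = 1_000_000
POS_RATE = 0.01
N_CORES = 16

n_pos = round(N_ROWS * POS_RATE)   # 10,000 positives
n_neg = N_ROWS - n_pos             # 990,000 negatives

# Assumed starting values (illustrative only, not tuned results).
params = {
    "objective": "binary:logistic",
    "eval_metric": "aucpr",             # PR-AUC is more informative than accuracy at a 1% base rate
    "tree_method": "hist",              # histogram-based splits to fit the time budget
    "max_bin": 256,
    "eta": 0.1,                         # moderate rate so early stopping triggers quickly
    "max_depth": 6,
    "min_child_weight": 5,              # discourages leaves built on a handful of positives
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "lambda": 1.0,                      # L2 regularization
    "alpha": 0.0,                       # L1 regularization
    "scale_pos_weight": n_neg / n_pos,  # common heuristic: negatives / positives
    "nthread": N_CORES,
}


def assign_fold(user_id: str, n_folds: int = 5) -> int:
    """Map a user to a fold deterministically, so all of a user's rows share one fold.

    Hypothetical helper: hashing the user id keeps a user from appearing in
    both the training and validation portions of any split.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_folds


print(params["scale_pos_weight"])
print(assign_fold("user_42"))
```

A dictionary like this could be passed to `xgboost.train` along with `num_boost_round` and `early_stopping_rounds`, using a validation set built from held-out folds. Hash-based assignment does not guarantee equal fold sizes or equal positive rates per fold, so with only 1% positives it is worth checking the per-fold positive counts before relying on the split.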