Explain and tune XGBoost; prevent overfitting
Company: TikTok
Role: Data Scientist
Category: Machine Learning
Difficulty: hard
Interview Round: Technical Screen
Explain XGBoost's tree booster in enough detail to answer the following (a worked gain formula and code sketches appear after the Quick Answer):

(a) What objective does it optimize, and how does the second-order Taylor approximation lead to the split "gain" formula? Explain the roles of lambda (L2), alpha (L1), gamma (min_split_loss), and the learning rate (eta) in that gain and in pruning.

(b) List the most impactful hyperparameters for tabular classification and, for each, give the expected direction of effect on bias, variance, and training time: max_depth, max_leaves, min_child_weight, subsample, colsample_bytree/colsample_bylevel, eta, n_estimators, lambda, alpha, gamma, max_delta_step, scale_pos_weight, monotone_constraints.

(c) You must train a model to flag "bad sellers" when the positive rate is 0.5%. Design a tuning plan that minimizes real business cost: specify a data-split strategy that avoids leakage (e.g., time- and seller-based splits), the primary offline metric (e.g., PR-AUC), how to choose an operating threshold from a cost matrix, how to apply early stopping robustly, and the diagnostics/plots you would produce to detect overfitting and data leakage.

(d) After training, how would you calibrate probabilities and interpret the model for investigators (e.g., with SHAP) while preventing attackers from reverse-engineering the rules?
Quick Answer: XGBoost fits each tree to a second-order Taylor expansion of a regularized loss; the closed-form leaf weights yield the split gain shown below, in which lambda shrinks every term, gamma prices each extra leaf (pruning), alpha soft-thresholds the gradient sums, and eta shrinks each finished tree's contribution without entering the gain. Depth, max_leaves, and min_child_weight trade bias against variance; row and column subsampling cut variance and training time; eta together with n_estimators sets the compute/generalization budget. For the 0.5%-positive seller task: split by time and hold out unseen sellers, optimize PR-AUC, early-stop on the leakage-safe fold, pick the alert threshold from the business cost matrix, then calibrate scores (isotonic or Platt, essential if scale_pos_weight was used) and give investigators SHAP-based reason codes rather than raw thresholds. The worked formula and sketches follow.
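For part (a), the derivation follows the XGBoost paper, with g_i and h_i the first and second derivatives of the loss and G, H their sums over a node's instances:

```latex
\[
\mathcal{L}^{(t)} \simeq \sum_{i=1}^{n}\Big[g_i f_t(x_i) + \tfrac{1}{2}\, h_i f_t(x_i)^2\Big]
  + \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 + \alpha \sum_{j=1}^{T} |w_j|
\]
With $\alpha = 0$, the optimal weight of leaf $j$ and the gain of a candidate split are
\[
w_j^{*} = -\frac{G_j}{H_j + \lambda}, \qquad
\mathrm{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda}
  - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma .
\]
```

So lambda damps every term, gamma is the fixed price of the extra leaf (splits whose gain does not cover it are pruned), alpha replaces G_j with the soft-thresholded sgn(G_j) max(|G_j| - alpha, 0), and eta multiplies the finished tree's predictions, shrinking the step size without appearing in the gain.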
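For part (b), a minimal sketch of a defensible starting configuration; the values are illustrative assumptions to tune, and each comment notes the usual direction of effect:

```python
# Illustrative XGBoost starting point for imbalanced tabular classification.
# All values are assumptions, not recommendations.
params = {
    "objective": "binary:logistic",
    "tree_method": "hist",
    "max_depth": 6,             # deeper: lower bias, higher variance, slower
    "max_leaves": 0,            # >0 caps leaf count under grow_policy="lossguide"
    "min_child_weight": 10,     # min sum of hessian per child; larger: higher bias, lower variance
    "subsample": 0.8,           # <1: lower variance, faster per tree
    "colsample_bytree": 0.8,    # <1: decorrelates trees, lower variance
    "eta": 0.05,                # smaller: better generalization, needs more rounds (slower)
    "lambda": 1.0,              # L2 on leaf weights: shrinks weights and damps gain
    "alpha": 0.0,               # L1 on leaf weights: soft-thresholds gradient sums
    "gamma": 0.5,               # min loss reduction to split: prunes, raises bias
    "max_delta_step": 1,        # caps each leaf update; stabilizes rare-positive logistic fits
    "scale_pos_weight": 199,    # ~neg/pos at a 0.5% positive rate; boosts recall, skews calibration
    # "monotone_constraints": "(1,-1,0)",  # per-feature +1/-1/0 signs when domain knowledge demands it
    "eval_metric": "aucpr",     # PR-AUC tracks rare-positive ranking better than ROC-AUC
}
# n_estimators corresponds to num_boost_round in xgb.train; more rounds lower
# bias and raise variance and training time unless early stopping intervenes.
```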
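For part (c), a runnable sketch on synthetic data; the column names (seller_id, event_time), the positive-rate mechanism, and the cost matrix (500 per missed bad seller, 5 per false alert) are all assumptions:

```python
# Leakage-aware split, early stopping on PR-AUC, and a cost-based threshold.
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
n = 40_000
df = pd.DataFrame({
    "seller_id": rng.integers(0, 200_000, n),   # mostly-unique sellers
    "event_time": rng.uniform(0.0, 1.0, n),
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
})
# Rare positives (roughly 0.5-1%) driven by x1, standing in for "bad sellers".
rate = 0.005 * np.minimum(np.exp(df["x1"].to_numpy()), 5.0)
df["label"] = (rng.uniform(size=n) < rate).astype(int)

# Split by time first, then drop validation sellers seen in training so
# per-seller history cannot leak across the boundary.
train = df[df["event_time"] < 0.8]
valid = df[(df["event_time"] >= 0.8) & ~df["seller_id"].isin(train["seller_id"])]

feats = ["x1", "x2"]
dtrain = xgb.DMatrix(train[feats], label=train["label"])
dvalid = xgb.DMatrix(valid[feats], label=valid["label"])

history = {}
booster = xgb.train(
    {"objective": "binary:logistic", "eta": 0.05, "max_depth": 4,
     "eval_metric": "aucpr", "scale_pos_weight": 199.0},
    dtrain,
    num_boost_round=2000,
    evals=[(dtrain, "train"), (dvalid, "valid")],
    early_stopping_rounds=100,   # always stop on the leakage-safe fold
    evals_result=history,
    verbose_eval=False,
)
# Overfitting/leakage diagnostics: plot history["train"]["aucpr"] against
# history["valid"]["aucpr"] per round; a widening gap means overfitting,
# and a near-perfect validation curve is a classic leakage red flag.

p = booster.predict(dvalid, iteration_range=(0, booster.best_iteration + 1))
print("valid PR-AUC:", average_precision_score(valid["label"], p))

# Choose the alert threshold that minimizes expected business cost.
C_FN, C_FP = 500.0, 5.0   # assumed costs: missed bad seller vs. false alert
y = valid["label"].to_numpy()
grid = np.quantile(p, np.linspace(0.50, 0.999, 200))
cost = [C_FN * ((y == 1) & (p < t)).sum() + C_FP * ((y == 0) & (p >= t)).sum()
        for t in grid]
print("cost-minimizing threshold:", grid[int(np.argmin(cost))])
```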
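For part (d), a sketch that continues from the previous block (it reuses booster, valid, and feats); reusing the early-stopping fold for calibration is a shortcut here, and a separate calibration fold is preferable in practice:

```python
# Isotonic calibration plus TreeSHAP reason codes for investigators.
import numpy as np
import shap
import xgboost as xgb
from sklearn.isotonic import IsotonicRegression

# scale_pos_weight distorts predicted probabilities, so calibration on a
# held-out fold is required before scores are read as risk probabilities.
raw = booster.predict(xgb.DMatrix(valid[feats]),
                      iteration_range=(0, booster.best_iteration + 1))
iso = IsotonicRegression(out_of_bounds="clip")
calibrated = iso.fit_transform(raw, valid["label"])

# TreeSHAP decomposes each flagged case into per-feature contributions
# (in margin space for binary:logistic).
explainer = shap.TreeExplainer(booster)
contrib = explainer.shap_values(valid[feats])   # shape: (n_rows, n_features)

# Surface only coarse, aggregated reason codes (top features by |SHAP|),
# never raw thresholds, to make the rules harder to reverse-engineer.
case = int(np.argmax(calibrated))
for j in np.argsort(-np.abs(contrib[case]))[:2]:
    print(f"reason code: {feats[j]} (shap={contrib[case][j]:+.3f})")
```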