XGBoost Tree Booster: Objective, Hyperparameters, Tuning for Imbalanced Detection, and Post-training Use
Context: You are building a binary classifier with XGBoost (tree booster) to flag “bad sellers.” Positives are rare (≈0.5%). Answer the following:
(a) Objective, Second-Order Approximation, Split Gain, and Regularization Roles
- What objective does XGBoost optimize for tree boosting?
- How does the second-order Taylor approximation lead to the split "gain" formula? (A reference derivation follows this list.)
- Explain the roles of lambda (L2), alpha (L1), gamma (min_split_loss), and learning rate (eta) in split selection and pruning.
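For reference when answering (a), a minimal LaTeX sketch of the standard formulation (notation as in the XGBoost paper; the L1 term alpha enters via soft-thresholding of the leaf weights and is omitted from the closed forms below):

```latex
% Regularized objective at boosting round t
\mathcal{L}^{(t)} = \sum_{i} l\!\left(y_i,\; \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 + \alpha \sum_{j=1}^{T} |w_j|

% Second-order Taylor expansion around the current prediction,
% with g_i = \partial_{\hat{y}} l and h_i = \partial^2_{\hat{y}} l
\tilde{\mathcal{L}}^{(t)} \simeq \sum_{i}\left[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^2 \right] + \Omega(f_t)

% Optimal leaf weight and the resulting split gain
% (G_j, H_j are the sums of g_i, h_i over the instances in leaf j)
w_j^{*} = -\frac{G_j}{H_j + \lambda},
\qquad
\text{Gain} = \tfrac{1}{2}\left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda}
            - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma
```

A split is kept only when the gain is positive, so gamma acts as a direct pruning threshold, lambda shrinks leaf weights (and hence gains), and eta scales each fitted tree's contribution after it is built rather than entering the gain itself.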
(b) Impactful Hyperparameters for Tabular Classification
For each hyperparameter, describe expected direction of effect on bias, variance, and training time:
- max_depth, max_leaves, min_child_weight, subsample, colsample_bytree/colsample_bylevel, eta, n_estimators, lambda, alpha, gamma, max_delta_step, scale_pos_weight, monotone_constraints. (A configuration sketch showing where these sit in the Python API follows this list.)
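For part (b), a non-authoritative sketch of where these knobs live in the Python sklearn wrapper (`xgboost.XGBClassifier`); all values are illustrative placeholders, not recommendations:

```python
import xgboost as xgb

# Placeholder values only; tune via cross-validation on your own data.
clf = xgb.XGBClassifier(
    objective="binary:logistic",
    n_estimators=2000,          # many rounds plus early stopping, not a fixed small number
    learning_rate=0.05,         # eta: smaller -> less variance per round, more rounds needed
    max_depth=6,                # deeper -> lower bias, higher variance, slower training
    max_leaves=0,               # leaf cap; 0 = no limit (mainly with grow_policy="lossguide")
    min_child_weight=5,         # larger -> more conservative splits (higher bias, lower variance)
    subsample=0.8,              # row subsampling per tree
    colsample_bytree=0.8,       # feature subsampling per tree
    colsample_bylevel=1.0,      # feature subsampling per depth level
    reg_lambda=1.0,             # L2 penalty on leaf weights
    reg_alpha=0.0,              # L1 penalty on leaf weights
    gamma=0.0,                  # min_split_loss: minimum gain required to keep a split
    max_delta_step=1,           # caps leaf outputs; often helpful under extreme imbalance
    scale_pos_weight=199,       # roughly negatives/positives for a 0.5% positive rate
    monotone_constraints=None,  # e.g. "(1,-1,0)" to force per-feature monotone effects
    tree_method="hist",
    eval_metric="aucpr",
)
```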
(c) Tuning Plan for 0.5% Positive Rate (“Bad Sellers”)
- Specify a data split strategy that avoids leakage (time- and seller-based splits; see the split/early-stopping sketch after this list).
- Choose the primary offline metric and justify it (e.g., PR-AUC).
- Show how to set an operating threshold using a cost matrix (see the threshold sketch after this list).
- Describe robust early stopping.
- List diagnostics/plots to detect overfitting and data leakage.
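For the split, metric, and early-stopping items in (c), one possible sketch. The column names (`seller_id`, `event_time`, `label`) and the cutoff arguments are assumptions about the data layout, and passing `early_stopping_rounds` to the constructor assumes xgboost >= 1.6:

```python
import xgboost as xgb
from sklearn.metrics import average_precision_score

# Assumed schema: df has 'seller_id', 'event_time', 'label', plus feature columns.
def time_and_seller_split(df, feature_cols, train_end, valid_end):
    """Train on events before `train_end`, validate on a later window, and drop
    validation sellers already seen in training to avoid seller-level leakage
    across the time boundary."""
    train = df[df["event_time"] < train_end]
    valid = df[(df["event_time"] >= train_end) & (df["event_time"] < valid_end)]
    valid = valid[~valid["seller_id"].isin(train["seller_id"])]
    return (train[feature_cols], train["label"],
            valid[feature_cols], valid["label"])

def fit_with_early_stopping(X_tr, y_tr, X_va, y_va):
    # PR-AUC ("aucpr") as the early-stopping metric: with a 0.5% positive rate,
    # ROC-AUC and accuracy are dominated by the negative class.
    clf = xgb.XGBClassifier(
        objective="binary:logistic",
        n_estimators=5000,
        learning_rate=0.05,
        eval_metric="aucpr",
        early_stopping_rounds=100,   # stop after 100 rounds without PR-AUC improvement
        scale_pos_weight=(y_tr == 0).sum() / max((y_tr == 1).sum(), 1),
        tree_method="hist",
    )
    clf.fit(X_tr, y_tr, eval_set=[(X_va, y_va)], verbose=False)
    print("validation PR-AUC:",
          average_precision_score(y_va, clf.predict_proba(X_va)[:, 1]))
    return clf
```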
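For the cost-matrix threshold item, a small sketch on a held-out set; the false-negative and false-positive costs are made-up placeholders:

```python
import numpy as np

def pick_threshold(y_true, p_cal, cost_fn=500.0, cost_fp=5.0):
    """Choose the score threshold that minimizes expected cost on held-out data.
    cost_fn: cost of missing a bad seller (false negative).
    cost_fp: cost of flagging a good seller for review (false positive).
    Assumes p_cal are calibrated probabilities (see part (d)); with perfect
    calibration the optimum is also cost_fp / (cost_fp + cost_fn) in closed form.
    """
    thresholds = np.linspace(0.0, 1.0, 1001)
    costs = []
    for t in thresholds:
        pred = p_cal >= t
        fn = np.sum((y_true == 1) & ~pred)
        fp = np.sum((y_true == 0) & pred)
        costs.append(cost_fn * fn + cost_fp * fp)
    return thresholds[int(np.argmin(costs))]
```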
(d) Post-training Calibration and Interpretation
- How to calibrate probabilities (see the calibration/SHAP sketch after this list).
- How to interpret the model for investigators (e.g., SHAP).
- How to prevent attackers from reverse-engineering rules while providing useful explanations.
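For part (d), a minimal sketch assuming a fitted classifier `clf`, a held-out calibration set `X_calib`, `y_calib` that was not used for training or early stopping, new cases `X_new`, and the `shap` package; isotonic regression is one reasonable calibration choice (Platt scaling is another):

```python
import shap
from sklearn.isotonic import IsotonicRegression

# --- Calibration on a held-out set (never the training or early-stopping data) ---
# Note: scale_pos_weight distorts predicted probabilities, so calibration
# (or refitting without the weight) is needed before any cost-based threshold.
raw_calib = clf.predict_proba(X_calib)[:, 1]      # raw booster scores
iso = IsotonicRegression(out_of_bounds="clip")
iso.fit(raw_calib, y_calib)                       # map raw scores -> calibrated probabilities
p_calibrated = iso.predict(clf.predict_proba(X_new)[:, 1])

# --- Interpretation for investigators ---
explainer = shap.TreeExplainer(clf)               # SHAP values for tree ensembles
shap_values = explainer.shap_values(X_new)        # per-case feature attributions
shap.summary_plot(shap_values, X_new)             # global view of feature effects

# One mitigation against reverse-engineering (an assumption, not a fixed recipe):
# surface only the top-k SHAP features per flagged seller, described in coarse
# buckets rather than exact thresholds, so explanations stay useful without
# exposing the precise decision boundary.
```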