XGBoost Tree Booster: Objective, Hyperparameters, Tuning for Imbalanced Detection, and Post-training Use
Context: You are building a binary classifier with XGBoost (tree booster) to flag “bad sellers.” Positives are rare (≈0.5%). Answer the following:
(a) Objective, Second-Order Approximation, Split Gain, and Regularization Roles
- What objective does XGBoost optimize for tree boosting?
- How does the second-order Taylor approximation lead to the split "gain" formula? (A reference derivation follows this list.)
- Explain the roles of lambda (L2), alpha (L1), gamma (min_split_loss), and learning rate (eta) in split selection and pruning.
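For reference when answering (a), a minimal LaTeX sketch of the standard formulation (notation as in the XGBoost paper; the L1 term alpha enters via soft-thresholding of the leaf weights and is omitted from the closed forms below):

```latex
% Regularized objective at boosting round t
\mathcal{L}^{(t)} = \sum_{i} l\!\left(y_i,\; \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 + \alpha \sum_{j=1}^{T} |w_j|

% Second-order Taylor expansion around the current prediction,
% with g_i = \partial_{\hat{y}} l and h_i = \partial^2_{\hat{y}} l
\tilde{\mathcal{L}}^{(t)} \simeq \sum_{i}\left[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^2 \right] + \Omega(f_t)

% Optimal leaf weight and the resulting split gain
% (G_j, H_j are the sums of g_i, h_i over the instances in leaf j)
w_j^{*} = -\frac{G_j}{H_j + \lambda},
\qquad
\text{Gain} = \tfrac{1}{2}\left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda}
            - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma
```

A split is kept only when the gain is positive, so gamma acts as a direct pruning threshold, lambda shrinks leaf weights (and hence gains), and eta scales each fitted tree's contribution after it is built rather than entering the gain itself.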
(b) Impactful Hyperparameters for Tabular Classification
For each hyperparameter, describe expected direction of effect on bias, variance, and training time:
- max_depth, max_leaves, min_child_weight, subsample, colsample_bytree/colsample_bylevel, eta, n_estimators, lambda, alpha, gamma, max_delta_step, scale_pos_weight, monotone_constraints. (A configuration sketch showing where these sit in the Python API follows this list.)
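For part (b), a non-authoritative sketch of where these knobs live in the Python sklearn wrapper (`xgboost.XGBClassifier`); all values are illustrative placeholders, not recommendations:

```python
import xgboost as xgb

# Placeholder values only; tune via cross-validation on your own data.
clf = xgb.XGBClassifier(
    objective="binary:logistic",
    n_estimators=2000,          # many rounds plus early stopping, not a fixed small number
    learning_rate=0.05,         # eta: smaller -> less variance per round, more rounds needed
    max_depth=6,                # deeper -> lower bias, higher variance, slower training
    max_leaves=0,               # leaf cap; 0 = no limit (mainly with grow_policy="lossguide")
    min_child_weight=5,         # larger -> more conservative splits (higher bias, lower variance)
    subsample=0.8,              # row subsampling per tree
    colsample_bytree=0.8,       # feature subsampling per tree
    colsample_bylevel=1.0,      # feature subsampling per depth level
    reg_lambda=1.0,             # L2 penalty on leaf weights
    reg_alpha=0.0,              # L1 penalty on leaf weights
    gamma=0.0,                  # min_split_loss: minimum gain required to keep a split
    max_delta_step=1,           # caps leaf outputs; often helpful under extreme imbalance
    scale_pos_weight=199,       # roughly negatives/positives for a 0.5% positive rate
    monotone_constraints=None,  # e.g. "(1,-1,0)" to force per-feature monotone effects
    tree_method="hist",
    eval_metric="aucpr",
)
```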
(c) Tuning Plan for 0.5% Positive Rate (“Bad Sellers”)
- Specify a data split strategy that avoids leakage (time- and seller-based splits; see the split/early-stopping sketch after this list).
- Choose the primary offline metric and justify it (e.g., PR-AUC).
- Show how to set an operating threshold using a cost matrix (see the threshold sketch after this list).
- Describe robust early stopping.
- List diagnostics/plots to detect overfitting and data leakage.
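For the split, metric, and early-stopping items in (c), one possible sketch. The column names (`seller_id`, `event_time`, `label`) and the cutoff arguments are assumptions about the data layout, and passing `early_stopping_rounds` to the constructor assumes xgboost >= 1.6:

```python
import xgboost as xgb
from sklearn.metrics import average_precision_score

# Assumed schema: df has 'seller_id', 'event_time', 'label', plus feature columns.
def time_and_seller_split(df, feature_cols, train_end, valid_end):
    """Train on events before `train_end`, validate on a later window, and drop
    validation sellers already seen in training to avoid seller-level leakage
    across the time boundary."""
    train = df[df["event_time"] < train_end]
    valid = df[(df["event_time"] >= train_end) & (df["event_time"] < valid_end)]
    valid = valid[~valid["seller_id"].isin(train["seller_id"])]
    return (train[feature_cols], train["label"],
            valid[feature_cols], valid["label"])

def fit_with_early_stopping(X_tr, y_tr, X_va, y_va):
    # PR-AUC ("aucpr") as the early-stopping metric: with a 0.5% positive rate,
    # ROC-AUC and accuracy are dominated by the negative class.
    clf = xgb.XGBClassifier(
        objective="binary:logistic",
        n_estimators=5000,
        learning_rate=0.05,
        eval_metric="aucpr",
        early_stopping_rounds=100,   # stop after 100 rounds without PR-AUC improvement
        scale_pos_weight=(y_tr == 0).sum() / max((y_tr == 1).sum(), 1),
        tree_method="hist",
    )
    clf.fit(X_tr, y_tr, eval_set=[(X_va, y_va)], verbose=False)
    print("validation PR-AUC:",
          average_precision_score(y_va, clf.predict_proba(X_va)[:, 1]))
    return clf
```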
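For the cost-matrix threshold item, a small sketch on a held-out set; the false-negative and false-positive costs are made-up placeholders:

```python
import numpy as np

def pick_threshold(y_true, p_cal, cost_fn=500.0, cost_fp=5.0):
    """Choose the score threshold that minimizes expected cost on held-out data.
    cost_fn: cost of missing a bad seller (false negative).
    cost_fp: cost of flagging a good seller for review (false positive).
    Assumes p_cal are calibrated probabilities (see part (d)); with perfect
    calibration the optimum is also cost_fp / (cost_fp + cost_fn) in closed form.
    """
    thresholds = np.linspace(0.0, 1.0, 1001)
    costs = []
    for t in thresholds:
        pred = p_cal >= t
        fn = np.sum((y_true == 1) & ~pred)
        fp = np.sum((y_true == 0) & pred)
        costs.append(cost_fn * fn + cost_fp * fp)
    return thresholds[int(np.argmin(costs))]
```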
(d) Post-training Calibration and Interpretation
- How to calibrate probabilities (see the calibration/SHAP sketch after this list).
- How to interpret the model for investigators (e.g., SHAP).
- How to prevent attackers from reverse-engineering rules while providing useful explanations.
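For part (d), a minimal sketch assuming a fitted classifier `clf`, a held-out calibration set `X_calib`, `y_calib` that was not used for training or early stopping, new cases `X_new`, and the `shap` package; isotonic regression is one reasonable calibration choice (Platt scaling is another):

```python
import shap
from sklearn.isotonic import IsotonicRegression

# --- Calibration on a held-out set (never the training or early-stopping data) ---
# Note: scale_pos_weight distorts predicted probabilities, so calibration
# (or refitting without the weight) is needed before any cost-based threshold.
raw_calib = clf.predict_proba(X_calib)[:, 1]      # raw booster scores
iso = IsotonicRegression(out_of_bounds="clip")
iso.fit(raw_calib, y_calib)                       # map raw scores -> calibrated probabilities
p_calibrated = iso.predict(clf.predict_proba(X_new)[:, 1])

# --- Interpretation for investigators ---
explainer = shap.TreeExplainer(clf)               # SHAP values for tree ensembles
shap_values = explainer.shap_values(X_new)        # per-case feature attributions
shap.summary_plot(shap_values, X_new)             # global view of feature effects

# One mitigation against reverse-engineering (an assumption, not a fixed recipe):
# surface only the top-k SHAP features per flagged seller, described in coarse
# buckets rather than exact thresholds, so explanations stay useful without
# exposing the precise decision boundary.
```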