Binary Classification Under Compute and Imbalance Constraints
Context
You are training an XGBoost model for a binary classification problem with:
- 1,000,000 rows, 100 features (20 numeric, 80 categorical that are one‑hot encoded)
- Positive class rate ≈ 1% (10,000 positives / 990,000 negatives)
- Hardware: single 16‑core CPU, 32 GB RAM
- Wall‑clock training time budget: ≤ 5 minutes
Assume you can provide a user_id to group rows (to prevent leakage in validation) and that features may contain missing values (NaNs). The one‑hot columns are sparse 0/1 indicators.
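To make the sketches that follow concrete, here is a hypothetical, scaled‑down stand‑in for this dataset; the names (`X`, `y`, `groups`), the NaN rate, the one‑hot density, and the 100× size reduction are all illustrative assumptions, not part of the problem statement.

```python
# Hypothetical synthetic stand-in for the dataset described above, scaled
# down 100x so the sketches below run quickly. Column names, user_id
# cardinality, and NaN rate are illustrative assumptions.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 10_000  # stands in for 1,000,000 rows

# 20 numeric features, one of them with ~5% missing values
num = pd.DataFrame(rng.normal(size=(n_rows, 20)),
                   columns=[f"num_{i}" for i in range(20)])
num.loc[rng.random(n_rows) < 0.05, "num_0"] = np.nan

# 80 sparse 0/1 one-hot indicator columns
onehot = pd.DataFrame((rng.random((n_rows, 80)) < 0.02).astype(np.int8),
                      columns=[f"oh_{i}" for i in range(80)])

X = pd.concat([num, onehot], axis=1)
y = (rng.random(n_rows) < 0.01).astype(int)          # ≈ 1% positive rate
groups = rng.integers(0, n_rows // 5, size=n_rows)   # user_id, ~5 rows/user
```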
Tasks
- Propose initial XGBoost hyperparameters (eta/learning_rate, max_depth, min_child_weight, subsample, colsample_bytree, lambda, alpha, n_estimators, max_bin or tree_method) and justify each in terms of the bias–variance trade‑off, class imbalance, and the compute constraints. (A starter configuration is sketched after this list.)
- Describe an efficient tuning strategy: the search space, early stopping, and a cross‑validation scheme that prevents leakage from users appearing in multiple folds. (See the grouped‑CV sketch below.)
- Explain exactly how XGBoost handles missing values during tree splitting, and how that behavior interacts with one‑hot encoding versus target encoding. (A small demonstration follows.)
- Given severe minority‑class scarcity, compare using scale_pos_weight vs a weighted loss (per‑row sample weights) vs focal loss; when would each be preferable? (All three are sketched below.)
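For the first task, one plausible starting configuration under the 16‑core / 5‑minute budget might look like the sketch below; every value is an assumption to be revisited during tuning, not a prescribed answer.

```python
# Hypothetical starter configuration; every value is an assumption to be
# revisited during tuning, not a recommendation from the problem statement.
params = {
    "objective": "binary:logistic",
    "tree_method": "hist",       # histogram splits: the fast CPU path
    "max_bin": 256,              # default bin count; fewer bins trade accuracy for speed
    "eta": 0.1,                  # moderate learning rate for a limited round budget
    "max_depth": 6,              # shallow trees: lower variance, cheaper per round
    "min_child_weight": 10,      # blocks splits supported by only a few positives
    "subsample": 0.8,            # row subsampling: variance reduction and speed
    "colsample_bytree": 0.8,     # column subsampling across the 100 features
    "lambda": 1.0,               # L2 regularization (XGBoost default)
    "alpha": 0.0,                # L1 off to start
    "scale_pos_weight": 99.0,    # ≈ negatives / positives at a 1% positive rate
    "eval_metric": "aucpr",      # PR AUC is more informative than accuracy at 1%
    "nthread": 16,
}
# n_estimators: set a high cap (e.g. 1000 rounds) and let early stopping pick.
```

The histogram method with a fixed `max_bin` keeps per‑iteration cost roughly linear in the number of rows, which is what makes the 5‑minute budget plausible on this hardware.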
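For the tuning task, a minimal leakage‑free loop (assuming the `X`, `y`, `groups`, and `params` defined above) combines GroupKFold on user_id with early stopping; a random search over max_depth, min_child_weight, subsample, and colsample_bytree would wrap this loop.

```python
# Minimal sketch of leakage-free evaluation: group-aware folds on user_id
# plus early stopping. Reuses the synthetic X, y, groups, params from above.
import xgboost as xgb
from sklearn.model_selection import GroupKFold

cv_scores = []
for train_idx, valid_idx in GroupKFold(n_splits=5).split(X, y, groups):
    dtrain = xgb.DMatrix(X.iloc[train_idx], label=y[train_idx])
    dvalid = xgb.DMatrix(X.iloc[valid_idx], label=y[valid_idx])
    booster = xgb.train(
        params, dtrain,
        num_boost_round=1000,        # high cap; early stopping picks the round
        evals=[(dvalid, "valid")],
        early_stopping_rounds=50,    # stop when validation aucpr stalls
        verbose_eval=False,
    )
    cv_scores.append(booster.best_score)
print(f"mean valid aucpr: {sum(cv_scores) / len(cv_scores):.4f}")
```

Because no user_id appears in both a training and a validation fold, the early‑stopping round count is chosen against genuinely unseen users.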
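For the missing‑value task, the mechanism to explain is that XGBoost evaluates each candidate split with the NaN rows sent left and then right, keeps the better ("default") direction, and never imputes. A one‑hot 0/1 indicator carries no NaNs to route (missingness was already baked into the encoding), whereas a target‑encoded column can retain NaNs that receive this learned routing. A toy demonstration of the learned branch, with illustrative names:

```python
# Tiny demonstration that XGBoost learns a default direction for NaNs at
# each split rather than imputing them. min_child_weight=0 only so this
# six-row toy example is still allowed to split.
import numpy as np
import xgboost as xgb

X_demo = np.array([[0.1], [0.2], [np.nan], [0.9], [np.nan], [0.8]])
y_demo = np.array([0, 0, 1, 1, 1, 1])
d = xgb.DMatrix(X_demo, label=y_demo)
b = xgb.train({"objective": "binary:logistic", "max_depth": 2,
               "min_child_weight": 0}, d, num_boost_round=1)
print(b.get_dump()[0])  # each split node shows a "missing=..." child id
```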
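For the imbalance task, the three options differ in mechanism: scale_pos_weight is a single global multiplier on positive‑class gradients, per‑row weights allow arbitrary cost‑sensitive schemes, and focal loss is not built into XGBoost so it must be supplied as a custom objective. A hedged sketch of all three, reusing `X`, `y`, and `params` from above; gamma=2 and the finite‑difference Hessian are assumptions made for brevity:

```python
# Sketch of the three imbalance options. scale_pos_weight and per-row
# weights are native XGBoost features; focal loss is not built in, so it
# is supplied as a custom objective (preds are raw margins here).
import numpy as np
import xgboost as xgb

# Option 1: one global multiplier on positive-class gradients.
pos, neg = (y == 1).sum(), (y == 0).sum()
params_spw = {**params, "scale_pos_weight": neg / max(pos, 1)}

# Option 2: arbitrary per-row weights, e.g. cost-sensitive or recency-based.
w = np.where(y == 1, neg / max(pos, 1), 1.0)
dtrain_w = xgb.DMatrix(X, label=y, weight=w)

# Option 3: focal loss with gamma=2 (assumed), down-weighting easy examples.
def focal_obj(preds, dtrain, gamma=2.0):
    y_true = dtrain.get_label()

    def grad_at(z):
        # Exact dL/dz of the focal loss, elementwise in the margin z.
        p = 1.0 / (1.0 + np.exp(-z))
        log_p = np.log(np.clip(p, 1e-12, 1.0))
        log_1p = np.log(np.clip(1.0 - p, 1e-12, 1.0))
        g_pos = gamma * p * (1 - p) ** gamma * log_p - (1 - p) ** (gamma + 1)
        g_neg = p ** (gamma + 1) - gamma * (1 - p) * p ** gamma * log_1p
        return np.where(y_true == 1, g_pos, g_neg)

    eps = 1e-4
    grad = grad_at(preds)
    # Central finite difference stands in for the (messy) analytic Hessian.
    hess = (grad_at(preds + eps) - grad_at(preds - eps)) / (2 * eps)
    return grad, np.maximum(hess, 1e-12)  # keep the Hessian positive

booster_focal = xgb.train({"tree_method": "hist", "max_depth": 6},
                          xgb.DMatrix(X, label=y),
                          num_boost_round=50, obj=focal_obj)
```

As a rough rule of thumb: scale_pos_weight when a single ranking‑oriented knob suffices, per‑row weights when misclassification costs vary by example, and focal loss when training is dominated by a sea of easy negatives.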