Fraud Detection on 10M Time-Ordered Transactions (0.5% Fraud)
You are building a binary classifier to detect fraudulent events, which make up 0.5% of 10,000,000 time-ordered transactions (roughly 50,000 positives). Each transaction has 300 features (100 numeric, 200 one-hot encoded). You must choose between a bagged Random Forest and a Gradient Boosting model (e.g., XGBoost or LightGBM).
Address the following:
- Model choice
  - Which would you try first (Random Forest vs Gradient Boosting) and why?
  - Reference: bias–variance trade-offs, margins, and how each method reacts to label noise on the minority class (see the decomposition sketched below).
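For orientation, the standard squared-error bias–variance decomposition that frames this choice ($f$ is the true function, $\hat f$ the fitted model, $\sigma^2$ the irreducible noise):

$$
\mathbb{E}\big[(y - \hat f(x))^2\big]
= \underbrace{\big(f(x) - \mathbb{E}[\hat f(x)]\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\big[\big(\hat f(x) - \mathbb{E}[\hat f(x)]\big)^2\big]}_{\text{variance}}
+ \sigma^2
$$

Bagging averages deep, low-bias/high-variance trees and mainly shrinks the variance term, which also makes it relatively forgiving of mislabeled minority examples; boosting attacks the bias term by reweighting hard examples, so a mislabeled fraud case tends to get upweighted round after round.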
- Class imbalance and metric
  - How will you handle class imbalance (class weights vs downsampling the majority vs SMOTE/SMOTE-ENN)?
  - Which primary metric will you optimize (PR-AUC vs ROC-AUC) and why? (A numeric sketch follows this list.)
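A minimal, self-contained sketch of the imbalance arithmetic and the metric comparison; the synthetic scores below are purely illustrative, and `scale_pos_weight`/`class_weight` follow the standard XGBoost/LightGBM and scikit-learn conventions:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Imbalance arithmetic for 10M transactions at 0.5% prevalence.
n_pos, n_neg = 50_000, 9_950_000
scale_pos_weight = n_neg / n_pos            # ~199; XGBoost/LightGBM convention
class_weight = {0: 1.0, 1: n_neg / n_pos}   # explicit scikit-learn-style weights

# Why PR-AUC: on synthetic scores with modest separation, ROC-AUC looks
# respectable because it is dominated by the 99.5% negatives, while average
# precision (chance level = 0.005 here) exposes the weakness directly.
rng = np.random.default_rng(0)
y_true = (rng.random(200_000) < 0.005).astype(int)
y_score = y_true * rng.normal(1.0, 1.0, y_true.size) + rng.normal(0.0, 1.0, y_true.size)
print("ROC-AUC:", roc_auc_score(y_true, y_score))             # ~0.7
print("PR-AUC :", average_precision_score(y_true, y_score))   # small despite the decent ROC-AUC
```

The gap between the two numbers is the argument in miniature: ROC-AUC is insensitive to the absolute number of false positives among 9.95M negatives, while average precision tracks precision on the rare positive class, which is what the fraud team actually acts on.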
- Hyperparameter grids (initial, concrete values and expected effects)
  - Random Forest: n_estimators, max_depth, max_features, class_weight.
  - Gradient Boosting (XGBoost/LightGBM): learning_rate, max_depth, min_child_weight, subsample, colsample_bytree, scale_pos_weight. (Example starting grids are sketched below.)
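One plausible set of starting grids, expressed as plain Python dicts (scikit-learn `param_grid` style); the values are conventional defaults for a large imbalanced tabular problem, not tuned results, and each comment states the expected effect of the knob:

```python
# Random Forest: illustrative starting values.
rf_grid = {
    "n_estimators": [200, 500, 1000],     # more trees: lower variance, higher cost
    "max_depth": [None, 16, 32],          # shallower trees: more bias, less overfitting
    "max_features": ["sqrt", 0.1, 0.3],   # fewer features per split decorrelates trees
    "class_weight": ["balanced", "balanced_subsample", {0: 1, 1: 199}],
}

# Gradient boosting (XGBoost-style names): illustrative starting values.
xgb_grid = {
    "learning_rate": [0.02, 0.05, 0.1],   # lower rate + more rounds generalizes better
    "max_depth": [4, 6, 8],               # boosting wants shallow weak learners
    "min_child_weight": [1, 10, 100],     # larger values suppress splits on a handful of fraud rows
    "subsample": [0.6, 0.8, 1.0],         # row sampling adds stochastic regularization
    "colsample_bytree": [0.5, 0.8, 1.0],  # column sampling, analogous to max_features
    "scale_pos_weight": [1, 50, 199],     # up-weights the 0.5% positive class
}
```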
- Validation plan (avoid leakage)
  - Describe a time-based blocked cross-validation plan and grouping by user_id.
  - How will you use early stopping (for boosting) or OOB estimates (for RF)? (A splitter and early-stopping sketch follows.)
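A minimal sketch of the blocked, forward-in-time plan, assuming rows are already sorted by timestamp and that `user_ids` is an aligned array (both hypothetical names); the early-stopping and OOB calls are commented and use LightGBM's and scikit-learn's standard APIs:

```python
import numpy as np

def blocked_time_splits(n_rows, n_splits=4, gap=50_000):
    """Expanding-window splits over rows sorted by transaction time: train on
    everything before a cutoff, skip `gap` rows to absorb the fraud-label
    reporting delay, then validate on the next contiguous block."""
    block = n_rows // (n_splits + 1)
    for k in range(1, n_splits + 1):
        train_idx = np.arange(0, k * block)
        valid_idx = np.arange(k * block + gap, min((k + 1) * block, n_rows))
        yield train_idx, valid_idx

def enforce_user_grouping(train_idx, valid_idx, user_ids):
    """One (aggressive) way to group by user_id on top of the time split:
    drop validation rows for users already seen in training, so per-user
    patterns cannot leak across the boundary."""
    seen = set(user_ids[train_idx])
    keep = np.fromiter((user_ids[i] not in seen for i in valid_idx), dtype=bool)
    return valid_idx[keep]

for tr, va in blocked_time_splits(n_rows=1_000_000, n_splits=3, gap=10_000):
    print(f"train [0, {tr[-1]}]  ->  valid [{va[0]}, {va[-1]}]")

# Early stopping (boosting) and OOB (RF), sketched against these splits:
# import lightgbm as lgb
# model = lgb.LGBMClassifier(n_estimators=5000, learning_rate=0.05,
#                            scale_pos_weight=199)
# model.fit(X[tr], y[tr], eval_set=[(X[va], y[va])],
#           eval_metric="average_precision",
#           callbacks=[lgb.early_stopping(stopping_rounds=100)])
#
# from sklearn.ensemble import RandomForestClassifier
# rf = RandomForestClassifier(n_estimators=500, oob_score=True, n_jobs=-1)
# Note: rf.oob_score_ is plain accuracy by default, which is uninformative at
# 0.5% prevalence; prefer scoring the blocked validation folds with PR-AUC.
```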
- Failure modes
  - Two cases where boosting would underperform bagging on this task, and how you would diagnose them with plots/diagnostics. (A diagnostic sketch follows.)
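As one concrete diagnostic, LightGBM records per-round validation metrics when an `eval_set` is passed, so the trajectory of validation PR-AUC can be plotted directly; the toy data below is synthetic and purely illustrative. A validation curve that peaks early and then decays while training loss keeps improving is the classic signature of boosting chasing label noise on the minority class, whereas an RF baseline's PR-AUC would sit as a flat reference line:

```python
import numpy as np
import lightgbm as lgb
import matplotlib.pyplot as plt

# Synthetic stand-in for the real transactions (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(20_000, 20))
y = (rng.random(20_000) < 0.01).astype(int)
X_tr, X_va, y_tr, y_va = X[:15_000], X[15_000:], y[:15_000], y[15_000:]

model = lgb.LGBMClassifier(n_estimators=300, learning_rate=0.1)
model.fit(X_tr, y_tr, eval_set=[(X_va, y_va)], eval_metric="average_precision")

# Per-round validation PR-AUC: look for an early peak followed by decay.
history = model.evals_result_["valid_0"]["average_precision"]
plt.plot(history, label="boosting: validation PR-AUC per round")
plt.xlabel("boosting round")
plt.ylabel("PR-AUC")
plt.legend()
plt.show()
```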