Compare bagging vs boosting on imbalanced data
Company: TikTok
Role: Data Scientist
Category: Machine Learning
Difficulty: Medium
Interview Round: Technical Screen
You must detect 0.5% fraud in 10,000,000 time-ordered transactions with 300 features (100 numeric, 200 one-hot). Choose between Random Forest (bagging) and Gradient Boosting (e.g., XGBoost/LightGBM). Specify:
1. Which you would try first and why, referencing bias–variance trade-offs, margins, and how each method reacts to label noise on the minority class.
2. How you will handle class imbalance (class weights vs downsampling the majority vs SMOTE/SMOTE-ENN), and which primary metric you will optimize (PR-AUC vs ROC-AUC) and why.
3. An initial hyperparameter grid for each approach (RF: n_estimators, max_depth, max_features, class_weight; GBM: learning_rate, max_depth, min_child_weight, subsample, colsample_bytree, scale_pos_weight), with concrete starting values and expected effects.
4. Your validation plan for avoiding leakage (e.g., time-based blocked CV and grouping by user_id), and how you will use early stopping or OOB estimates.
5. Two failure modes where boosting would underperform bagging on this task, and how you would diagnose them with plots or diagnostics.
Quick Answer: This question evaluates ensemble model selection (bagging vs boosting), handling of extreme class imbalance, metric selection, hyperparameter tuning, time-aware validation, and failure-mode diagnosis for large-scale, time-ordered binary classification. It is commonly asked because it probes bias–variance trade-offs, robustness to label noise on the minority class, and practical leakage-avoidance concerns. Strong answers pair conceptual understanding of the algorithmic trade-offs with applied skills: designing time-blocked cross-validation, choosing metrics suited to imbalanced data (PR-AUC over ROC-AUC at 0.5% prevalence), specifying sensible hyperparameter grids, and reading diagnostic plots.
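A minimal sketch of starting configurations for sub-parts (2) and (3), assuming scikit-learn and the XGBoost Python API (xgboost >= 1.6, where early_stopping_rounds is a constructor argument); all values below are illustrative starting points, not tuned results:

```python
# Illustrative starting points only; tune on time-blocked validation folds.
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

N_POS, N_NEG = 50_000, 9_950_000  # 0.5% fraud out of 10M transactions

rf = RandomForestClassifier(
    n_estimators=500,                    # more trees lower variance, at higher cost
    max_depth=12,                        # cap depth to limit fitting label noise
    max_features="sqrt",                 # ~17 of 300 features per split decorrelates trees
    class_weight="balanced_subsample",   # reweight the minority within each bootstrap
    oob_score=True,                      # out-of-bag rows give a free generalization estimate
    n_jobs=-1,
)

gbm = XGBClassifier(
    learning_rate=0.05,                  # small steps; pair with early stopping
    max_depth=6,                         # shallow trees keep variance manageable
    min_child_weight=10,                 # discourages leaves built on a few rare positives
    subsample=0.8,                       # row subsampling as stochastic regularization
    colsample_bytree=0.8,                # feature subsampling helps with 200 one-hots
    scale_pos_weight=N_NEG / N_POS,      # ~199: upweights the 0.5% minority class
    n_estimators=2000,                   # upper bound; early stopping picks the real count
    early_stopping_rounds=50,            # stop when validation PR-AUC stalls
    eval_metric="aucpr",                 # PR-AUC; ROC-AUC is inflated at 0.5% prevalence
    tree_method="hist",                  # histogram splits scale to 10M rows
)
```

Note that scale_pos_weight and min_child_weight interact: heavy upweighting amplifies the pull of mislabeled positives, which is one reason to keep boosted trees shallow and regularized.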
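For sub-part (4), a sketch of expanding-window, time-blocked validation with a purge gap and PR-AUC scoring. The DataFrame schema (ts, user_id, y, feature list X_cols) and the fold parameters are assumptions for illustration:

```python
# Time-blocked CV sketch: every validation block is strictly later than its
# training data, with a purge gap so trailing-window features cannot leak.
import numpy as np
import pandas as pd
from sklearn.metrics import average_precision_score

def time_blocked_folds(df, n_folds=4, gap="1D"):
    """Yield (train_idx, valid_idx) pairs from a time-sorted DataFrame."""
    df = df.sort_values("ts")
    cuts = df["ts"].quantile(np.linspace(0.5, 1.0, n_folds + 1)).tolist()
    for lo, hi in zip(cuts[:-1], cuts[1:]):
        train_idx = df.index[df["ts"] < lo - pd.Timedelta(gap)]
        valid_idx = df.index[(df["ts"] >= lo) & (df["ts"] < hi)]
        # If per-user aggregates are among the features, also consider dropping
        # validation rows whose user_id appears in training (group-aware split).
        yield train_idx, valid_idx

for train_idx, valid_idx in time_blocked_folds(df):
    gbm.fit(
        df.loc[train_idx, X_cols], df.loc[train_idx, "y"],
        eval_set=[(df.loc[valid_idx, X_cols], df.loc[valid_idx, "y"])],
        verbose=False,                   # early stopping set on the estimator above
    )
    scores = gbm.predict_proba(df.loc[valid_idx, X_cols])[:, 1]
    print("fold PR-AUC:", average_precision_score(df.loc[valid_idx, "y"], scores))
```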
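For sub-part (5), one quick diagnostic: if boosting is chasing label noise on the rare positives, validation PR-AUC typically peaks early and then decays as rounds accumulate, while the Random Forest's OOB estimate stays flat. A sketch, assuming gbm was fit with an eval_set as above:

```python
import matplotlib.pyplot as plt

# Per-round validation PR-AUC recorded by XGBoost during training.
history = gbm.evals_result()["validation_0"]["aucpr"]
plt.plot(history)
plt.xlabel("boosting round")
plt.ylabel("validation PR-AUC")
plt.title("Early peak then decay suggests fitting minority-label noise")
plt.show()
```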