Fraud Detection on 10M Time-Ordered Transactions (0.5% Fraud)
You are building a binary classifier to detect fraudulent events, which make up 0.5% of 10,000,000 time-ordered transactions (roughly 50,000 positives). Each transaction has 300 features (100 numeric, 200 one-hot encoded). You must choose between a bagged Random Forest and a Gradient Boosting model (e.g., XGBoost or LightGBM).
Address the following:
- Model choice
  - Which would you try first (Random Forest vs Gradient Boosting) and why?
  - Reference: bias–variance trade-offs, margins, and how each method reacts to label noise on the minority class (see the decomposition sketched below).
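For orientation, the standard squared-error bias–variance decomposition that frames this choice ($f$ is the true function, $\hat f$ the fitted model, $\sigma^2$ the irreducible noise):

$$
\mathbb{E}\big[(y - \hat f(x))^2\big]
= \underbrace{\big(f(x) - \mathbb{E}[\hat f(x)]\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\big[\big(\hat f(x) - \mathbb{E}[\hat f(x)]\big)^2\big]}_{\text{variance}}
+ \sigma^2
$$

Bagging averages deep, low-bias/high-variance trees and mainly shrinks the variance term, which also makes it relatively forgiving of mislabeled minority examples; boosting attacks the bias term by reweighting hard examples, so a mislabeled fraud case tends to get upweighted round after round.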
- Class imbalance and metric
  - How will you handle class imbalance (class weights vs downsampling the majority vs SMOTE/SMOTE-ENN)?
  - Which primary metric will you optimize (PR-AUC vs ROC-AUC) and why? (A numeric sketch follows this list.)
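A minimal, self-contained sketch of the imbalance arithmetic and the metric comparison; the synthetic scores below are purely illustrative, and `scale_pos_weight`/`class_weight` follow the standard XGBoost/LightGBM and scikit-learn conventions:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Imbalance arithmetic for 10M transactions at 0.5% prevalence.
n_pos, n_neg = 50_000, 9_950_000
scale_pos_weight = n_neg / n_pos            # ~199; XGBoost/LightGBM convention
class_weight = {0: 1.0, 1: n_neg / n_pos}   # explicit scikit-learn-style weights

# Why PR-AUC: on synthetic scores with modest separation, ROC-AUC looks
# respectable because it is dominated by the 99.5% negatives, while average
# precision (chance level = 0.005 here) exposes the weakness directly.
rng = np.random.default_rng(0)
y_true = (rng.random(200_000) < 0.005).astype(int)
y_score = y_true * rng.normal(1.0, 1.0, y_true.size) + rng.normal(0.0, 1.0, y_true.size)
print("ROC-AUC:", roc_auc_score(y_true, y_score))             # ~0.7
print("PR-AUC :", average_precision_score(y_true, y_score))   # small despite the decent ROC-AUC
```

The gap between the two numbers is the argument in miniature: ROC-AUC is insensitive to the absolute number of false positives among 9.95M negatives, while average precision tracks precision on the rare positive class, which is what the fraud team actually acts on.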
- Hyperparameter grids (initial, concrete values and expected effects)
  - Random Forest: n_estimators, max_depth, max_features, class_weight.
  - Gradient Boosting (XGBoost/LightGBM): learning_rate, max_depth, min_child_weight, subsample, colsample_bytree, scale_pos_weight. (Example starting grids are sketched below.)
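One plausible set of starting grids, expressed as plain Python dicts (scikit-learn `param_grid` style); the values are conventional defaults for a large imbalanced tabular problem, not tuned results, and each comment states the expected effect of the knob:

```python
# Random Forest: illustrative starting values.
rf_grid = {
    "n_estimators": [200, 500, 1000],     # more trees: lower variance, higher cost
    "max_depth": [None, 16, 32],          # shallower trees: more bias, less overfitting
    "max_features": ["sqrt", 0.1, 0.3],   # fewer features per split decorrelates trees
    "class_weight": ["balanced", "balanced_subsample", {0: 1, 1: 199}],
}

# Gradient boosting (XGBoost-style names): illustrative starting values.
xgb_grid = {
    "learning_rate": [0.02, 0.05, 0.1],   # lower rate + more rounds generalizes better
    "max_depth": [4, 6, 8],               # boosting wants shallow weak learners
    "min_child_weight": [1, 10, 100],     # larger values suppress splits on a handful of fraud rows
    "subsample": [0.6, 0.8, 1.0],         # row sampling adds stochastic regularization
    "colsample_bytree": [0.5, 0.8, 1.0],  # column sampling, analogous to max_features
    "scale_pos_weight": [1, 50, 199],     # up-weights the 0.5% positive class
}
```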
- Validation plan (avoid leakage)
  - Describe a time-based blocked cross-validation plan and grouping by user_id.
  - How will you use early stopping (for boosting) or OOB estimates (for RF)? (A splitter and early-stopping sketch follows.)
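A minimal sketch of the blocked, forward-in-time plan, assuming rows are already sorted by timestamp and that `user_ids` is an aligned array (both hypothetical names); the early-stopping and OOB calls are commented and use LightGBM's and scikit-learn's standard APIs:

```python
import numpy as np

def blocked_time_splits(n_rows, n_splits=4, gap=50_000):
    """Expanding-window splits over rows sorted by transaction time: train on
    everything before a cutoff, skip `gap` rows to absorb the fraud-label
    reporting delay, then validate on the next contiguous block."""
    block = n_rows // (n_splits + 1)
    for k in range(1, n_splits + 1):
        train_idx = np.arange(0, k * block)
        valid_idx = np.arange(k * block + gap, min((k + 1) * block, n_rows))
        yield train_idx, valid_idx

def enforce_user_grouping(train_idx, valid_idx, user_ids):
    """One (aggressive) way to group by user_id on top of the time split:
    drop validation rows for users already seen in training, so per-user
    patterns cannot leak across the boundary."""
    seen = set(user_ids[train_idx])
    keep = np.fromiter((user_ids[i] not in seen for i in valid_idx), dtype=bool)
    return valid_idx[keep]

for tr, va in blocked_time_splits(n_rows=1_000_000, n_splits=3, gap=10_000):
    print(f"train [0, {tr[-1]}]  ->  valid [{va[0]}, {va[-1]}]")

# Early stopping (boosting) and OOB (RF), sketched against these splits:
# import lightgbm as lgb
# model = lgb.LGBMClassifier(n_estimators=5000, learning_rate=0.05,
#                            scale_pos_weight=199)
# model.fit(X[tr], y[tr], eval_set=[(X[va], y[va])],
#           eval_metric="average_precision",
#           callbacks=[lgb.early_stopping(stopping_rounds=100)])
#
# from sklearn.ensemble import RandomForestClassifier
# rf = RandomForestClassifier(n_estimators=500, oob_score=True, n_jobs=-1)
# Note: rf.oob_score_ is plain accuracy by default, which is uninformative at
# 0.5% prevalence; prefer scoring the blocked validation folds with PR-AUC.
```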
- Failure modes
  - Two cases where boosting would underperform bagging on this task, and how you would diagnose them with plots/diagnostics. (A diagnostic sketch follows.)
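As one concrete diagnostic, LightGBM records per-round validation metrics when an `eval_set` is passed, so the trajectory of validation PR-AUC can be plotted directly; the toy data below is synthetic and purely illustrative. A validation curve that peaks early and then decays while training loss keeps improving is the classic signature of boosting chasing label noise on the minority class, whereas an RF baseline's PR-AUC would sit as a flat reference line:

```python
import numpy as np
import lightgbm as lgb
import matplotlib.pyplot as plt

# Synthetic stand-in for the real transactions (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(20_000, 20))
y = (rng.random(20_000) < 0.01).astype(int)
X_tr, X_va, y_tr, y_va = X[:15_000], X[15_000:], y[:15_000], y[15_000:]

model = lgb.LGBMClassifier(n_estimators=300, learning_rate=0.1)
model.fit(X_tr, y_tr, eval_set=[(X_va, y_va)], eval_metric="average_precision")

# Per-round validation PR-AUC: look for an early peak followed by decay.
history = model.evals_result_["valid_0"]["average_precision"]
plt.plot(history, label="boosting: validation PR-AUC per round")
plt.xlabel("boosting round")
plt.ylabel("PR-AUC")
plt.legend()
plt.show()
```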