Compare bagging vs boosting on imbalanced data
Company: TikTok
Role: Data Scientist
Category: Machine Learning
Difficulty: Medium
Interview Round: Technical Screen
You must detect 0.5% fraud in 10,000,000 time-ordered transactions with 300 features (100 numeric, 200 one-hot). Choose between Random Forest (bagging) and Gradient Boosting (e.g., XGBoost/LightGBM). Specify:
1. Which you would try first and why, referencing bias–variance trade-offs, margins, and how each method reacts to label noise on the minority class.
2. How you will handle class imbalance (class weights vs downsampling the majority vs SMOTE/SMOTE-ENN), and which primary metric you will optimize (PR-AUC vs ROC-AUC) and why.
3. An initial hyperparameter grid for each approach (RF: n_estimators, max_depth, max_features, class_weight; GBM: learning_rate, max_depth, min_child_weight, subsample, colsample_bytree, scale_pos_weight), with concrete starting values and expected effects.
4. Your validation plan for avoiding leakage (e.g., time-based blocked CV and grouping by user_id), and how you will use early stopping or OOB estimates.
5. Two failure modes where boosting would underperform bagging on this task, and how you would diagnose them with plots or diagnostics.
Quick Answer: This question evaluates ensemble model selection (bagging vs boosting), handling of extreme class imbalance, metric selection, hyperparameter tuning, time-aware validation, and failure-mode diagnosis for large-scale, time-ordered binary classification. It is commonly asked because it probes bias–variance trade-offs, robustness to label noise on the minority class, and practical leakage-avoidance concerns. Strong answers pair conceptual understanding of the algorithmic trade-offs with applied skills: designing time-blocked cross-validation, choosing metrics suited to imbalanced data (PR-AUC over ROC-AUC at 0.5% prevalence), specifying sensible hyperparameter grids, and reading diagnostic plots.
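A minimal sketch of starting configurations for sub-parts (2) and (3), assuming scikit-learn and the XGBoost Python API (xgboost >= 1.6, where early_stopping_rounds is a constructor argument); all values below are illustrative starting points, not tuned results:

```python
# Illustrative starting points only; tune on time-blocked validation folds.
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

N_POS, N_NEG = 50_000, 9_950_000  # 0.5% fraud out of 10M transactions

rf = RandomForestClassifier(
    n_estimators=500,                    # more trees lower variance, at higher cost
    max_depth=12,                        # cap depth to limit fitting label noise
    max_features="sqrt",                 # ~17 of 300 features per split decorrelates trees
    class_weight="balanced_subsample",   # reweight the minority within each bootstrap
    oob_score=True,                      # out-of-bag rows give a free generalization estimate
    n_jobs=-1,
)

gbm = XGBClassifier(
    learning_rate=0.05,                  # small steps; pair with early stopping
    max_depth=6,                         # shallow trees keep variance manageable
    min_child_weight=10,                 # discourages leaves built on a few rare positives
    subsample=0.8,                       # row subsampling as stochastic regularization
    colsample_bytree=0.8,                # feature subsampling helps with 200 one-hots
    scale_pos_weight=N_NEG / N_POS,      # ~199: upweights the 0.5% minority class
    n_estimators=2000,                   # upper bound; early stopping picks the real count
    early_stopping_rounds=50,            # stop when validation PR-AUC stalls
    eval_metric="aucpr",                 # PR-AUC; ROC-AUC is inflated at 0.5% prevalence
    tree_method="hist",                  # histogram splits scale to 10M rows
)
```

Note that scale_pos_weight and min_child_weight interact: heavy upweighting amplifies the pull of mislabeled positives, which is one reason to keep boosted trees shallow and regularized.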
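For sub-part (4), a sketch of expanding-window, time-blocked validation with a purge gap and PR-AUC scoring. The DataFrame schema (ts, user_id, y, feature list X_cols) and the fold parameters are assumptions for illustration:

```python
# Time-blocked CV sketch: every validation block is strictly later than its
# training data, with a purge gap so trailing-window features cannot leak.
import numpy as np
import pandas as pd
from sklearn.metrics import average_precision_score

def time_blocked_folds(df, n_folds=4, gap="1D"):
    """Yield (train_idx, valid_idx) pairs from a time-sorted DataFrame."""
    df = df.sort_values("ts")
    cuts = df["ts"].quantile(np.linspace(0.5, 1.0, n_folds + 1)).tolist()
    for lo, hi in zip(cuts[:-1], cuts[1:]):
        train_idx = df.index[df["ts"] < lo - pd.Timedelta(gap)]
        valid_idx = df.index[(df["ts"] >= lo) & (df["ts"] < hi)]
        # If per-user aggregates are among the features, also consider dropping
        # validation rows whose user_id appears in training (group-aware split).
        yield train_idx, valid_idx

for train_idx, valid_idx in time_blocked_folds(df):
    gbm.fit(
        df.loc[train_idx, X_cols], df.loc[train_idx, "y"],
        eval_set=[(df.loc[valid_idx, X_cols], df.loc[valid_idx, "y"])],
        verbose=False,                   # early stopping set on the estimator above
    )
    scores = gbm.predict_proba(df.loc[valid_idx, X_cols])[:, 1]
    print("fold PR-AUC:", average_precision_score(df.loc[valid_idx, "y"], scores))
```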
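For sub-part (5), one quick diagnostic: if boosting is chasing label noise on the rare positives, validation PR-AUC typically peaks early and then decays as rounds accumulate, while the Random Forest's OOB estimate stays flat. A sketch, assuming gbm was fit with an eval_set as above:

```python
import matplotlib.pyplot as plt

# Per-round validation PR-AUC recorded by XGBoost during training.
history = gbm.evals_result()["validation_0"]["aucpr"]
plt.plot(history)
plt.xlabel("boosting round")
plt.ylabel("validation PR-AUC")
plt.title("Early peak then decay suggests fitting minority-label noise")
plt.show()
```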