Imbalanced Binary Classification: Learning, Evaluation, and Model Selection
Context
You are training a binary classifier where the positive class is rare (for example, 0.1–5% prevalence). You need to choose training strategies, evaluation metrics, cross-validation structure, and tuning methods that remain reliable under severe class imbalance and potential dataset shift.
Tasks
- Explain the impact of class imbalance on both learning and evaluation.
- Compare strategies to handle imbalance:
  - Random over-sampling and under-sampling
  - Synthetic methods (e.g., SMOTE, ADASYN)
  - Class weighting / cost-sensitive learning
  - Focal loss
  - Threshold moving (post-hoc decision thresholding)
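As a concrete illustration of the last two strategies, the sketch below pairs class weighting with post-hoc threshold moving. It assumes scikit-learn; the synthetic dataset, logistic regression model, and 0.30 cutoff are illustrative choices, not recommendations.

```python
# Sketch: cost-sensitive learning plus threshold moving on a toy
# imbalanced problem (~2% positives).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=5000, weights=[0.98, 0.02], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=0
)

# Class weighting: upweight the rare positive class during fitting.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_tr, y_tr)

# Threshold moving: decide on predicted probabilities with a cutoff
# other than the default 0.5 (in practice, tuned on validation data).
proba = clf.predict_proba(X_te)[:, 1]
threshold = 0.30  # illustrative value
y_pred = (proba >= threshold).astype(int)
```

Note that the two techniques are complementary: weighting changes what the model learns, while thresholding changes only the final decision rule.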
- Describe how to structure cross-validation to avoid leakage:
  - Perform any resampling within each training fold only
  - Use stratified folds; consider grouped or time-based splits when relevant
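The fold-level rule above can be sketched as follows. Random over-sampling is implemented inline with NumPy so the example is self-contained; the dataset and model are placeholders.

```python
# Sketch: resample ONLY the training fold; each validation fold keeps
# its natural class ratio, so scores are not inflated by leakage.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(
    n_samples=3000, weights=[0.95, 0.05], random_state=0
)
rng = np.random.default_rng(0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = []
for train_idx, val_idx in cv.split(X, y):
    X_tr, y_tr = X[train_idx], y[train_idx]
    # Random over-sampling of positives, inside the training fold only.
    pos = np.flatnonzero(y_tr == 1)
    neg = np.flatnonzero(y_tr == 0)
    extra = rng.choice(pos, size=len(neg) - len(pos), replace=True)
    idx = np.concatenate([neg, pos, extra])
    clf = LogisticRegression(max_iter=1000).fit(X_tr[idx], y_tr[idx])
    # Evaluate on the untouched validation fold.
    proba = clf.predict_proba(X[val_idx])[:, 1]
    scores.append(average_precision_score(y[val_idx], proba))
```

The same pattern extends to SMOTE or any other sampler: the resampling step must see only the training indices of each fold.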
- Recommend appropriate metrics (e.g., PR AUC, recall at fixed precision, balanced accuracy) and how to choose among them.
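A minimal sketch of two of these metrics, assuming scikit-learn; the toy labels and scores and the 0.9 precision target are illustrative.

```python
# Sketch: PR AUC via average_precision_score, and recall at a fixed
# precision target read off the precision-recall curve.
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

y_true = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
scores = np.array([0.1, 0.2, 0.15, 0.05, 0.3, 0.4, 0.9, 0.5, 0.8, 0.7])

pr_auc = average_precision_score(y_true, scores)

precision, recall, thresholds = precision_recall_curve(y_true, scores)
# Highest recall achievable while keeping precision >= 0.9.
ok = precision >= 0.9
recall_at_p90 = recall[ok].max() if ok.any() else 0.0
```

Both metrics focus on the positive class, which is why they stay informative at low prevalence where accuracy and ROC AUC can look deceptively good.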
- Outline how to tune hyperparameters under imbalance, including threshold selection.
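One common approach to threshold selection is to sweep candidate cutoffs on held-out validation predictions. The sketch below maximizes F2 (recall-weighted); the model, split, and beta value are illustrative assumptions, and in a full tuning loop this sweep would run inside each cross-validation fold.

```python
# Sketch: pick a decision threshold by sweeping cutoffs on validation
# predictions and maximizing a recall-weighted F-score.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=4000, weights=[0.97, 0.03], random_state=0
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=0
)

clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_tr, y_tr)
proba = clf.predict_proba(X_val)[:, 1]

# Evaluate F2 at each candidate cutoff; keep the best.
candidates = np.linspace(0.05, 0.95, 19)
f2 = [
    fbeta_score(y_val, (proba >= t).astype(int), beta=2)
    for t in candidates
]
best_threshold = candidates[int(np.argmax(f2))]
```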
- Discuss trade-offs across variance, bias, runtime, and calibration for the above strategies.