Take-home: End-to-end Imbalanced Binary Classification Pipeline (scikit-learn + imbalanced-learn)
Context
You are given a tabular, imbalanced binary classification problem (y ∈ {0, 1}, where class 1 is the minority). Build a clean, reproducible pipeline that prevents data leakage, compares imbalance-handling strategies, and delivers evaluation artifacts together with tuned decision thresholds.
Requirements
- Data preprocessing
  - Use a Pipeline (and a ColumnTransformer if there are mixed numeric/categorical features) to perform:
    - Numeric scaling
    - Categorical encoding
    - Class imbalance handling via resampling
    - Modeling
  - Ensure the resampling step happens inside each cross-validation fold to avoid leakage (a pipeline sketch follows the requirements list).
- Imbalance strategies to compare (both strategies are sketched after the list)
  - Class-weight adjustments (e.g., class_weight="balanced") without explicit resampling.
  - Explicit resampling (e.g., SMOTE or RandomUnderSampler) with class_weight=None.
- Modeling and tuning
  - Try at least two classifiers (e.g., LogisticRegression and RandomForest). You may optionally include an XGBoost-compatible estimator.
  - Hyperparameter tuning with StratifiedKFold cross-validation.
  - Primary metric: ROC-AUC. Also report PR-AUC (Average Precision).
  - Use GridSearchCV (or equivalent) with scoring={'roc_auc', 'average_precision'} and refit='roc_auc' (a tuning sketch follows the list).
- Evaluation artifacts and thresholding
  - On a held-out test set, produce:
    - Confusion matrix at a selected threshold
    - ROC curve and PR curve
    - Calibration curve
  - Demonstrate threshold tuning (sketched after the list) to:
    - Maximize F1, and
    - Maximize recall subject to precision ≥ 0.90.
- Reproducibility and documentation
  - Fixed random seeds; clean, well-documented code (see the seed-and-split sketch at the end).
  - Clear notes on how leakage is prevented and how to adapt the code to a real dataset.
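Reference sketches

The sketches below are illustrative, not prescriptive: the dataset, column names, and parameter grids are placeholders to adapt. First, a minimal reproducibility scaffold, assuming the features and labels are already loaded as X and y: one seed constant reused throughout, and a stratified split so the held-out test set preserves the class ratio.

```python
from sklearn.model_selection import train_test_split

RANDOM_STATE = 42  # single seed reused by every component below

# Stratify on y so the rare positive class keeps the same proportion in the
# train and test splits. X and y are placeholders for the real data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=RANDOM_STATE
)
```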
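A minimal sketch of the leakage-safe preprocessing-plus-modeling pipeline from the Data preprocessing requirement. The column names ("age", "income", "segment") are assumptions; swap in the real feature lists. Because the sampler lives inside an imbalanced-learn Pipeline, any cross-validation wrapped around it resamples only the training portion of each fold.

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # imblearn's Pipeline accepts samplers
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features = ["age", "income"]    # placeholder column names
categorical_features = ["segment"]      # placeholder column names

# Scale numeric columns, one-hot encode categorical columns.
preprocess = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numeric_features),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
    ]
)

# Preprocessing -> resampling -> model, all inside one pipeline, so
# cross_val_score/GridSearchCV resample each training fold separately.
clf = Pipeline(
    steps=[
        ("preprocess", preprocess),
        ("resample", SMOTE(random_state=42)),
        ("model", RandomForestClassifier(random_state=42)),
    ]
)
```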
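A sketch of the two imbalance strategies to compare, reusing the preprocess ColumnTransformer from the previous sketch (an assumption). Strategy A relies on class weights only; strategy B resamples explicitly and leaves class_weight at its default so the two effects are not mixed.

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Strategy A: class-weight adjustment, no resampling step.
weighted = Pipeline(
    steps=[
        ("preprocess", preprocess),
        ("model", LogisticRegression(class_weight="balanced", max_iter=1000)),
    ]
)

# Strategy B: explicit resampling (SMOTE here; RandomUnderSampler also works),
# with class_weight left as None.
resampled = ImbPipeline(
    steps=[
        ("preprocess", preprocess),
        ("resample", SMOTE(random_state=42)),
        ("model", LogisticRegression(class_weight=None, max_iter=1000)),
    ]
)
```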
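A sketch of the tuning step, assuming clf is the RandomForest pipeline from the pipeline sketch and X_train/y_train come from the stratified split above. Multi-metric scoring reports both ROC-AUC and average precision, while refit="roc_auc" keeps ROC-AUC as the primary metric.

```python
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Example grid for the "model" step of the pipeline; adjust as needed.
param_grid = {
    "model__n_estimators": [200, 500],
    "model__max_depth": [None, 10, 20],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(
    estimator=clf,
    param_grid=param_grid,
    scoring=["roc_auc", "average_precision"],  # ROC-AUC primary, PR-AUC reported
    refit="roc_auc",                           # refit the best model on ROC-AUC
    cv=cv,
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_)
print(search.cv_results_["mean_test_average_precision"][search.best_index_])
```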
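A sketch of the evaluation artifacts and threshold tuning on the held-out test set, assuming search is the fitted GridSearchCV from the tuning sketch and X_test/y_test come from the split sketch. It draws the ROC, PR, and calibration curves, then picks two thresholds from the precision-recall curve: the F1-maximizing one, and the highest-recall one with precision >= 0.90.

```python
import numpy as np
from sklearn.calibration import CalibrationDisplay
from sklearn.metrics import (
    ConfusionMatrixDisplay,
    PrecisionRecallDisplay,
    RocCurveDisplay,
    precision_recall_curve,
)

proba = search.predict_proba(X_test)[:, 1]  # P(y = 1) on the held-out test set

# Curve artifacts on the test set.
RocCurveDisplay.from_predictions(y_test, proba)
PrecisionRecallDisplay.from_predictions(y_test, proba)
CalibrationDisplay.from_predictions(y_test, proba, n_bins=10)

# precision/recall have one more entry than thresholds, so drop the last
# point before aligning them with the threshold array.
precision, recall, thresholds = precision_recall_curve(y_test, proba)
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)

# (a) Threshold that maximizes F1.
thr_f1 = thresholds[np.argmax(f1)]

# (b) Maximize recall subject to precision >= 0.90 (None if unattainable).
mask = precision[:-1] >= 0.90
thr_p90 = thresholds[mask][np.argmax(recall[:-1][mask])] if mask.any() else None

# Confusion matrix at the F1-optimal threshold.
y_pred = (proba >= thr_f1).astype(int)
ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
print(f"F1-optimal threshold: {thr_f1:.3f}; precision>=0.90 threshold: {thr_p90}")
```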