Build an Imbalanced Classification Pipeline (scikit-learn + imbalanced-learn)
Context
You are given a tabular dataset with a severely imbalanced binary target (e.g., minority class rate < 5%). Build an end-to-end classification pipeline that:
- Applies standard preprocessing to numeric and categorical features.
- Uses an appropriate resampling method to address imbalance.
- Trains a classifier.
- Evaluates precision, recall, and F1-score on a held-out test set.
Assume the input features X are in a pandas DataFrame and the target y is a pandas Series.
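When no real dataset is at hand, a small synthetic stand-in can match these assumptions. The sketch below is illustrative only: the function name `make_toy_data`, the column names, and the 4% minority rate are hypothetical choices, not part of the task.

```python
import numpy as np
import pandas as pd


def make_toy_data(n_rows=2000, minority_rate=0.04, seed=0):
    """Return (X, y): a mixed-type DataFrame and a rare-positive binary Series."""
    rng = np.random.default_rng(seed)
    # Fix the number of positives exactly, then shuffle their positions.
    n_pos = int(round(n_rows * minority_rate))
    labels = np.zeros(n_rows, dtype=int)
    labels[:n_pos] = 1
    rng.shuffle(labels)
    y = pd.Series(labels, name="target")
    X = pd.DataFrame({
        # Numeric features; "amount" is shifted for the minority class.
        "amount": rng.normal(loc=2.0 * labels, scale=1.0),
        "age": rng.normal(loc=40.0, scale=10.0, size=n_rows),
        # Categorical feature stored as strings.
        "channel": pd.Series(rng.choice(["web", "store", "phone"], size=n_rows),
                             dtype=object),
    })
    # Blank out roughly 5% of two columns so imputation has work to do.
    X.loc[rng.random(n_rows) < 0.05, "age"] = np.nan
    X.loc[rng.random(n_rows) < 0.05, "channel"] = np.nan
    return X, y
```

Calling `X, y = make_toy_data()` yields 2,000 rows with 80 positives, which is enough minority samples for SMOTE-style methods with their default neighbor count.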
Requirements
- Split the data into train and test sets using stratification to preserve class ratios.
- Preprocess features:
  - Numeric: impute missing values and standardize.
  - Categorical: impute missing values and encode safely (e.g., so that categories unseen during training do not break prediction).
- Resample only the training data (to avoid leakage into the test set) using a suitable method:
  - If only numeric features: SMOTE is acceptable.
  - If mixed types: use SMOTENC to correctly handle categorical features.
- Train a reasonable baseline classifier (e.g., logistic regression or a tree-based model).
- Report precision, recall, and F1-score on the test set (per-class and macro/weighted averages are acceptable).
Deliverables
- Reproducible Python code using scikit-learn and imbalanced-learn that implements the above and prints metrics on the held-out test set.
- Brief comments justifying major choices (resampling method, pipeline order).