Take-home: Imbalanced Binary Classification Pipeline with scikit-learn
You are training a binary classifier on tabular data with the following feature schema:
- Numeric: age, balance, n_logins_30d, minutes_watched_7d
- Categorical: country, device_type, plan
Positives are rare (~5%). Complete the tasks below using scikit-learn.
Assume you have a pandas DataFrame named df with these feature columns and a binary column named target (1 = positive, 0 = negative).
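For dry-running the tasks without the real data, a synthetic stand-in for df can be generated. Everything in this sketch other than the column names and the roughly 5% positive rate is an assumption made for illustration: the function name make_synthetic_df, the distributions, the category values, the share of missing cells, and the rule that links features to the label.

```python
import numpy as np
import pandas as pd


def make_synthetic_df(n_rows: int = 2000, seed: int = 0) -> pd.DataFrame:
    """Hypothetical stand-in for df: same columns as the brief, invented values."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame(
        {
            "age": rng.integers(18, 80, size=n_rows).astype(float),
            "balance": rng.normal(1000.0, 300.0, size=n_rows),
            "n_logins_30d": rng.poisson(5, size=n_rows).astype(float),
            "minutes_watched_7d": rng.exponential(60.0, size=n_rows),
            "country": rng.choice(["US", "DE", "IN", "BR"], size=n_rows),
            "device_type": rng.choice(["ios", "android", "web"], size=n_rows),
            "plan": rng.choice(["free", "basic", "pro"], size=n_rows),
        }
    )
    # Invented signal: more logins and watch time raise a noisy score.
    score = (
        0.4 * df["n_logins_30d"]
        + 0.01 * df["minutes_watched_7d"]
        + rng.normal(0.0, 1.0, size=n_rows)
    )
    # Label the top 5% of scores as positive so that positives stay rare.
    df["target"] = (score >= score.quantile(0.95)).astype(int)
    # Blank out a few numeric cells so a median imputer has something to do.
    df.loc[rng.random(n_rows) < 0.03, "balance"] = np.nan
    return df


df = make_synthetic_df()
print(df["target"].mean())  # fraction of positives in the synthetic data
```

Because the label rule is invented, any metric obtained on this data says nothing about the real problem; it only confirms that the code runs end to end.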
Tasks
- Split df into train/validation with stratification on target.
- Build a scikit-learn Pipeline with a ColumnTransformer:
  - Numeric → SimpleImputer(strategy='median') then StandardScaler
  - Categorical → OneHotEncoder(handle_unknown='ignore', min_frequency=0.01)
- Handle class imbalance via class_weight='balanced'. Briefly explain when you would instead use resampling (e.g., under/over-sampling).
- Tune at least two models using StratifiedKFold and GridSearchCV with scoring='average_precision':
  - LogisticRegression
  - GradientBoosting-type model (e.g., HistGradientBoostingClassifier)
- Ensure all preprocessing resides inside the Pipeline so no data leakage can occur during cross-validation.
- Report the best model and its validation metrics: precision, recall, and average precision (AP). Show how to persist the trained pipeline for inference.
Include code snippets and short justifications for key choices.
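For the reporting and persistence task, one possible sketch is shown below. It assumes model is an already fitted Pipeline (for example the best_estimator_ of a GridSearchCV) that exposes predict_proba. The helper names validation_report, save_pipeline, and load_pipeline, the 0.5 decision threshold, and the file name best_pipeline.joblib are all hypothetical.

```python
from pathlib import Path

import joblib
from sklearn.metrics import average_precision_score, precision_score, recall_score


def validation_report(model, X_val, y_val, threshold: float = 0.5) -> dict:
    """Precision and recall at a threshold, plus threshold-free AP.

    The 0.5 default is an assumption; with rare positives a different
    operating point may suit the use case better.
    """
    proba = model.predict_proba(X_val)[:, 1]  # probability of the positive class
    pred = (proba >= threshold).astype(int)
    return {
        "precision": precision_score(y_val, pred, zero_division=0),
        "recall": recall_score(y_val, pred, zero_division=0),
        "average_precision": average_precision_score(y_val, proba),
    }


def save_pipeline(model, path="best_pipeline.joblib") -> Path:
    """Persist the whole fitted Pipeline (preprocessing plus classifier)."""
    out = Path(path)
    joblib.dump(model, out)
    return out


def load_pipeline(path="best_pipeline.joblib"):
    """Reload for inference; only load files from a trusted source,
    because joblib files are pickle-based."""
    return joblib.load(path)
```

Saving the full Pipeline rather than the classifier alone means inference code can pass raw feature columns to predict_proba and get the same imputation, scaling, and encoding that were fitted on the training data. Reloading is generally expected to use the same scikit-learn version that wrote the file, so recording that version alongside the artifact is a sensible habit.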