Build an imbalanced classification pipeline with sklearn
Company: DRW
Role: Machine Learning Engineer
Category: Machine Learning
Difficulty: hard
Interview Round: Take-home Project
Using scikit‑learn and imbalanced‑learn, build an end‑to‑end binary classification pipeline for an imbalanced dataset:
- Create a Pipeline (optionally with ColumnTransformer) that performs preprocessing (scaling/encoding), resampling (e.g., RandomUnderSampler or SMOTE/ADASYN), and modeling (e.g., LogisticRegression, RandomForest, or XGBoost‑compatible wrapper).
- Compare 'class_weight' adjustments vs. explicit resampling; tune hyperparameters with StratifiedKFold cross‑validation using ROC‑AUC as the primary metric and also report PR‑AUC.
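- Prevent leakage by applying resampling within each CV fold (i.e., as a step of the imbalanced‑learn Pipeline passed to GridSearchCV), so the sampler sees training folds only and validation folds keep their original class distribution.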
- Produce evaluation artifacts: confusion matrix at a selected threshold, ROC and PR curves, calibration curve; demonstrate threshold tuning to optimize F1 or recall at precision ≥ 0.9.
- Provide clean, reproducible code with fixed random seeds and clear documentation.
Quick Answer: This question evaluates core ML concepts, assumptions, math intuition, training/evaluation trade-offs, and practical failure modes in a realistic interview setting. A strong answer states its assumptions, handles edge cases, explains trade-offs, and shows clearly how the result is validated.
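The threshold-tuning requirement (maximize F1, or maximize recall subject to precision ≥ 0.9) can be sketched with `precision_recall_curve`. This is an illustration under stated assumptions: the synthetic data, the hold-out split, and the helper names `recall_at_precision` and `best_f1_threshold` are hypothetical and not part of the task. Thresholds are chosen on a validation split rather than on the final test data, and plotting of the ROC, PR, and calibration curves is left out for brevity.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_recall_curve
from sklearn.model_selection import train_test_split

SEED = 42


def recall_at_precision(y_true, y_score, min_precision=0.9):
    """Return (threshold, precision, recall) with the highest recall among
    thresholds whose precision >= min_precision, or None if none qualify."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # The final (precision=1, recall=0) point has no threshold; drop it.
    precision, recall = precision[:-1], recall[:-1]
    idx = np.flatnonzero(precision >= min_precision)
    if idx.size == 0:
        return None
    best = idx[np.argmax(recall[idx])]
    return float(thresholds[best]), float(precision[best]), float(recall[best])


def best_f1_threshold(y_true, y_score):
    """Return (threshold, f1) for the threshold that maximizes F1."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    p, r = precision[:-1], recall[:-1]
    f1 = 2 * p * r / np.maximum(p + r, 1e-12)  # guard against 0/0
    best = int(np.argmax(f1))
    return float(thresholds[best]), float(f1[best])


# Assumption: synthetic data and a simple stratified hold-out split.
X, y = make_classification(
    n_samples=4000, n_features=10, n_informative=5,
    weights=[0.9, 0.1], random_state=SEED,
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=SEED
)
model = LogisticRegression(max_iter=1000, class_weight="balanced")
scores = model.fit(X_train, y_train).predict_proba(X_val)[:, 1]

choice = recall_at_precision(y_val, scores, min_precision=0.9)
if choice is None:
    # No threshold reaches the precision floor; fall back to the F1 optimum.
    threshold, _ = best_f1_threshold(y_val, scores)
else:
    threshold = choice[0]

# Confusion matrix at the selected threshold (rows: true, columns: predicted).
y_pred = (scores >= threshold).astype(int)
cm = confusion_matrix(y_val, y_pred)
print("threshold:", threshold)
print(cm)
```

Sweeping the thresholds returned by `precision_recall_curve` avoids an arbitrary grid, since those are the only points at which precision or recall can change.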