Build an imbalanced classification pipeline with sklearn
Company: DRW
Role: Machine Learning Engineer
Category: Machine Learning
Difficulty: hard
Interview Round: Take-home Project
Using scikit‑learn and imbalanced‑learn, build an end‑to‑end binary classification pipeline for an imbalanced dataset:
- Create a Pipeline (optionally with a ColumnTransformer) that performs preprocessing (scaling/encoding), resampling (e.g., RandomUnderSampler or SMOTE/ADASYN), and modeling (e.g., LogisticRegression, RandomForestClassifier, or a scikit‑learn‑compatible XGBoost wrapper).
- Compare 'class_weight' adjustments vs. explicit resampling; tune hyperparameters with StratifiedKFold cross‑validation using ROC‑AUC as the primary metric and also report PR‑AUC.
- Prevent leakage by applying resampling within each CV fold (i.e., inside the pipeline passed to GridSearchCV), so that only training folds are resampled.
- Produce evaluation artifacts: a confusion matrix at a selected threshold, ROC and PR curves, and a calibration curve; demonstrate threshold tuning that either maximizes F1 or maximizes recall subject to precision ≥ 0.9.
- Provide clean, reproducible code with fixed random seeds and clear documentation.
Quick Answer: This question evaluates a candidate's competency in building an end-to-end imbalanced binary classification pipeline, covering preprocessing, resampling strategies, classifier comparison, cross-validated hyperparameter tuning, and evaluation metrics (ROC-AUC, PR-AUC) with tools like scikit-learn and imbalanced-learn.
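The threshold‑tuning step from the task can be sketched with precision_recall_curve. This is a hedged illustration under stated assumptions: the data is synthetic, the model is a plain class‑weighted LogisticRegression, MIN_PRECISION and SEED are names chosen for the example, and the threshold is picked on a single held‑out validation split rather than on the final test set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_recall_curve
from sklearn.model_selection import train_test_split

SEED = 42             # assumed fixed seed
MIN_PRECISION = 0.9   # precision floor from the task statement

# Synthetic stand-in data (assumption): roughly 10% positives.
X, y = make_classification(
    n_samples=4000, n_features=10, n_informative=5,
    weights=[0.9, 0.1], random_state=SEED,
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=SEED
)

clf = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)
scores = clf.predict_proba(X_val)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_val, scores)
# precision/recall have one more entry than thresholds; drop the final
# (precision=1, recall=0) point so the three arrays line up.
precision, recall = precision[:-1], recall[:-1]

# Option 1: threshold that maximizes F1.
f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
t_f1 = thresholds[np.argmax(f1)]

# Option 2: highest recall among thresholds meeting the precision floor.
t_constrained = None
ok = np.flatnonzero(precision >= MIN_PRECISION)
if ok.size > 0:
    best = ok[np.argmax(recall[ok])]
    t_constrained = thresholds[best]

# Fall back to the F1 threshold if the precision floor is unreachable.
chosen = t_constrained if t_constrained is not None else t_f1
y_pred = (scores >= chosen).astype(int)
cm = confusion_matrix(y_val, y_pred)  # rows: true class, columns: predicted

print("F1-optimal threshold:", t_f1)
print("Precision-constrained threshold:", t_constrained)
print(cm)
```

The same scores array can feed roc_curve and sklearn.calibration.calibration_curve to produce the remaining evaluation artifacts. Note that oversampling and class weighting both tend to shift predicted probabilities, so the calibration curve is worth checking before interpreting the chosen threshold as a probability.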