Take-home: End-to-end Imbalanced Binary Classification Pipeline (scikit-learn + imbalanced-learn)
Context
You are given a tabular, imbalanced binary classification problem (y ∈ {0, 1}, where class 1 is the minority). Build a clean, reproducible pipeline that prevents data leakage, compares imbalance-handling strategies, and delivers evaluation artifacts together with tuned decision thresholds.
Requirements
- Data preprocessing
  - Use a Pipeline (and a ColumnTransformer if there are mixed numeric/categorical features) to perform:
    - Numeric scaling
    - Categorical encoding
    - Class imbalance handling via resampling
    - Modeling
  - Ensure the resampling step happens inside each cross-validation fold to avoid leakage (a pipeline sketch follows the requirements list).
- Imbalance strategies to compare (both strategies are sketched after the list)
  - Class-weight adjustments (e.g., class_weight="balanced") without explicit resampling.
  - Explicit resampling (e.g., SMOTE or RandomUnderSampler) with class_weight=None.
- Modeling and tuning
  - Try at least two classifiers (e.g., LogisticRegression and RandomForest). You may optionally include an XGBoost-compatible estimator.
  - Hyperparameter tuning with StratifiedKFold cross-validation.
  - Primary metric: ROC-AUC. Also report PR-AUC (Average Precision).
  - Use GridSearchCV (or equivalent) with scoring={'roc_auc', 'average_precision'} and refit='roc_auc' (a tuning sketch follows the list).
- Evaluation artifacts and thresholding
  - On a held-out test set, produce:
    - Confusion matrix at a selected threshold
    - ROC curve and PR curve
    - Calibration curve
  - Demonstrate threshold tuning (sketched after the list) to:
    - Maximize F1, and
    - Maximize recall subject to precision ≥ 0.90.
- Reproducibility and documentation
  - Fixed random seeds; clean, well-documented code (see the seed-and-split sketch at the end).
  - Clear notes on how leakage is prevented and how to adapt the code to a real dataset.
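Reference sketches

The sketches below are illustrative, not prescriptive: the dataset, column names, and parameter grids are placeholders to adapt. First, a minimal reproducibility scaffold, assuming the features and labels are already loaded as X and y: one seed constant reused throughout, and a stratified split so the held-out test set preserves the class ratio.

```python
from sklearn.model_selection import train_test_split

RANDOM_STATE = 42  # single seed reused by every component below

# Stratify on y so the rare positive class keeps the same proportion in the
# train and test splits. X and y are placeholders for the real data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=RANDOM_STATE
)
```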
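A minimal sketch of the leakage-safe preprocessing-plus-modeling pipeline from the Data preprocessing requirement. The column names ("age", "income", "segment") are assumptions; swap in the real feature lists. Because the sampler lives inside an imbalanced-learn Pipeline, any cross-validation wrapped around it resamples only the training portion of each fold.

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # imblearn's Pipeline accepts samplers
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features = ["age", "income"]    # placeholder column names
categorical_features = ["segment"]      # placeholder column names

# Scale numeric columns, one-hot encode categorical columns.
preprocess = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numeric_features),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
    ]
)

# Preprocessing -> resampling -> model, all inside one pipeline, so
# cross_val_score/GridSearchCV resample each training fold separately.
clf = Pipeline(
    steps=[
        ("preprocess", preprocess),
        ("resample", SMOTE(random_state=42)),
        ("model", RandomForestClassifier(random_state=42)),
    ]
)
```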
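A sketch of the two imbalance strategies to compare, reusing the preprocess ColumnTransformer from the previous sketch (an assumption). Strategy A relies on class weights only; strategy B resamples explicitly and leaves class_weight at its default so the two effects are not mixed.

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Strategy A: class-weight adjustment, no resampling step.
weighted = Pipeline(
    steps=[
        ("preprocess", preprocess),
        ("model", LogisticRegression(class_weight="balanced", max_iter=1000)),
    ]
)

# Strategy B: explicit resampling (SMOTE here; RandomUnderSampler also works),
# with class_weight left as None.
resampled = ImbPipeline(
    steps=[
        ("preprocess", preprocess),
        ("resample", SMOTE(random_state=42)),
        ("model", LogisticRegression(class_weight=None, max_iter=1000)),
    ]
)
```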
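A sketch of the tuning step, assuming clf is the RandomForest pipeline from the pipeline sketch and X_train/y_train come from the stratified split above. Multi-metric scoring reports both ROC-AUC and average precision, while refit="roc_auc" keeps ROC-AUC as the primary metric.

```python
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Example grid for the "model" step of the pipeline; adjust as needed.
param_grid = {
    "model__n_estimators": [200, 500],
    "model__max_depth": [None, 10, 20],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(
    estimator=clf,
    param_grid=param_grid,
    scoring=["roc_auc", "average_precision"],  # ROC-AUC primary, PR-AUC reported
    refit="roc_auc",                           # refit the best model on ROC-AUC
    cv=cv,
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_)
print(search.cv_results_["mean_test_average_precision"][search.best_index_])
```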
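A sketch of the evaluation artifacts and threshold tuning on the held-out test set, assuming search is the fitted GridSearchCV from the tuning sketch and X_test/y_test come from the split sketch. It draws the ROC, PR, and calibration curves, then picks two thresholds from the precision-recall curve: the F1-maximizing one, and the highest-recall one with precision >= 0.90.

```python
import numpy as np
from sklearn.calibration import CalibrationDisplay
from sklearn.metrics import (
    ConfusionMatrixDisplay,
    PrecisionRecallDisplay,
    RocCurveDisplay,
    precision_recall_curve,
)

proba = search.predict_proba(X_test)[:, 1]  # P(y = 1) on the held-out test set

# Curve artifacts on the test set.
RocCurveDisplay.from_predictions(y_test, proba)
PrecisionRecallDisplay.from_predictions(y_test, proba)
CalibrationDisplay.from_predictions(y_test, proba, n_bins=10)

# precision/recall have one more entry than thresholds, so drop the last
# point before aligning them with the threshold array.
precision, recall, thresholds = precision_recall_curve(y_test, proba)
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)

# (a) Threshold that maximizes F1.
thr_f1 = thresholds[np.argmax(f1)]

# (b) Maximize recall subject to precision >= 0.90 (None if unattainable).
mask = precision[:-1] >= 0.90
thr_p90 = thresholds[mask][np.argmax(recall[:-1][mask])] if mask.any() else None

# Confusion matrix at the F1-optimal threshold.
y_pred = (proba >= thr_f1).astype(int)
ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
print(f"F1-optimal threshold: {thr_f1:.3f}; precision>=0.90 threshold: {thr_p90}")
```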