PracHub

Build an imbalanced classification pipeline with sklearn

Last updated: Mar 29, 2026

Quick Overview

This question evaluates a candidate's competency in building an end-to-end imbalanced binary classification pipeline, covering preprocessing, resampling strategies, classifier comparison, cross-validated hyperparameter tuning, and evaluation metrics (ROC-AUC, PR-AUC) with tools like scikit-learn and imbalanced-learn.


Company: DRW

Role: Machine Learning Engineer

Category: Machine Learning

Difficulty: hard

Interview Round: Take-home Project

Using scikit-learn and imbalanced-learn, build an end-to-end binary classification pipeline for an imbalanced dataset:

  • Create a Pipeline (optionally with ColumnTransformer) that performs preprocessing (scaling/encoding), resampling (e.g., RandomUnderSampler or SMOTE/ADASYN), and modeling (e.g., LogisticRegression, RandomForest, or an XGBoost-compatible wrapper).
  • Compare class_weight adjustments against explicit resampling; tune hyperparameters with StratifiedKFold cross-validation, using ROC-AUC as the primary metric and also reporting PR-AUC.
  • Prevent leakage by applying resampling within each CV fold (i.e., inside the pipeline passed to GridSearchCV).
  • Produce evaluation artifacts: a confusion matrix at a selected threshold, ROC and PR curves, and a calibration curve; demonstrate threshold tuning to optimize F1 or recall at precision ≥ 0.9.
  • Provide clean, reproducible code with fixed random seeds and clear documentation.


Jul 31, 2025, 12:00 AM

Take-home: End-to-end Imbalanced Binary Classification Pipeline (scikit-learn + imbalanced-learn)

Context

You are given a tabular, imbalanced binary classification problem (y ∈ {0, 1}, with minority class 1). Build a clean, reproducible pipeline that prevents data leakage, compares imbalance strategies, and delivers evaluation artifacts and threshold tuning.
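
Since no dataset is attached to the prompt, a stand-in can be generated for development. The snippet below is a minimal sketch under that assumption: the sample size, feature counts, and 95/5 class ratio are arbitrary choices, not part of the task. The stratified split keeps the minority share nearly identical in the training and held-out sets.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

SEED = 42  # fixed seed so the split is reproducible

# Assumption: a synthetic tabular problem with about 5% positives (class 1).
X, y = make_classification(
    n_samples=5000, n_features=12, n_informative=6,
    weights=[0.95, 0.05], random_state=SEED,
)

# Hold out 20% for final evaluation; stratify to preserve the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=SEED,
)

print(f"minority share: train={y_train.mean():.3f}, test={y_test.mean():.3f}")
```

With a real dataset, `X` and `y` would come from the loaded table instead, and the rest of the split stays the same.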

Requirements

  1. Data preprocessing
    • Use a Pipeline (and a ColumnTransformer if there are mixed numeric/categorical features) to perform:
      • Numeric scaling
      • Categorical encoding
      • Class imbalance handling via resampling
      • Modeling
    • Ensure the resampling step happens inside each cross-validation fold to avoid leakage.
  2. Imbalance strategies to compare
    • Class-weight adjustments (e.g., class_weight="balanced") without explicit resampling.
    • Explicit resampling (e.g., SMOTE or RandomUnderSampler) with class_weight=None.
  3. Modeling and tuning
    • Try at least two classifiers (e.g., LogisticRegression and RandomForest). You may optionally include an XGBoost-compatible estimator.
    • Hyperparameter tuning with StratifiedKFold cross-validation.
    • Primary metric: ROC-AUC. Also report PR-AUC (Average Precision).
    • Use GridSearchCV (or equivalent) with scoring=['roc_auc', 'average_precision'] and refit='roc_auc'.
  4. Evaluation artifacts and thresholding
    • On a held-out test set, produce:
      • Confusion matrix at a selected threshold
      • ROC curve and PR curve
      • Calibration curve
    • Demonstrate threshold tuning to:
      • Maximize F1, and
      • Maximize recall subject to precision ≥ 0.90.
  5. Reproducibility and documentation
    • Fixed random seeds; clean, well-documented code.
    • Clear notes on how leakage is prevented and how to adapt to a real dataset.
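
The two threshold-tuning goals in requirement 4 can be sketched with `precision_recall_curve`. This is one possible approach, not the expected answer: the helper names `threshold_max_f1` and `threshold_max_recall` are hypothetical, the data is synthetic, and the thresholds are chosen on out-of-fold training predictions (an assumption made here so that the held-out test set is used only for the final confusion matrix).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_recall_curve
from sklearn.model_selection import StratifiedKFold, cross_val_predict, train_test_split

SEED = 42


def threshold_max_f1(y_true, scores):
    """Return (threshold, f1) maximizing F1 over the PR-curve thresholds."""
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    p, r = precision[:-1], recall[:-1]  # the final (1, 0) point has no threshold
    f1 = 2 * p * r / np.maximum(p + r, 1e-12)
    i = int(np.argmax(f1))
    return float(thresholds[i]), float(f1[i])


def threshold_max_recall(y_true, scores, min_precision=0.90):
    """Return (threshold, recall, precision) with the highest recall among
    thresholds whose precision >= min_precision, or None if none qualifies."""
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    p, r = precision[:-1], recall[:-1]
    ok = p >= min_precision
    if not ok.any():
        return None
    i = int(np.argmax(np.where(ok, r, -1.0)))
    return float(thresholds[i]), float(r[i]), float(p[i])


# Assumption: synthetic data with roughly 10% positives.
X, y = make_classification(
    n_samples=4000, n_features=10, n_informative=5,
    weights=[0.9, 0.1], random_state=SEED,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=SEED,
)

model = LogisticRegression(class_weight="balanced", max_iter=1000)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)

# Out-of-fold probabilities: each training row is scored by a model that did not see it.
oof = cross_val_predict(model, X_train, y_train, cv=cv, method="predict_proba")[:, 1]

thr_f1, best_f1 = threshold_max_f1(y_train, oof)
constrained = threshold_max_recall(y_train, oof, min_precision=0.90)

# Refit on all training data and apply the F1-optimal threshold to the test set.
model.fit(X_train, y_train)
test_scores = model.predict_proba(X_test)[:, 1]
cm = confusion_matrix(y_test, (test_scores >= thr_f1).astype(int))

print(f"F1-optimal threshold={thr_f1:.3f} (out-of-fold F1={best_f1:.3f})")
print("recall at precision >= 0.90:", constrained)
print("test confusion matrix:\n", cm)
```

`threshold_max_recall` returns `None` when no threshold reaches the precision floor, which is worth reporting explicitly rather than silently falling back to 0.5.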

