PracHub

Build an imbalanced classification pipeline with sklearn

Last updated: Mar 29, 2026

Quick Overview

This take-home question tests core ML concepts, assumptions, math intuition, training/evaluation trade-offs, and practical failure modes in a realistic interview setting. A strong answer states its assumptions, handles edge cases, explains trade-offs, and shows clearly how the result is validated.


Company: DRW

Role: Machine Learning Engineer

Category: Machine Learning

Difficulty: hard

Interview Round: Take-home Project

Using scikit-learn and imbalanced-learn, build an end-to-end binary classification pipeline for an imbalanced dataset:

  • Create a Pipeline (optionally with a ColumnTransformer) that performs preprocessing (scaling/encoding), resampling (e.g., RandomUnderSampler or SMOTE/ADASYN), and modeling (e.g., LogisticRegression, RandomForest, or an XGBoost-compatible wrapper).
  • Compare class_weight adjustments against explicit resampling; tune hyperparameters with StratifiedKFold cross-validation, using ROC-AUC as the primary metric and also reporting PR-AUC.
  • Prevent leakage by applying resampling within each CV fold (i.e., inside the pipeline passed to GridSearchCV).
  • Produce evaluation artifacts: a confusion matrix at a selected threshold, ROC and PR curves, and a calibration curve; demonstrate threshold tuning to optimize F1 or recall at precision ≥ 0.9.
  • Provide clean, reproducible code with fixed random seeds and clear documentation.

Jul 31, 2025, 12:00 AM

Take-home: End-to-end Imbalanced Binary Classification Pipeline (scikit-learn + imbalanced-learn)

Context

You are given a tabular, imbalanced binary classification problem (y ∈ {0, 1}, with minority class 1). Build a clean, reproducible pipeline that prevents data leakage, compares imbalance strategies, and delivers evaluation artifacts and threshold tuning.

Requirements

  1. Data preprocessing
    • Use a Pipeline (and a ColumnTransformer if there are mixed numeric/categorical features) to perform:
      • Numeric scaling
      • Categorical encoding
      • Class imbalance handling via resampling
      • Modeling
    • Ensure the resampling step happens inside each cross-validation fold to avoid leakage.
  2. Imbalance strategies to compare
    • Class-weight adjustments (e.g., class_weight="balanced") without explicit resampling.
    • Explicit resampling (e.g., SMOTE or RandomUnderSampler) with class_weight=None.
  3. Modeling and tuning
    • Try at least two classifiers (e.g., LogisticRegression and RandomForest). You may optionally include an XGBoost-compatible estimator.
    • Hyperparameter tuning with StratifiedKFold cross-validation.
    • Primary metric: ROC-AUC. Also report PR-AUC (Average Precision).
    • Use GridSearchCV (or equivalent) with scoring=['roc_auc', 'average_precision'] (a list or dict of scorer names) and refit='roc_auc'.
  4. Evaluation artifacts and thresholding
    • On a held-out test set, produce:
      • Confusion matrix at a selected threshold
      • ROC curve and PR curve
      • Calibration curve
    • Demonstrate threshold tuning to:
      • Maximize F1, and
      • Maximize recall subject to precision ≥ 0.90.
  5. Reproducibility and documentation
    • Fixed random seeds; clean, well-documented code.
    • Clear notes on how leakage is prevented and how to adapt to a real dataset.
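
The thresholding requirement (item 4) can be sketched as below, using scikit-learn only. The synthetic data, the plain LogisticRegression stand-in, and the separate validation split are assumptions for illustration: thresholds are chosen on validation data and then applied once to the held-out test set, so the test metrics are not tuned on themselves. Plotting is omitted; the arrays computed here are what ROC, PR, and calibration plots would be drawn from.

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    average_precision_score,
    confusion_matrix,
    precision_recall_curve,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split

SEED = 42

# Synthetic stand-in data (assumption): about 5% positives.
X, y = make_classification(
    n_samples=4000, n_features=10, n_informative=5, n_redundant=0,
    n_clusters_per_class=1, weights=[0.95, 0.05], class_sep=2.0,
    random_state=SEED,
)
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=SEED
)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=SEED
)

# Stand-in model; any fitted estimator with predict_proba would do.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
p_val = model.predict_proba(X_val)[:, 1]
p_test = model.predict_proba(X_test)[:, 1]

# precision/recall have one more entry than thresholds; drop the final
# (precision=1, recall=0) point so all three arrays align.
precision, recall, thresholds = precision_recall_curve(y_val, p_val)
prec, rec = precision[:-1], recall[:-1]

# Threshold that maximizes F1 on validation data (0/0 treated as F1 = 0).
denom = prec + rec
f1 = np.divide(2 * prec * rec, denom, out=np.zeros_like(denom), where=denom > 0)
thr_f1 = float(thresholds[np.argmax(f1)])

# Threshold that maximizes recall subject to precision >= 0.90, if one exists.
ok = np.flatnonzero(prec >= 0.90)
thr_p90 = float(thresholds[ok[np.argmax(rec[ok])]]) if ok.size else None

# Held-out test artifacts at the chosen threshold.
cm_f1 = confusion_matrix(y_test, (p_test >= thr_f1).astype(int), labels=[0, 1])
test_roc_auc = roc_auc_score(y_test, p_test)
test_pr_auc = average_precision_score(y_test, p_test)
frac_pos, mean_pred = calibration_curve(y_test, p_test, n_bins=10, strategy="quantile")

print("F1-optimal threshold:", thr_f1)
print("Max-recall threshold at precision >= 0.90:", thr_p90)
print("Test confusion matrix at F1 threshold:\n", cm_f1)
print("Test ROC-AUC:", test_roc_auc, "PR-AUC:", test_pr_auc)
```

If no threshold reaches precision 0.90 on the validation data, thr_p90 is None, and that outcome is worth reporting rather than silently relaxing the constraint.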

Constraints & Assumptions

  • Preserve the scope, facts, inputs, and requested outputs from the prompt above.
  • If the prompt leaves a detail unspecified, state a reasonable assumption before relying on it.
  • Keep the answer interview-ready: concise enough to present, but concrete enough to implement or evaluate.

Clarifying Questions to Ask

  • Clarify the task, data shape, labels, constraints, and evaluation metric.
  • State assumptions behind the math or modeling technique you choose.
  • Connect theory to practical training, debugging, and deployment implications.

What a Strong Answer Covers

  • Correct definitions and formulas where the prompt requires them.
  • A practical explanation of how the method behaves on real data.
  • Trade-offs, failure modes, diagnostics, and mitigation strategies.
  • Evaluation choices that match the product or modeling objective.

Follow-up Questions

  • How would noisy labels, class imbalance, or distribution shift affect the answer?
  • What would you monitor after deployment?
  • Which baseline would you compare against first?
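
For the baseline question, one common first comparison (an assumption here, not something the prompt prescribes) is a prior-only DummyClassifier scored with the same folds and metrics. It predicts the same probability for every row, so its ROC-AUC is 0.5 and its average precision is approximately the positive rate; a tuned model should clear both by a visible margin. The data below is a synthetic stand-in.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

SEED = 42

# Synthetic stand-in data (assumption): about 5% positives.
X, y = make_classification(
    n_samples=2000, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.95, 0.05], random_state=SEED,
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
scores = cross_validate(
    DummyClassifier(strategy="prior"),  # constant class-prior probabilities
    X, y, cv=cv, scoring=["roc_auc", "average_precision"],
)

baseline_roc_auc = scores["test_roc_auc"].mean()
baseline_pr_auc = scores["test_average_precision"].mean()
prevalence = y.mean()

print("Baseline ROC-AUC:", baseline_roc_auc)
print("Baseline PR-AUC:", baseline_pr_auc, "positive rate:", prevalence)
```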