PracHub

Build pipeline for imbalanced classification

Last updated: Mar 29, 2026

Quick Overview

This question evaluates a candidate's competency in designing and implementing an end-to-end imbalanced classification pipeline, covering numeric and categorical preprocessing, appropriate resampling methods for severe class imbalance, model training, and evaluation via precision, recall, and F1.

Company: DRW

Role: Machine Learning Engineer

Category: ML System Design

Difficulty: medium

Interview Round: Take-home Project

Question

Using scikit-learn and imbalanced-learn, build a classification pipeline that handles severe class imbalance, performs standard preprocessing, applies an appropriate resampling method, trains a classifier, and outputs precision, recall, and F1-score on a held-out set.

Aug 4, 2025, 10:55 AM

Build an Imbalanced Classification Pipeline (scikit-learn + imbalanced-learn)

Context

You are given a tabular dataset with a severely imbalanced binary target (e.g., minority class rate < 5%). Build an end-to-end classification pipeline that:

  • Applies standard preprocessing to numeric and categorical features.
  • Uses an appropriate resampling method to address imbalance.
  • Trains a classifier.
  • Evaluates precision, recall, and F1-score on a held-out test set.

Assume the input features X are in a pandas DataFrame and the target y is a pandas Series.
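
Before building anything, it can help to confirm how skewed the target is and which columns are numeric versus categorical. The snippet below is a minimal sketch on a hypothetical six-row frame; the column names `amount` and `channel` and all values are made up for illustration and stand in for the real X and y.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real X (DataFrame) and y (Series).
X = pd.DataFrame({
    "amount": [10.5, np.nan, 3.2, 8.8, 1.0, 7.7],
    "channel": ["web", "app", None, "web", "branch", "app"],
})
y = pd.Series([0, 0, 0, 0, 0, 1], name="target")

# Share of each class; the smallest share is the minority class rate.
class_rates = y.value_counts(normalize=True)
minority_rate = class_rates.min()

# Split columns by dtype so each group can get its own preprocessing.
num_cols = X.select_dtypes(include="number").columns.tolist()
cat_cols = X.select_dtypes(exclude="number").columns.tolist()

# Missing values per column, which tells us where imputation is needed.
missing = X.isna().sum()

print(class_rates)
print("numeric:", num_cols, "categorical:", cat_cols)
print(missing)
```

Selecting columns by dtype is only a heuristic: integer-coded categories would land in `num_cols` and need to be moved by hand.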

Requirements

  1. Split the data into train/test using stratification to preserve class ratios.
  2. Preprocess features:
    • Numeric: impute missing values and standardize.
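    • Categorical: impute missing values and encode so that categories unseen during training do not break prediction (e.g., one-hot encoding that ignores unknown categories).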
  3. Resample only the training data (avoid leakage) using a suitable method:
    • If only numeric features: SMOTE is acceptable.
    • If mixed types: use SMOTENC to correctly handle categorical features.
  4. Train a reasonable baseline classifier (e.g., logistic regression or tree-based model).
  5. Report precision, recall, and F1-score on the test set (per-class and macro/weighted averages are acceptable).

Deliverables

  • Reproducible Python code using scikit-learn and imbalanced-learn that implements the above and prints metrics on the held-out test set.
  • Brief comments justifying major choices (resampling method, pipeline order).

