PracHub

Build a leak-free sklearn pipeline

Last updated: Mar 29, 2026

Quick Overview

This question evaluates a candidate's ability to construct a leak-free scikit-learn pipeline encompassing column-wise preprocessing, imbalanced binary classification handling, stratified cross-validation, hyperparameter tuning, and model persistence.



Company: Boston Consulting Group

Role: Data Scientist

Category: Machine Learning

Difficulty: medium

Interview Round: Take-home Project



Oct 13, 2025, 9:49 PM

Take-home: Imbalanced Binary Classification Pipeline with scikit-learn

You are training a binary classifier on tabular data with the following feature schema:

  • Numeric: age, balance, n_logins_30d, minutes_watched_7d
  • Categorical: country, device_type, plan

Positives are rare (~5%). Complete the tasks below using scikit-learn.

Assume you have a pandas DataFrame df with those feature columns and a binary target column target (1 = positive, 0 = negative).

Tasks

  1. Split df into train/validation with stratification on target.
  2. Build a scikit-learn Pipeline with a ColumnTransformer:
    • Numeric → SimpleImputer(strategy='median') then StandardScaler
    • Categorical → OneHotEncoder(handle_unknown='ignore', min_frequency=0.01)
  3. Handle class imbalance via class_weight='balanced'. Briefly explain when you would instead use resampling (e.g., under/over-sampling).
  4. Tune at least two models using StratifiedKFold and GridSearchCV with scoring='average_precision':
    • LogisticRegression
    • A gradient-boosting model (e.g., HistGradientBoostingClassifier)
  5. Ensure all preprocessing resides inside the Pipeline so no data leakage can occur during cross-validation.
  6. Report the best model and its validation metrics: precision, recall, and average precision (AP). Show how to persist the trained pipeline for inference.

Include code snippets and short justifications for key choices.

