Build a leak-free sklearn pipeline

This is a machine learning interview question from Boston Consulting Group for Data Scientist roles, listed on PracHub.

Question

Take-home: Imbalanced Binary Classification Pipeline with scikit-learn

You are training a binary classifier on tabular data with the following feature schema:

Numeric: age, balance, n_logins_30d, minutes_watched_7d
Categorical: country, device_type, plan

Positives are rare (~5%). Complete the tasks below using scikit-learn.

Assume you have a pandas DataFrame df with those feature columns and a binary target column target (1 = positive, 0 = negative).

Tasks

1. Split df into train/validation with stratification on target.
2. Build a scikit-learn Pipeline with a ColumnTransformer:
   - Numeric → SimpleImputer(strategy='median') then StandardScaler
   - Categorical → OneHotEncoder(handle_unknown='ignore', min_frequency=0.01)
3. Handle class imbalance via class_weight='balanced'. Briefly explain when you would instead use resampling (e.g., under/over-sampling).
4. Tune at least two models using StratifiedKFold and GridSearchCV with scoring='average_precision':
   - LogisticRegression
   - GradientBoosting-type model (e.g., HistGradientBoostingClassifier)
5. Ensure all preprocessing resides inside the Pipeline so no data leakage can occur during cross-validation.
6. Report the best model and its validation metrics: precision, recall, and average precision (AP). Show how to persist the trained pipeline for inference.

Include code snippets and short justifications for key choices.
