Build and validate a binary classifier
Company: Capital One
Role: Data Scientist
Category: Machine Learning
Difficulty: hard
Interview Round: HR Screen
Using the features from the previous question (label is is_active_30d with ~1% positives), implement a scikit-learn Pipeline that (a) imputes as specified (via your functions), (b) encodes categoricals with OneHotEncoder(handle_unknown='ignore'), (c) scales numerics with StandardScaler, and (d) trains a classifier robust to imbalance. Use GroupKFold with 5 folds, grouping by user_id, to prevent user leakage across folds. Train two models, LogisticRegression(class_weight='balanced') and HistGradientBoostingClassifier, and calibrate the better one with CalibratedClassifierCV on an inner fold. Report cross-validated ROC-AUC and PR-AUC. On a held-out validation fold, choose the smallest probability threshold that achieves precision ≥ 0.50 and report the corresponding recall, F1, and expected alerts per 100,000 users. Describe exactly how you ensure the threshold selection does not leak into cross-validation (e.g., nested CV or a final hold-out).
Quick Answer: This question evaluates end-to-end machine-learning pipeline skills for a Data Scientist role: handling severe class imbalance, grouped cross-validation to prevent user-level leakage, preprocessing, probability calibration, and threshold selection. It primarily tests practical application, with elements of conceptual understanding. Interviewers ask problems like this to assess validation and model-selection practice (ROC-AUC vs. PR-AUC under imbalance), careful grouping or nested validation to avoid leakage, and the ability to reason about calibrated probabilities and operational precision/recall trade-offs when choosing an alert threshold.
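The threshold-selection step in the prompt can be sketched as follows. Here y_val and p_val are synthetic stand-ins for the held-out fold's labels and calibrated probabilities; the 0.50 precision floor comes from the prompt. Selecting the threshold on this separate fold, never on the folds that produced the CV metrics, is what keeps the choice from leaking into cross-validation:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Synthetic hold-out fold: ~1% positives, with positives scored higher
# on average (a proxy for calibrated model probabilities).
rng = np.random.default_rng(1)
n = 20_000
y_val = (rng.random(n) < 0.01).astype(int)
p_val = np.clip(0.01 + 0.30 * y_val + rng.normal(0.0, 0.05, n), 0.0, 1.0)

prec, rec, thr = precision_recall_curve(y_val, p_val)
# prec[i] / rec[i] pair with thr[i]; thresholds are sorted ascending, so the
# first index meeting the precision floor gives the smallest such threshold.
i = np.where(prec[:-1] >= 0.50)[0][0]
t, p, r = thr[i], prec[i], rec[i]
f1 = 2 * p * r / (p + r)
alerts_per_100k = 100_000 * (p_val >= t).mean()
print(f"threshold={t:.3f} recall={r:.2f} F1={f1:.2f} "
      f"alerts/100k={alerts_per_100k:.0f}")
```

In a full answer this fold would also be distinct from the inner folds used by CalibratedClassifierCV, i.e., a nested or three-way split: train, calibrate, then pick the threshold.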