Build and evaluate imbalanced binary classifier
Company: Boston Consulting Group
Role: Data Scientist
Category: Machine Learning
Difficulty: medium
Interview Round: Take-home Project
You are given a binary classification dataset with severe class imbalance (positive rate ≈ 1%). Each row has: id, event_date (YYYY-MM-DD), categorical: region ∈ {NA, EU, APAC, LATAM, MEA}, and numerical features f1…f50. Labels are y ∈ {0,1}.
Tasks:
a) Build a reproducible training pipeline that:
- splits temporally into train (≤2025-06-01), validation (2025-06-02–2025-08-01), and test (2025-08-02–2025-09-01);
- applies standardization to numeric features and one-hot encoding to region;
- handles imbalance inside CV folds (e.g., class_weight='balanced', or SMOTE within each fold without leaking validation data);
- trains a strong baseline (e.g., calibrated logistic regression or gradient boosting) and outputs well-calibrated probabilities (Platt or isotonic on validation).
b) Report ROC-AUC and PR-AUC on the test split; also report recall at 5% FPR and the decision threshold that maximizes F1 under the constraint recall ≥ 0.90.
c) Describe how you would choose the operating point for a production system with a hard requirement of at most 2 false positives per 1,000 predictions.
d) Discuss how calibration might drift over time and one technique to monitor and re-calibrate without label leakage.
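One possible shape for task a) is sketched below, assuming pandas and scikit-learn are available. The synthetic data generator, the helper names (make_synthetic, temporal_split, build_pipeline), and the choice of logistic regression with isotonic calibration are illustrative assumptions, not a prescribed solution. Preprocessing lives inside the Pipeline, so any cross-validation run on it refits the scaler and encoder per fold, and class_weight='balanced' handles imbalance without resampling.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

SEED = 0
NUM_COLS = [f"f{i}" for i in range(1, 51)]
X_COLS = NUM_COLS + ["region"]

def make_synthetic(n=20_000, seed=SEED):
    """Hypothetical stand-in for the real dataset (roughly 1% positives)."""
    rng = np.random.default_rng(seed)
    # Day offsets 0..243 cover 2025-01-01 through 2025-09-01.
    offsets = pd.to_timedelta(rng.integers(0, 244, n), unit="D")
    df = pd.DataFrame(rng.normal(size=(n, 50)), columns=NUM_COLS)
    df.insert(0, "event_date", pd.Timestamp("2025-01-01") + offsets)
    # Note: when loading real data with pandas.read_csv, the string "NA" is
    # parsed as missing by default; keep_default_na=False preserves the region.
    df.insert(1, "region", rng.choice(["NA", "EU", "APAC", "LATAM", "MEA"], n))
    logit = -5.5 + 1.5 * df["f1"]
    df["y"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)
    return df

def temporal_split(df):
    """Split by event_date using the boundaries given in the task."""
    d = df["event_date"]
    train = df[d <= pd.Timestamp("2025-06-01")]
    valid = df[(d >= pd.Timestamp("2025-06-02")) & (d <= pd.Timestamp("2025-08-01"))]
    test = df[(d >= pd.Timestamp("2025-08-02")) & (d <= pd.Timestamp("2025-09-01"))]
    return train, valid, test

def build_pipeline():
    """Standardize numerics, one-hot encode region, fit a class-weighted model."""
    pre = ColumnTransformer([
        ("num", StandardScaler(), NUM_COLS),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["region"]),
    ])
    clf = LogisticRegression(class_weight="balanced", max_iter=1000, random_state=SEED)
    return Pipeline([("pre", pre), ("clf", clf)])

df = make_synthetic()
train, valid, test = temporal_split(df)

# Fit on the training window only.
model = build_pipeline().fit(train[X_COLS], train["y"])

# Class weighting distorts raw probabilities, so map them back with an
# isotonic calibrator fitted on the later validation window.
calibrator = IsotonicRegression(out_of_bounds="clip", y_min=0.0, y_max=1.0)
calibrator.fit(model.predict_proba(valid[X_COLS])[:, 1], valid["y"])

# Calibrated probabilities for the held-out test window.
p_test = calibrator.predict(model.predict_proba(test[X_COLS])[:, 1])
```

Fitting the calibrator on validation rather than training data keeps the calibration map honest about out-of-time behavior; the test window is touched only for final reporting.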
Quick Answer: This question evaluates a data scientist's skills in building reproducible machine learning pipelines for imbalanced binary classification, covering temporal splitting to avoid leakage, class imbalance handling, feature preprocessing, probability calibration, threshold selection, and monitoring for calibration drift.
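The threshold selection in tasks b) and c) can be sketched as follows, assuming NumPy and scikit-learn. The scores here are synthetic, and the function names (recall_at_fpr, best_f1_threshold, threshold_for_fp_budget) are hypothetical helpers. Note that "2 false positives per 1,000 predictions" is read as FP divided by all predictions, not FP divided by negatives; with about 1% positives the two are close but not identical.

```python
import numpy as np
from sklearn.metrics import (average_precision_score, precision_recall_curve,
                             roc_auc_score, roc_curve)

def recall_at_fpr(y, p, max_fpr=0.05):
    """Highest recall (TPR) achievable while keeping FPR <= max_fpr."""
    fpr, tpr, _ = roc_curve(y, p)
    return float(tpr[fpr <= max_fpr].max())

def best_f1_threshold(y, p, min_recall=0.90):
    """Threshold t (predict positive when p >= t) maximizing F1 with recall >= min_recall."""
    prec, rec, thr = precision_recall_curve(y, p)
    prec, rec = prec[:-1], rec[:-1]  # the final curve point has no threshold
    f1 = 2 * prec * rec / np.maximum(prec + rec, 1e-12)
    f1[rec < min_recall] = -1.0      # rule out thresholds below the recall floor
    i = int(np.argmax(f1))
    return float(thr[i]), float(f1[i])

def threshold_for_fp_budget(y, p, max_fp_per_1000=2.0):
    """Threshold t (predict positive when p > t) keeping FP / n_predictions within budget."""
    neg_desc = np.sort(p[y == 0])[::-1]
    k = int(np.floor(max_fp_per_1000 / 1000.0 * len(y)))  # allowed false positives
    if k >= len(neg_desc):
        return float("-inf")
    # At most k negatives score strictly above the (k+1)-th largest negative.
    return float(neg_desc[k])

# Synthetic scores with about 1% positives (illustrative only).
rng = np.random.default_rng(0)
y = (rng.random(50_000) < 0.01).astype(int)
p = 1 / (1 + np.exp(-(rng.normal(size=y.size) + 3.0 * y - 3.0)))

roc_auc = roc_auc_score(y, p)
pr_auc = average_precision_score(y, p)
rec_5 = recall_at_fpr(y, p)
t_f1, f1_val = best_f1_threshold(y, p)
t_fp = threshold_for_fp_budget(y, p)
fp_at_budget = int(((p > t_fp) & (y == 0)).sum())
```

In practice the operating threshold would be chosen on the validation window and then checked on the test window, likely with a safety margin below the budget, since the false-positive rate estimated from a finite sample carries its own uncertainty.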