Build and validate a binary classifier
Company: Capital One
Role: Data Scientist
Category: Machine Learning
Difficulty: hard
Interview Round: HR Screen
Using the features from the previous question (label is is_active_30d with ~1% positives), implement a scikit-learn Pipeline that (a) imputes as specified (via your functions), (b) encodes categoricals with OneHotEncoder(handle_unknown='ignore'), (c) scales numerics with StandardScaler, and (d) trains a classifier robust to imbalance. Use GroupKFold with 5 folds, grouping by user_id, to prevent user leakage across folds. Train two models, LogisticRegression(class_weight='balanced') and HistGradientBoostingClassifier, and calibrate the better one with CalibratedClassifierCV on an inner fold. Report cross-validated ROC-AUC and PR-AUC. On a held-out validation fold, choose the smallest probability threshold that achieves precision ≥ 0.50 and report the corresponding recall, F1, and expected alerts per 100,000 users. Describe exactly how you ensure the threshold selection does not leak into cross-validation (e.g., nested CV or a final hold-out).
Quick Answer: This question evaluates end-to-end machine-learning pipeline skills for a Data Scientist role: handling severe class imbalance, grouped cross-validation to prevent user-level leakage, preprocessing, probability calibration, and threshold selection. It primarily tests practical application, with elements of conceptual understanding. Interviewers ask problems like this to assess validation and model-selection practice (ROC-AUC vs. PR-AUC under imbalance), careful grouping or nested validation to avoid leakage, and the ability to reason about calibrated probabilities and operational precision/recall trade-offs when choosing an alert threshold.
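The threshold-selection step in the prompt can be sketched as follows. Here y_val and p_val are synthetic stand-ins for the held-out fold's labels and calibrated probabilities; the 0.50 precision floor comes from the prompt. Selecting the threshold on this separate fold, never on the folds that produced the CV metrics, is what keeps the choice from leaking into cross-validation:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Synthetic hold-out fold: ~1% positives, with positives scored higher
# on average (a proxy for calibrated model probabilities).
rng = np.random.default_rng(1)
n = 20_000
y_val = (rng.random(n) < 0.01).astype(int)
p_val = np.clip(0.01 + 0.30 * y_val + rng.normal(0.0, 0.05, n), 0.0, 1.0)

prec, rec, thr = precision_recall_curve(y_val, p_val)
# prec[i] / rec[i] pair with thr[i]; thresholds are sorted ascending, so the
# first index meeting the precision floor gives the smallest such threshold.
i = np.where(prec[:-1] >= 0.50)[0][0]
t, p, r = thr[i], prec[i], rec[i]
f1 = 2 * p * r / (p + r)
alerts_per_100k = 100_000 * (p_val >= t).mean()
print(f"threshold={t:.3f} recall={r:.2f} F1={f1:.2f} "
      f"alerts/100k={alerts_per_100k:.0f}")
```

In a full answer this fold would also be distinct from the inner folds used by CalibratedClassifierCV, i.e., a nested or three-way split: train, calibrate, then pick the threshold.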