PracHub

Build and validate a binary classifier

Last updated: Mar 29, 2026

Quick Overview

This question evaluates end-to-end machine learning pipeline skills: handling severe class imbalance, grouped cross-validation to prevent user-level leakage, preprocessing, model calibration, and probability-threshold selection. It falls in the Machine Learning domain for a Data Scientist role and primarily tests practical application, with elements of conceptual understanding. Such problems are commonly asked to assess validation and model-selection practice with metrics like PR-AUC and ROC-AUC, the use of careful grouping or nested validation to avoid leakage, and the ability to reason about calibrated probabilities and operational precision/recall trade-offs when choosing a threshold.

  • hard
  • Capital One
  • Machine Learning
  • Data Scientist

Build and validate a binary classifier

Company: Capital One

Role: Data Scientist

Category: Machine Learning

Difficulty: hard

Interview Round: HR Screen

Using the features from the previous question (label is is_active_30d with ~1% positives), implement a scikit‑learn Pipeline that: (a) imputes as specified (via your functions), (b) encodes categoricals (OneHotEncoder(handle_unknown='ignore')), scales numerics (StandardScaler), and (c) trains a classifier robust to imbalance. Use GroupKFold with 5 folds, grouping by user_id to prevent user leakage across folds. Train two models: LogisticRegression(class_weight='balanced') and HistGradientBoostingClassifier; calibrate the better one with CalibratedClassifierCV on an inner fold. Report cross‑validated ROC‑AUC and PR‑AUC. On a held‑out validation fold, choose the smallest probability threshold that achieves precision ≥ 0.50 and report the corresponding recall, F1, and expected alerts per 100,000 users. Describe exactly how you ensure the threshold selection does not leak into cross‑validation (e.g., nested CV or final hold‑out).


Related Interview Questions

  • Deep-dive XGBoost handling and overfitting - Capital One (medium)
  • Build House Price Model Responsibly - Capital One (easy)
  • Design robber detection from surveillance video - Capital One (easy)
  • How would you design delay and watchlist models? - Capital One (medium)
  • Explain core ML concepts and lifecycle - Capital One (medium)
Posted: Oct 13, 2025, 9:49 PM

ML Pipeline with Grouped CV, Imbalance Handling, Calibration, and Thresholding

Context: You have a labeled dataset where the target is is_active_30d (~1% positives). Each row belongs to a user (user_id). You must avoid user leakage across folds.

Tasks:

  1. Build a scikit-learn Pipeline that:
    • (a) Imputes missing values using your previously defined functions for numeric and categorical features.
    • (b) Encodes categoricals with OneHotEncoder(handle_unknown='ignore') and scales numerics with StandardScaler.
    • (c) Trains classifiers robust to class imbalance.
  2. Use GroupKFold with 5 folds, grouping by user_id, to prevent user leakage across folds.
  3. Train two models:
    • LogisticRegression(class_weight='balanced')
    • HistGradientBoostingClassifier
  4. Select the better model by cross-validated PR-AUC (report both ROC-AUC and PR-AUC for completeness). Calibrate the better model with CalibratedClassifierCV on an inner fold.
  5. On a held-out validation fold (by user_id), choose the smallest probability threshold that achieves precision ≥ 0.50 and report the corresponding recall, F1, and expected alerts per 100,000 users.
  6. Describe exactly how you ensure the threshold selection does not leak into cross-validation (e.g., nested CV or final hold-out).

Assumptions to make explicit:

  • X is a pandas DataFrame that includes a user_id column and feature columns; y is a binary pandas Series for is_active_30d.
  • You have (or will define) minimal imputation functions for numeric and categorical columns; replace with your actual prior functions if they differ.
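Under those assumptions, a minimal grouped-CV scoring loop might look like the sketch below. `X`, `y`, and `groups` stand in for your data, and `average_precision_score` serves as the standard estimator of PR-AUC:

```python
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.metrics import roc_auc_score, average_precision_score

def grouped_cv_scores(model, X, y, groups, n_splits=5):
    """Mean ROC-AUC and PR-AUC over user-disjoint folds.

    GroupKFold keyed on user_id guarantees that no user's rows appear
    in both the train and test side of any fold.
    """
    roc_aucs, pr_aucs = [], []
    gkf = GroupKFold(n_splits=n_splits)
    for train_idx, test_idx in gkf.split(X, y, groups=groups):
        model.fit(X.iloc[train_idx], y.iloc[train_idx])
        proba = model.predict_proba(X.iloc[test_idx])[:, 1]
        roc_aucs.append(roc_auc_score(y.iloc[test_idx], proba))
        # Average precision is the usual PR-AUC estimate.
        pr_aucs.append(average_precision_score(y.iloc[test_idx], proba))
    return np.mean(roc_aucs), np.mean(pr_aucs)
```

With ~1% positives, a plain `GroupKFold` fold can end up with no positives at all (which breaks ROC-AUC); `StratifiedGroupKFold` preserves the grouping while keeping the positive rate roughly even across folds.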

Solution
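One way to calibrate the winning model while keeping the inner folds user-disjoint is to pass precomputed `GroupKFold` splits to `CalibratedClassifierCV`, which accepts any iterable of (train, test) index arrays as its `cv` argument. This is a sketch: `best_model` is whichever pipeline won on cross-validated PR-AUC, and `method="sigmoid"` is chosen on the assumption that ~1% prevalence leaves too few positives per fold for isotonic calibration:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import GroupKFold

def calibrate_grouped(best_model, X_train, y_train, groups_train, n_splits=3):
    """Wrap the winning model in CalibratedClassifierCV with inner
    folds that are user-disjoint rather than row-random."""
    # Materialize the grouped splits so calibration folds also respect user_id.
    inner_cv = list(GroupKFold(n_splits=n_splits)
                    .split(X_train, y_train, groups=groups_train))
    calibrated = CalibratedClassifierCV(best_model, method="sigmoid", cv=inner_cv)
    calibrated.fit(X_train, y_train)
    return calibrated
```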

