Design and sample for credit default prediction

This is a Machine Learning interview question from Boston Consulting Group for Data Scientist roles.

Question

A bank wants a model to predict 90-day credit card default at account-month level for proactive outreach. Class prevalence in production is about 2% defaults. Design the end-to-end approach and address sampling in depth.

a) Problem framing: Define the label precisely (observation window, prediction date, horizon), features available at scoring time, and a temporal data split that avoids leakage (e.g., train on data up to a cutoff date, validate on future months). List three concrete leakage risks unique to credit cards and how you would detect them.
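One way to make the temporal split in part (a) concrete is sketched below. It assumes each account-month row carries a `prediction_date`, and that the 90-day horizon is approximated as 3 monthly snapshots; the function names and the row layout are hypothetical, not part of the question. Rows whose label window is still open at the cutoff are dropped rather than placed in either set.

```python
from datetime import date

HORIZON_MONTHS = 3  # assumption: 90-day horizon approximated as 3 months


def month_index(d: date) -> int:
    """Map a date to a running month counter so months can be compared."""
    return d.year * 12 + (d.month - 1)


def temporal_split(rows, cutoff: date, horizon_months: int = HORIZON_MONTHS):
    """Split account-month rows into (train, valid) by prediction date.

    Train keeps rows whose label window has closed by the cutoff month.
    Valid keeps rows predicted at or after the cutoff month.
    Rows in between are dropped because their labels were not yet
    fully observed at the cutoff.
    """
    cut = month_index(cutoff)
    train, valid = [], []
    for row in rows:
        m = month_index(row["prediction_date"])
        if m + horizon_months <= cut:
            train.append(row)
        elif m >= cut:
            valid.append(row)
    return train, valid


# Hypothetical usage: nine monthly snapshots for one account in 2023.
rows = [{"account": "A", "prediction_date": date(2023, m, 1)} for m in range(1, 10)]
train, valid = temporal_split(rows, cutoff=date(2023, 7, 1))
print(len(train), len(valid))  # 4 3
```

With a July cutoff, January to April go to training, July to September go to validation, and May and June are excluded because their 3-month outcome windows overlap the validation period.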

b) Metrics: Choose evaluation metrics and operating points for imbalanced data (e.g., PR-AUC, recall at 5% FPR, expected utility). Justify why they match business goals under 2% prevalence.
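As an illustration of one operating-point metric named in part (b), the sketch below computes recall at a false-positive-rate cap from raw scores using only the standard library. The helper name `recall_at_fpr` is hypothetical, and the toy data is invented for the example. Tied scores are processed together so that a threshold never splits a tie.

```python
def recall_at_fpr(y_true, scores, max_fpr=0.05):
    """Highest recall reachable by a score threshold with FPR <= max_fpr."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative")

    # Walk thresholds from the highest score downward.
    pairs = sorted(zip(scores, y_true), key=lambda t: -t[0])
    tp = fp = 0
    best_recall = 0.0
    i = 0
    while i < len(pairs):
        s = pairs[i][0]
        # Consume every example that shares this score.
        while i < len(pairs) and pairs[i][0] == s:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        if fp / n_neg > max_fpr:
            break  # FPR only grows as the threshold drops
        best_recall = tp / n_pos
    return best_recall


# Toy data (made up): 3 defaults, 30 non-defaults.
y_true = [1, 1, 1] + [0] * 30
scores = [0.9, 0.5, 0.2] + [0.8, 0.4] + [0.1] * 28
print(round(recall_at_fpr(y_true, scores, max_fpr=0.05), 3))  # 0.667
```

At a 5% FPR cap only one of the 30 negatives may be flagged, so the threshold stops after the second default and recall is 2/3.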

c) Sampling strategy: You downsample negatives to speed training and target a 1:3 positive:negative ratio in the training set. Let π be true prevalence (0.02 in production) and π_s be the training prevalence (0.25 after downsampling). If the model outputs p_s = P(default | x, sampled) = 0.60 for an account, derive and compute the calibrated population probability p_pop = P(default | x) using prior-probability correction under prior shift:

odds_s = p_s / (1 - p_s)
odds_pop = odds_s × [(π / (1 - π)) / (π_s / (1 - π_s))]
p_pop = odds_pop / (1 + odds_pop)

Provide the numeric p_pop and explain when class weights vs. probability recalibration are sufficient or when you need both.
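The three formulas in part (c) can be evaluated directly. The sketch below is a plain transcription of them, with a hypothetical function name `prior_correct`. It assumes pure prior shift, meaning that downsampling changed only the class ratio and not the distribution of features within each class.

```python
def prior_correct(p_s: float, pi_pop: float, pi_s: float) -> float:
    """Map a probability learned at prevalence pi_s to prevalence pi_pop."""
    odds_s = p_s / (1 - p_s)
    # Ratio of population prior odds to sampled prior odds.
    prior_ratio = (pi_pop / (1 - pi_pop)) / (pi_s / (1 - pi_s))
    odds_pop = odds_s * prior_ratio
    return odds_pop / (1 + odds_pop)


p_pop = prior_correct(p_s=0.60, pi_pop=0.02, pi_s=0.25)
print(round(p_pop, 4))  # 0.0841
```

Working it by hand: odds_s = 0.60 / 0.40 = 1.5, the prior ratio is (0.02 / 0.98) / (0.25 / 0.75) = 3/49, so odds_pop = 9/98 and p_pop = 9/107 ≈ 0.0841. If the production rate later moves to 1% as in part (e), the same call with pi_pop=0.01 gives 1/23 ≈ 0.0435 for this score.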

d) Threshold via cost-benefit: Outreach costs $2 per account. If an account would default and you contact them, there is a 30% chance the intervention averts an average $150 loss. Choose the action rule and compute the breakeven probability threshold p* for contacting, then decide whether to contact the account from part (c) using p_pop.
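A minimal sketch of the breakeven calculation in part (d), assuming the only cost is the $2 outreach and the only benefit is the averted loss (no second-order effects such as customer annoyance). The function names are hypothetical. The rule is to contact when the expected averted loss, p × 0.30 × 150, exceeds the $2 cost, which gives p* = 2 / 45.

```python
def breakeven_threshold(cost: float, success_rate: float, loss_averted: float) -> float:
    """Default probability at which expected benefit equals outreach cost."""
    return cost / (success_rate * loss_averted)


def expected_net_benefit(p_default: float, cost: float,
                         success_rate: float, loss_averted: float) -> float:
    """Expected dollars gained per contacted account."""
    return p_default * success_rate * loss_averted - cost


p_star = breakeven_threshold(cost=2.0, success_rate=0.30, loss_averted=150.0)

# p_pop for the part (c) account, from the question's formulas with
# p_s = 0.60, pi = 0.02, pi_s = 0.25: odds_pop = 9/98, so p_pop = 9/107.
p_pop = 9 / 107

contact = p_pop > p_star
net = expected_net_benefit(p_pop, 2.0, 0.30, 150.0)
print(round(p_star, 4), contact, round(net, 2))  # 0.0444 True 1.79
```

Since 0.0841 is above 0.0444, the account from part (c) would be contacted under this rule, with an expected net gain of roughly $1.79.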

e) Validation: Propose a time-based cross-validation scheme and a backtest showing stability under covariate and prior shift across regions and macro regimes. Include how you would monitor calibration and drift post-deployment and re-tune sampling if the true default rate changes from 2% to 1%.
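One possible shape for the time-based cross-validation in part (e) is an expanding training window with a gap before each validation block. The sketch below assumes monthly snapshots in chronological order and a 3-month gap to match the label horizon; the function name and default sizes are illustrative choices, not requirements of the question.

```python
def expanding_window_folds(months, val_size=3, gap=3, min_train=6):
    """Return (train_months, val_months) pairs with an expanding train window.

    months: chronologically ordered list of month identifiers.
    gap: months skipped between the end of training and the start of
         validation, so training labels do not overlap validation.
    """
    folds = []
    start = min_train + gap  # index of the first validation month
    while start + val_size <= len(months):
        train = months[: start - gap]
        val = months[start : start + val_size]
        folds.append((train, val))
        start += val_size
    return folds


# Hypothetical usage: 24 months labelled 1..24.
folds = expanding_window_folds(list(range(1, 25)))
for train, val in folds:
    print(f"train 1..{train[-1]:>2}  validate {val[0]}..{val[-1]}")
```

Each fold can then be scored separately per region or macro regime, which makes instability across folds visible instead of averaging it away.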

f) Fairness and operations: Name two fairness checks (e.g., equal opportunity across age groups where legally permitted) and one operational guardrail to prevent over-throttling credit limits due to model uncertainty. Explain how sampling choices can bias these checks if not corrected.
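To illustrate the last point in part (f), the sketch below shows how a prevalence-dependent metric such as precision is inflated when measured on downsampled data, and how reweighting negatives restores the population view. It assumes all positives were kept and negatives were dropped uniformly at random at a known rate; the function names and toy counts are hypothetical. Metrics computed within a single class, such as true positive rate, are not affected by uniform negative downsampling in expectation, but they can be if the sampling rate differs by group.

```python
def negative_keep_rate(pi_pop: float, pi_s: float) -> float:
    """Fraction of negatives kept to move prevalence from pi_pop to pi_s,
    assuming every positive is kept."""
    return (pi_pop / (1 - pi_pop)) / (pi_s / (1 - pi_s))


def weighted_precision(y_true, y_pred, neg_weight: float = 1.0) -> float:
    """Precision with each sampled negative counted neg_weight times."""
    tp = sum(1 for y, yhat in zip(y_true, y_pred) if y == 1 and yhat == 1)
    fp = sum(1 for y, yhat in zip(y_true, y_pred) if y == 0 and yhat == 1)
    denom = tp + neg_weight * fp
    return tp / denom if denom else float("nan")


keep = negative_keep_rate(pi_pop=0.02, pi_s=0.25)  # 3/49
w = 1 / keep                                        # each negative stands for ~16.3

# Toy sampled group (made up): 30 true positives and 10 false positives flagged.
y_true = [1] * 30 + [0] * 10 + [1] * 5 + [0] * 50
y_pred = [1] * 40 + [0] * 55

print(round(weighted_precision(y_true, y_pred), 3))     # 0.75
print(round(weighted_precision(y_true, y_pred, w), 3))  # 0.155
```

Comparing groups on the uncorrected figure would overstate precision for every group, and would distort the gap between groups if their base rates or sampling rates differ.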
