Design a model for imbalanced conversions
Company: Microsoft
Role: Data Scientist
Category: Machine Learning
Difficulty: hard
Interview Round: Technical Screen
You ran a marketing campaign that reached 10,000 customers, of whom 500 purchased (a 5% positive class). Design an end-to-end approach to identify which customers are most likely to purchase. Requirements:
- Start with a logistic regression baseline; detail your feature engineering (handling categoricals, scaling, interactions), data splitting, and prevention of leakage.
- Address class imbalance: compare class_weight, random over/under-sampling, and SMOTE; specify which metric you’ll optimize and why (e.g., PR-AUC, recall at fixed precision, cost-sensitive loss).
- Describe threshold selection for ranking vs. classification, probability calibration (Platt/isotonic), and how you would choose the top-N customers to target under a fixed budget.
- Explain how you would validate with stratified cross-validation, report confidence intervals, and monitor post-deployment drift and lift.
- List at least two feature selection methods (e.g., L1 penalty, mutual information, recursive feature elimination) and explain how you’d guard against overfitting while keeping the model interpretable.
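As a rough illustration of the baseline these requirements describe, the sketch below wires scaling and one-hot encoding into a single scikit-learn pipeline so that preprocessing is refit inside each stratified fold, which is one common way to avoid leakage. The data is synthetic, and the column names (`recency_days`, `past_orders`, `channel`) and coefficients are assumptions made up for the example, not part of the question. It uses `class_weight="balanced"` as one of the imbalance options and scores with average precision as a PR-AUC estimate.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the campaign data (hypothetical columns, roughly 5% positives).
rng = np.random.default_rng(0)
n = 10_000
X = pd.DataFrame({
    "recency_days": rng.integers(1, 365, n),
    "past_orders": rng.poisson(2.0, n),
    "channel": rng.choice(["email", "sms", "push"], n),
})
logit = -3.0 + 0.35 * X["past_orders"] - 0.004 * X["recency_days"]
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Preprocessing lives inside the pipeline, so scalers and encoders are
# fit on the training part of each fold only.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["recency_days", "past_orders"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["channel"]),
])
model = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# Stratified folds keep the positive rate similar across splits.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
pr_auc = cross_val_score(model, X, y, cv=cv, scoring="average_precision")
print(f"PR-AUC per fold: {np.round(pr_auc, 3)}, mean={pr_auc.mean():.3f}")
```

A useful sanity check is to compare the mean average precision with the positive rate `y.mean()`, since a random ranking scores about that value. Resampling methods such as SMOTE would need to sit inside the cross-validation loop as well, for the same leakage reason.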
Quick Answer: This question evaluates a data scientist's ability to design and validate end-to-end predictive models for imbalanced binary outcomes, encompassing feature engineering, class imbalance handling, probability calibration, thresholding for budgeted targeting, validation, monitoring, and interpretability.
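The budgeted-targeting part can be sketched in a few lines: rank customers by predicted probability, contact as many as the budget allows, and report lift against the overall conversion rate. The helper names `select_top_n` and `lift_at_n`, the flat cost per contact, and the toy scores below are all assumptions made for illustration.

```python
def select_top_n(scores, budget, cost_per_contact):
    """Return indices of the highest-scoring customers that fit the budget."""
    n = int(budget // cost_per_contact)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]


def lift_at_n(selected, labels):
    """Conversion rate among selected customers divided by the overall rate."""
    base_rate = sum(labels) / len(labels)
    hit_rate = sum(labels[i] for i in selected) / len(selected)
    return hit_rate / base_rate


# Toy predicted probabilities and observed outcomes (made up for the example).
scores = [0.02, 0.40, 0.07, 0.31, 0.01, 0.12, 0.03, 0.25, 0.05, 0.09]
labels = [0, 1, 0, 1, 0, 0, 0, 0, 0, 0]

chosen = select_top_n(scores, budget=6.0, cost_per_contact=2.0)
lift = lift_at_n(chosen, labels)
print(chosen, round(lift, 2))
```

Because only the ordering of scores matters here, top-N selection works even with uncalibrated probabilities. Calibration becomes important once the decision depends on expected value per contact, for example when contact costs or order values differ across customers.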