This question evaluates a candidate's competency in handling highly imbalanced binary classification problems, including data splitting and leakage prevention, imbalance mitigation techniques, appropriate metric selection and threshold calibration, algorithm selection for scalability, robust validation, and deployment monitoring.
You must build a binary classifier for fraud with a 0.2% positive rate and 10M rows × 500 features. Propose an end-to-end plan that covers the seven points below; illustrative sketches for each point follow the list.

1) Data splitting with stratification and leakage prevention.
2) Handling imbalance: class weights vs. focal loss, down-/over-sampling, SMOTE variants, and when to use each.
3) Appropriate metrics and why (PR curve, AUPRC, recall at fixed precision, cost-sensitive metrics), and why ROC-AUC is misleading at this base rate.
4) Threshold setting using cost matrices and calibration (Platt/isotonic), and how you'd tune the threshold post-deployment.
5) Algorithm choices and justification: a baseline logistic regression with class_weight, tree ensembles with balanced subsampling, and an anomaly-detection fallback.
6) Robust validation (time-based CV, group CV), data-drift monitoring, and rejection rules for extreme edge cases.
7) Brief pseudocode for a training/evaluation loop that scales to this dataset.
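For point 1, a minimal sketch of a chronological split with a group-leakage guard, assuming hypothetical columns `event_time`, `account_id`, and `is_fraud`:

```python
import pandas as pd
from sklearn.model_selection import GroupKFold

def temporal_split(df: pd.DataFrame, train_frac: float = 0.8):
    """Chronological split: train on the past, evaluate on the future."""
    df = df.sort_values("event_time")
    cut = int(len(df) * train_frac)
    train, test = df.iloc[:cut], df.iloc[cut:]
    # Sanity check: at a 0.2% positive rate, confirm both sides still
    # contain enough fraud cases to estimate metrics reliably.
    assert train["is_fraud"].sum() > 0 and test["is_fraud"].sum() > 0
    return train, test

# Group-aware folds prevent entity leakage: the same account_id never
# appears in both a training and a validation fold.
gkf = GroupKFold(n_splits=5)
# for tr_idx, va_idx in gkf.split(X, y, groups=df["account_id"]): ...
```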
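For point 2, a sketch contrasting reweighting with resampling, assuming scikit-learn and imbalanced-learn are available; `X_train`/`y_train` are hypothetical arrays. The leakage rule to enforce: SMOTE runs inside the training fold only.

```python
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE

# Option A: reweight rather than resample -- cheapest at 10M rows,
# since no synthetic data is generated.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)

# Option B: SMOTE, applied ONLY to the training fold; oversampling
# before the split leaks synthetic neighbours into validation.
# At this scale SMOTE is usually run after heavy negative down-sampling.
smote = SMOTE(sampling_strategy=0.1)  # minority up to 10% of majority
# X_res, y_res = smote.fit_resample(X_train, y_train)
```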
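For point 3, one way to compute the PR-based quantities the question names, using scikit-learn's `precision_recall_curve` and `average_precision_score`:

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

def recall_at_precision(y_true, y_score, min_precision=0.90):
    """Best recall achievable while holding precision >= min_precision."""
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    ok = precision >= min_precision
    return float(recall[ok].max()) if ok.any() else 0.0

# AUPRC's baseline equals the positive rate (0.002 here), so it exposes
# imbalance; ROC-AUC's 0.5 baseline can look deceptively strong because
# the huge negative class dominates the false-positive rate.
# auprc = average_precision_score(y_true, y_score)
```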
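For point 4, a sketch of isotonic calibration followed by a cost-matrix threshold search; the 100:1 cost ratio and the `X_tr`/`y_tr` arrays are hypothetical placeholders.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression

COST_FN, COST_FP = 100.0, 1.0  # hypothetical: one miss costs 100 false alarms

# Isotonic calibration needs plenty of positives; ~20k frauds in 10M rows
# is enough, otherwise fall back to Platt scaling (method="sigmoid").
base = LogisticRegression(class_weight="balanced", max_iter=1000)
calibrated = CalibratedClassifierCV(base, method="isotonic", cv=3)
# calibrated.fit(X_tr, y_tr)

def cost_optimal_threshold(y_true, y_prob):
    """Grid-search the threshold minimising expected cost on held-out data.
    With well-calibrated probabilities it should land near the Bayes
    threshold COST_FP / (COST_FP + COST_FN). Inputs: numpy arrays."""
    grid = np.linspace(0.0, 1.0, 1001)
    costs = [COST_FN * ((y_prob < t) & (y_true == 1)).sum()
             + COST_FP * ((y_prob >= t) & (y_true == 0)).sum() for t in grid]
    return grid[int(np.argmin(costs))]
```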
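For point 5, one plausible main model, assuming LightGBM is available; the hyperparameters below are illustrative, not tuned.

```python
import lightgbm as lgb

# Histogram-based gradient boosting scales to 10M x 500; the positive
# class weight ~ n_neg / n_pos offsets the 0.2% base rate, and per-tree
# row/feature subsampling gives the "balanced subsampling" effect.
model = lgb.LGBMClassifier(
    n_estimators=2000,
    learning_rate=0.05,
    scale_pos_weight=499.0,   # ~ (1 - 0.002) / 0.002
    subsample=0.8,            # row subsampling per tree
    colsample_bytree=0.5,     # feature subsampling over 500 columns
)
# model.fit(X_tr, y_tr, eval_set=[(X_va, y_va)],
#           callbacks=[lgb.early_stopping(50)])
```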
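For point 6, a common drift monitor is the Population Stability Index, PSI = sum_i (p_i - q_i) * ln(p_i / q_i) over score bins; a minimal sketch:

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """PSI between the training-time and live score distributions; a
    common rule of thumb flags drift above ~0.2."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf   # catch out-of-range live scores
    p = np.histogram(expected, edges)[0] / len(expected)
    q = np.histogram(actual, edges)[0] / len(actual)
    p, q = np.clip(p, 1e-6, None), np.clip(q, 1e-6, None)  # avoid log(0)
    return float(np.sum((p - q) * np.log(p / q)))
```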
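For point 7, one scalable pattern is an out-of-core pass with `partial_fit`, sketched below using scikit-learn's SGDClassifier with logistic loss; the file names, column names, and class weights are hypothetical, and features are assumed pre-standardized, which SGD requires.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import average_precision_score

FEATURES = [f"f{i}" for i in range(500)]   # hypothetical column names
clf = SGDClassifier(loss="log_loss",       # logistic loss
                    class_weight={0: 1.0, 1: 500.0})  # ~inverse prevalence;
                    # "balanced" is not supported with partial_fit
classes = np.array([0, 1])

# Training: stream 10M rows in chunks so memory stays bounded.
for chunk in pd.read_csv("train.csv", chunksize=500_000):
    clf.partial_fit(chunk[FEATURES].to_numpy(),
                    chunk["is_fraud"].to_numpy(), classes=classes)

# Evaluation on the later-in-time holdout, scored with a PR-based metric.
scores, labels = [], []
for chunk in pd.read_csv("test.csv", chunksize=500_000):
    scores.append(clf.decision_function(chunk[FEATURES].to_numpy()))
    labels.append(chunk["is_fraud"].to_numpy())
print("AUPRC:", average_precision_score(np.concatenate(labels),
                                        np.concatenate(scores)))
```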