Fraud Detection With Rare Positives (0.5%) and Messy Data
You are designing a supervised, transaction-level fraud detector. Fraud (the positive class) is rare, at 0.5% of all transactions. The dataset has about 10% missing values, heavy-tailed outliers, and high-cardinality categorical features (e.g., merchant_id, device_id).
Answer the following:
- Preprocessing
a) Propose a concrete preprocessing pipeline to handle missing values and outliers for both numeric and categorical features. Address high-cardinality categoricals and leakage prevention.
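One way to approach part (a), offered as a minimal sketch rather than a reference answer: learn every preprocessing statistic (median, clipping bounds, category frequencies) from the training rows only, then apply the frozen values to validation and test rows. The helper names (`fit_numeric`, `transform_numeric`, `fit_frequency`, `transform_frequency`), the 1st/99th percentile bounds, and the `min_count` cutoff are illustrative assumptions, not values taken from the problem statement.

```python
from collections import Counter
from statistics import median, quantiles


def fit_numeric(train_values, lower_pct=1, upper_pct=99):
    """Learn median and clipping bounds from TRAINING values only (None = missing)."""
    observed = [v for v in train_values if v is not None]
    cuts = quantiles(observed, n=100, method="inclusive")  # 99 percentile cut points
    return {
        "median": median(observed),
        "lo": cuts[lower_pct - 1],
        "hi": cuts[upper_pct - 1],
    }


def transform_numeric(values, params):
    """Return (clipped_value, was_missing) pairs using frozen training parameters."""
    out = []
    for v in values:
        missing = v is None
        x = params["median"] if missing else v
        x = min(max(x, params["lo"]), params["hi"])  # clip heavy tails to the bounds
        out.append((x, int(missing)))
    return out


def fit_frequency(train_categories, min_count=2):
    """Count categories on TRAINING rows; categories seen fewer than min_count times are dropped."""
    counts = Counter(c for c in train_categories if c is not None)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items() if n >= min_count}


def transform_frequency(categories, freq_map):
    """Map each category to its training frequency; unseen, rare, or missing -> 0.0."""
    return [freq_map.get(c, 0.0) for c in categories]


# Hypothetical toy data: fit on training rows, then reuse the parameters elsewhere.
amount_params = fit_numeric([10.0, 12.0, None, 11.0, 9.0, 5000.0])
merchant_freq = fit_frequency(["m1", "m1", "m2", "m1", "m3", "m2"])
test_amounts = transform_numeric([None, 1e6, -5.0], amount_params)
test_merchants = transform_frequency(["m1", "m3", "m9", None], merchant_freq)
```

Frequency encoding uses no labels, so fitting it on training rows is enough to avoid leakage. An encoding that does use the label (for example, a per-merchant fraud rate) would additionally need to be computed out-of-fold so that a row never contributes to its own encoded value.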
- Training/Validation Splits
b) Specify how you would split the data for training, validation, and testing. Justify stratified cross-validation, and explain when to prefer time-aware or group-aware schemes.
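For part (b), the two splitting ideas can be sketched in a few lines. This is an illustration under stated assumptions, not a prescribed answer: `stratified_folds` and `time_ordered_split` are hypothetical helper names, and the 70/15/15 fractions and `k=5` are arbitrary defaults. Stratification matters here because, at 0.5% prevalence, a purely random fold can end up with very few positives.

```python
import random


def stratified_folds(labels, k=5, seed=0):
    """Assign each row a fold id so every fold gets a near-equal share of each class."""
    rng = random.Random(seed)
    fold_of = [None] * len(labels)
    for cls in set(labels):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            fold_of[i] = pos % k  # deal rows of this class round-robin across folds
    return fold_of


def time_ordered_split(timestamps, train_frac=0.7, valid_frac=0.15):
    """Split row indices chronologically: oldest -> train, newest -> test."""
    order = sorted(range(len(timestamps)), key=lambda i: timestamps[i])
    n_train = int(len(order) * train_frac)
    n_valid = int(len(order) * valid_frac)
    return (order[:n_train],
            order[n_train:n_train + n_valid],
            order[n_train + n_valid:])


# Hypothetical toy labels: 10 frauds in 2,000 rows (0.5% positive).
labels = [1] * 10 + [0] * 1990
folds = stratified_folds(labels, k=5)
positives_per_fold = [sum(1 for y, f in zip(labels, folds) if f == g and y == 1)
                      for g in range(5)]
```

A chronological split is usually the safer choice when fraud patterns drift over time, because it evaluates the model on transactions that are newer than anything it trained on. A group-aware variant would keep all rows sharing a key such as `device_id` in the same fold, so that the model is not scored on entities it has already seen.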
- Evaluation and Costing
c) Select and justify evaluation metrics in this imbalanced setting. Compare ROC-AUC vs PR-AUC vs F1 at a chosen threshold. Define a cost-sensitive objective and derive the optimal decision threshold given the costs of false positives and false negatives.
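For part (c), a common formulation (assumed here, not given in the problem) charges `cost_fp` per false positive and `cost_fn` per false negative, with correct decisions costing nothing. If the model outputs a calibrated probability p, flagging is cheaper in expectation whenever p * cost_fn >= (1 - p) * cost_fp, which rearranges to p >= cost_fp / (cost_fp + cost_fn). The sketch below encodes that threshold alongside the thresholded metrics; all function names and the example costs are hypothetical.

```python
def bayes_threshold(cost_fp, cost_fn):
    """Flag as fraud when p >= cost_fp / (cost_fp + cost_fn), assuming calibrated p."""
    return cost_fp / (cost_fp + cost_fn)


def confusion_counts(y_true, scores, threshold):
    """Count TP, FP, FN, TN when rows with score >= threshold are flagged."""
    tp = fp = fn = tn = 0
    for y, s in zip(y_true, scores):
        flagged = s >= threshold
        if flagged and y == 1:
            tp += 1
        elif flagged:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def total_cost(y_true, scores, threshold, cost_fp, cost_fn):
    """Cost-sensitive objective: cost_fp * FP + cost_fn * FN (correct calls cost 0)."""
    _, fp, fn, _ = confusion_counts(y_true, scores, threshold)
    return cost_fp * fp + cost_fn * fn


def precision_recall_f1(y_true, scores, threshold):
    """Precision, recall, and F1 at one threshold; 0.0 when a ratio is undefined."""
    tp, fp, fn, _ = confusion_counts(y_true, scores, threshold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical costs: a missed fraud is assumed 99x as costly as a false alert.
y_true = [1, 0, 0, 1, 0]
scores = [0.9, 0.2, 0.6, 0.4, 0.1]
t_star = bayes_threshold(cost_fp=1, cost_fn=99)
cost_at_half = total_cost(y_true, scores, 0.5, cost_fp=1, cost_fn=99)
cost_at_low = total_cost(y_true, scores, 0.3, cost_fp=1, cost_fn=99)
```

On the metric comparison: out of 100,000 transactions at 0.5% prevalence, 500 are fraud and 99,500 are legitimate. A false positive rate of just 1% yields 995 false alerts, so even with perfect recall precision is 500 / 1,495, or about 33%. ROC-AUC barely registers that, whereas PR-AUC and thresholded precision expose it directly.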
- Class Imbalance
d) Provide a concrete method to address class imbalance (e.g., calibrated class weights, focal loss, or SMOTE). Explain exactly how to apply it without leakage.
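For part (d), class weighting is one option that avoids resampling altogether. The sketch below assumes the inverse-frequency heuristic weight_c = n / (n_classes * count_c), which is the same formula scikit-learn documents for `class_weight="balanced"`, and pairs it with a prior correction that maps the reweighted model's probabilities back toward the true base rate. The function names are hypothetical, and the correction is exact only to the extent that reweighting shifts the model's odds by the weight ratio.

```python
from collections import Counter


def balanced_class_weights(train_labels):
    """weight_c = n / (n_classes * count_c), computed from TRAINING labels only."""
    counts = Counter(train_labels)
    n = len(train_labels)
    return {cls: n / (len(counts) * cnt) for cls, cnt in counts.items()}


def correct_probability(p_weighted, weights):
    """Undo the odds inflation caused by upweighting class 1 relative to class 0."""
    ratio = weights[1] / weights[0]
    return p_weighted / (p_weighted + ratio * (1.0 - p_weighted))


# Hypothetical training fold: 5 frauds in 1,000 rows (0.5% positive).
train_labels = [1] * 5 + [0] * 995
weights = balanced_class_weights(train_labels)
p_corrected = correct_probability(0.5, weights)
```

To stay leakage-free, the weights would be recomputed from the training portion of each cross-validation fold and passed to the learner as per-row weights, while validation and test rows keep their original class ratio. The same rule applies if SMOTE is chosen instead: synthesize samples inside the training fold only, after the split, and never touch the rows used for scoring.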
- Business Trade-offs
e) Give a real-world example where false positives are costlier than false negatives (or vice versa), and explain how that changes thresholding and monitoring in production.