Tune metrics for imbalanced classification

Q: Tune metrics for imbalanced classification

This question evaluates a data scientist's competency in machine learning for rare-event detection: preprocessing messy data, handling high-cardinality categoricals, designing validation splits, selecting evaluation metrics suited to class imbalance, setting cost-sensitive decision thresholds, and reasoning about operational trade-offs.

Question

Fraud Detection With Rare Positives (0.5%) and Messy Data

You are designing a supervised transaction-level fraud detector. Positives (fraud) are rare at 0.5% of all cases. The dataset has ~10% missing values, heavy-tailed outliers, and high-cardinality categorical features (e.g., merchant_id, device_id).

Answer the following:

a) Preprocessing: Propose a concrete preprocessing pipeline to handle missing values and outliers for both numeric and categorical features. Address high-cardinality categoricals and leakage prevention.
b) Training/Validation Splits: Specify how you would split the data for training/validation/testing. Justify stratified cross-validation (and when to prefer time-aware or group-aware schemes).
c) Evaluation and Costing: Select and justify evaluation metrics in this imbalanced setting. Compare ROC-AUC vs PR-AUC vs F1 at a chosen threshold. Define a cost-sensitive objective and the optimal decision threshold given costs.
d) Class Imbalance: Provide a concrete method to address class imbalance (e.g., calibrated class weights, focal loss, or SMOTE). Explain exactly how to apply it without leakage.
e) Business Trade-offs: Give a real-world example where false positives are costlier than false negatives (or vice versa), and explain how that changes thresholding and monitoring in production.
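
As a starting point for part (c), here is a minimal Python sketch of one common way to define a cost-sensitive objective and derive a decision threshold from it. It is not the page's reference solution. It assumes the model outputs calibrated fraud probabilities, that correct decisions cost nothing, and that each false positive and false negative has a fixed cost. The cost values, labels, scores, and function names are all hypothetical and chosen only for illustration.

```python
# Sketch for part (c): cost-sensitive thresholding on calibrated scores.
# All costs and data below are hypothetical, chosen only for illustration.

def optimal_threshold(cost_fp, cost_fn):
    """Cutoff t* = C_FP / (C_FP + C_FN).

    Flagging is cheaper in expectation when p * cost_fn > (1 - p) * cost_fp,
    assuming correct decisions cost nothing and p is a calibrated probability.
    """
    if cost_fp <= 0 or cost_fn <= 0:
        raise ValueError("costs must be positive")
    return cost_fp / (cost_fp + cost_fn)


def confusion_counts(y_true, scores, threshold):
    """Return (tp, fp, fn, tn) when flagging every score >= threshold."""
    tp = fp = fn = tn = 0
    for label, score in zip(y_true, scores):
        flagged = score >= threshold
        if flagged and label == 1:
            tp += 1
        elif flagged:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def total_cost(y_true, scores, threshold, cost_fp, cost_fn):
    """Total misclassification cost at a given threshold."""
    _, fp, fn, _ = confusion_counts(y_true, scores, threshold)
    return fp * cost_fp + fn * cost_fn


COST_FP = 5.0    # hypothetical cost of reviewing a legitimate transaction
COST_FN = 495.0  # hypothetical loss from a missed fraud

# Toy labels (1 = fraud) and calibrated scores, for illustration only.
y_true = [0, 0, 0, 1, 0, 1]
scores = [0.001, 0.02, 0.004, 0.30, 0.008, 0.005]

t_star = optimal_threshold(COST_FP, COST_FN)
for t in (t_star, 0.5):
    print(f"threshold={t:.3f} counts={confusion_counts(y_true, scores, t)} "
          f"cost={total_cost(y_true, scores, t, COST_FP, COST_FN):.0f}")
```

With asymmetric costs like these, the cost-derived cutoff sits far below the default 0.5, which is why F1 at an arbitrary threshold can be misleading for rare-event problems. The cutoff formula only holds if the scores approximate true probabilities. If class weights or resampling were used in training, the scores would likely need recalibration on untouched validation data first.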
