Binary Fraud Classifier: Metrics, Thresholding, Calibration, and Online Evaluation
You inherit a binary fraud classifier used to decide whether to block an event. At one operating threshold, the holdout confusion matrix is:
TP = 200, FP = 800, FN = 100, TN = 99,900
Answer the following:
(a) Compute precision, recall, F1, and false-positive rate (FPR) at this threshold.
(b) Assuming a false negative costs $20 and a false positive costs $0.20 in an environment with 1% fraud prevalence, describe how to choose a threshold using predicted probabilities to maximize expected utility. State which ranking metric (PR-AUC vs. ROC-AUC) better reflects improvements at low prevalence and why.
(c) Outline how you would check and fix probability calibration (e.g., reliability plot, isotonic or Platt scaling) and explain how calibration interacts with threshold selection.
(d) Propose an online evaluation plan with guardrails that avoids over-blocking legitimate users while improving catch rate. Include ideas such as shadow evaluation and a two-stage review for low-confidence positives, and define concrete success criteria and rollback triggers.
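As a starting point for (a) and (b), the arithmetic can be sketched in a few lines of Python. This is a minimal sketch, not a reference solution: it assumes correct decisions (TP and TN) cost nothing, that the $20 and $0.20 figures are per-event costs, and that the model's scores are calibrated probabilities at the deployment prevalence. All variable names are illustrative. Note that the holdout set itself contains 300 fraud cases out of 101,000 events (about 0.3%), which is lower than the 1% prevalence stated in (b), so scores calibrated on this holdout may need adjusting before the threshold below is applied.

```python
# Counts from the holdout confusion matrix in the problem statement.
TP, FP, FN, TN = 200, 800, 100, 99_900

# (a) Metrics at the current operating threshold.
precision = TP / (TP + FP)                           # 200 / 1000, about 0.20
recall = TP / (TP + FN)                              # 200 / 300, about 0.667
f1 = 2 * precision * recall / (precision + recall)   # about 0.308
fpr = FP / (FP + TN)                                 # 800 / 100700, about 0.0079

# Fraud rate observed in the holdout set (about 0.3%, not the 1% in part (b)).
holdout_prevalence = (TP + FN) / (TP + FP + FN + TN)

# (b) Per-event error costs from the problem statement.
# Assumption: true positives and true negatives cost nothing.
COST_FN = 20.00  # missed fraud
COST_FP = 0.20   # legitimate event blocked

# For a calibrated fraud probability p:
#   expected cost of blocking     = (1 - p) * COST_FP
#   expected cost of not blocking = p * COST_FN
# Blocking is cheaper when p > COST_FP / (COST_FP + COST_FN).
threshold = COST_FP / (COST_FP + COST_FN)            # about 0.0099

# Total error cost on the holdout set at the current threshold,
# useful as a baseline when comparing candidate thresholds.
baseline_cost = FP * COST_FP + FN * COST_FN          # about 2160

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f} fpr={fpr:.4f}")
print(f"holdout prevalence={holdout_prevalence:.4f}")
print(f"cost-minimizing threshold={threshold:.4f}")
print(f"baseline cost on holdout=${baseline_cost:.2f}")
```

The threshold of roughly 0.01 follows only from the 100:1 cost ratio, so it is meaningful only if a score of 0.01 really corresponds to about a 1% chance of fraud, which is where the calibration check in (c) comes in. When calibration is in doubt, an alternative is to sweep thresholds on a validation set and pick the one that minimizes the same cost expression empirically.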