Classifying Fake Accounts: Metrics, Capacity, Thresholding, and Validation
Context
- Population: 10,000,000 daily active users (DAU)
- True fake rate (prevalence): ≈ 1% ⇒ ~100,000 fakes/day
- Human review capacity: 50,000 accounts/day
- Two candidate models at chosen thresholds τA and τB (from validation):
  - Model A: precision = 0.60, recall = 0.20
  - Model B: precision = 0.20, recall = 0.80
You will propose evaluation metrics and thresholding strategies under these operational constraints and perform the requested calculations; illustrative computational sketches follow the task list.
Tasks
- Offline metrics selection: Explain when to prefer PR-AUC over ROC-AUC. Specify primary offline metrics, including precision@K, recall@K, PR-AUC, calibrated Brier score, and cost-weighted utility. Justify these given severe class imbalance and limited review capacity.
- Capacity feasibility: For each model at its given threshold, compute expected true positives (TP) and false positives (FP) per day if applied to the full population. State whether each fits within the 50,000/day review capacity. If not, explain how to set K (top-K review) or raise the threshold to meet capacity while maximizing expected true positives.
- Business trade-offs and Fβ: Given costs of FP = $2 (review cost) and FN = $100 (missed abuse), select an Fβ score with an appropriate β and justify your choice. Compute the expected daily cost under Model A and Model B at their current thresholds.
- Thresholding and calibration: Describe how to choose τ via a precision–recall (PR) curve under the constraints precision ≥ 0.7 or FP ≤ 20,000/day. Explain how you would apply probability calibration (e.g., Platt scaling or isotonic regression) before thresholding and why this matters.
- Validation protocol: Describe a time-based cross-validation scheme to avoid leakage, offline-to-online guardrails (e.g., CUPED or an A/A test), and the online metrics you would monitor (e.g., precision among reviewed accounts, review throughput, downstream abuse reduction).
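Illustrative sketches (Python)
The sketches below are non-authoritative worked examples, one per task. All data in them is synthetic, and every array, constant, and helper name (e.g., proba, precision_at_k, daily_flags) is an assumption introduced here rather than something given in the tasks. First, the capacity-aware offline metrics from the first task, computed with scikit-learn on toy data:

```python
# Sketch: precision@K, recall@K, PR-AUC, and Brier score on synthetic data.
import numpy as np
from sklearn.metrics import average_precision_score, brier_score_loss

rng = np.random.default_rng(2)
y = rng.binomial(1, 0.01, size=100_000)                                     # ~1% prevalence
proba = np.clip(0.01 + 0.30 * y + rng.normal(0, 0.05, size=y.size), 0, 1)   # toy calibrated scores

def precision_at_k(y_true, y_score, k):
    """Precision among the k highest-scoring accounts (i.e., the review budget)."""
    top = np.argsort(y_score)[::-1][:k]
    return y_true[top].mean()

K = 5_000  # the 50,000/day budget scaled down to this toy sample size
print("precision@K:", precision_at_k(y, proba, K))
print("recall@K   :", y[np.argsort(proba)[::-1][:K]].sum() / y.sum())
print("PR-AUC     :", average_precision_score(y, proba))
print("Brier score:", brier_score_loss(y, proba))
```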
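For the capacity-feasibility task, expected daily TP and FP follow directly from the stated prevalence, precision, and recall; a minimal arithmetic sketch:

```python
# Sketch: expected daily TP/FP implied by each model's (precision, recall) operating point.
N = 10_000_000          # daily active users
PREVALENCE = 0.01       # true fake rate
FAKES = N * PREVALENCE  # ~100,000 fakes/day
CAPACITY = 50_000       # human review budget per day

def daily_flags(precision, recall, fakes=FAKES):
    """Expected true positives, false positives, and total flags per day."""
    tp = recall * fakes        # fakes caught
    flagged = tp / precision   # accounts sent to human review
    fp = flagged - tp          # legitimate accounts flagged
    return tp, fp, flagged

for name, p, r in [("Model A", 0.60, 0.20), ("Model B", 0.20, 0.80)]:
    tp, fp, flagged = daily_flags(p, r)
    verdict = "fits" if flagged <= CAPACITY else "exceeds"
    print(f"{name}: TP={tp:,.0f}, FP={fp:,.0f}, flagged={flagged:,.0f} ({verdict} capacity)")
```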
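For the business trade-off task, a sketch of Fβ and the expected daily cost at the stated $2/$100 costs; note that it deliberately ignores the 50,000/day review cap, which a full answer should account for:

```python
# Sketch: F-beta and expected daily cost at each model's current operating point.
COST_FP, COST_FN = 2.0, 100.0   # $ per false positive (review) and per false negative (missed abuse)
FAKES = 100_000                 # expected fakes/day from the Context

def f_beta(precision, recall, beta):
    """Generic F-beta; beta > 1 weights recall more heavily than precision."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def expected_daily_cost(precision, recall, fakes=FAKES):
    tp = recall * fakes
    fp = tp / precision - tp
    fn = fakes - tp
    return COST_FP * fp + COST_FN * fn

for name, p, r in [("Model A", 0.60, 0.20), ("Model B", 0.20, 0.80)]:
    print(f"{name}: F2={f_beta(p, r, beta=2):.3f}, "
          f"expected cost=${expected_daily_cost(p, r):,.0f}/day")
```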
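For the thresholding-and-calibration task, a sketch that calibrates raw scores with isotonic regression and then selects the lowest threshold whose validation precision meets the ≥ 0.7 constraint; because recall only grows as the threshold drops, the lowest qualifying threshold maximizes expected true positives. Platt scaling would instead fit a logistic regression on the raw scores. The arrays scores_val and y_val are synthetic stand-ins for a held-out validation window:

```python
# Sketch: calibrate scores, then pick the lowest threshold meeting precision >= 0.7.
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)
y_val = rng.binomial(1, 0.01, size=200_000)                      # ~1% prevalence
scores_val = np.clip(rng.normal(0.2 + 0.5 * y_val, 0.2), 0, 1)   # toy uncalibrated scores

# 1) Calibrate: map raw scores to probabilities (isotonic regression here).
iso = IsotonicRegression(out_of_bounds="clip")
proba_val = iso.fit_transform(scores_val, y_val)

# 2) Threshold: precision_recall_curve gives one precision value per candidate
#    threshold; the smallest threshold with precision >= 0.7 maximizes recall
#    among the thresholds satisfying the constraint.
precision, recall, thresholds = precision_recall_curve(y_val, proba_val)
feasible = precision[:-1] >= 0.7          # align with `thresholds` (one fewer entry)
tau = thresholds[feasible].min() if feasible.any() else None
print("chosen tau:", tau)
```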
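For the validation-protocol task, a sketch of time-ordered splits using scikit-learn's TimeSeriesSplit, so each fold trains only on past data and evaluates on future data; the feature matrix is synthetic and assumed to be sorted by event time:

```python
# Sketch: leakage-safe, time-ordered cross-validation with PR-AUC per fold.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(1)
X = rng.normal(size=(50_000, 5))         # rows assumed ordered by event time
y = rng.binomial(1, 0.01, size=50_000)   # ~1% prevalence

tscv = TimeSeriesSplit(n_splits=4)       # train on the past, test on the future
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores = model.predict_proba(X[test_idx])[:, 1]
    print(f"fold {fold}: PR-AUC = {average_precision_score(y[test_idx], scores):.4f}")
```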