Choose metrics for fake-user classifier
Company: Meta
Role: Data Scientist
Category: Machine Learning
Difficulty: medium
Interview Round: Onsite
You suspect many fake users are inflating comment counts. You will build a classifier to flag fake accounts for review. Propose and justify evaluation metrics and thresholding under two operational constraints, and do the required calculations:
Context: 10,000,000 daily active users; true fake rate ~1%; review capacity 50,000 accounts/day. Two candidate models produce the following validation metrics at chosen thresholds:
- Model A: precision = 0.60, recall = 0.20 at threshold τA.
- Model B: precision = 0.20, recall = 0.80 at threshold τB.
Tasks:
1) Choose offline metrics: Explain when to prefer PR-AUC over ROC-AUC. Specify primary metrics, including precision@K, recall@K, PR-AUC, the Brier score (to check calibration quality), and cost-weighted utility. Justify these choices given the severe class imbalance and limited review capacity.
2) Capacity feasibility: For each model at its given threshold, compute expected true positives and false positives per day if applied to the full population. State whether each fits within the 50,000/day capacity and, if not, how you would set K or raise the threshold to meet capacity while maximizing expected true positives.
3) Business trade-offs: Given costs of FP = $2 (review cost) and FN = $100 (missed abuse), select an Fβ score with an appropriate β and justify the choice. Show the expected daily cost under Model A and Model B at their current thresholds.
4) Thresholding and calibration: Describe how you would choose τ from the precision-recall curve subject to precision ≥ 0.7 or FP ≤ 20,000/day, and explain how you would apply probability calibration (Platt scaling or isotonic regression) before thresholding.
5) Validation protocol: Describe time-based cross-validation to avoid leakage, offline-to-online guardrails (e.g., CUPED or an A/A test), and the online metrics you would monitor: precision among reviewed accounts, review throughput, and downstream abuse reduction.
Quick Answer: With roughly 100,000 true fakes/day (1% of 10M), Model A at τA flags ≈33,333 accounts (≈20,000 true positives, ≈13,333 false positives) and fits the 50,000/day review capacity; Model B flags ≈400,000 accounts (≈80,000 true positives, ≈320,000 false positives), about 8× over capacity. At FP = $2 and FN = $100, the expected daily cost at the stated thresholds (ignoring capacity) is ≈$8.03M for Model A versus ≈$2.64M for Model B, so the high FN cost favors recall and a recall-weighted Fβ (β > 1). The practical approach is to calibrate scores, rank accounts, and review the top K = 50,000 (or raise τ) to maximize expected true positives within capacity. The question tests metric selection under severe class imbalance, thresholding and probability calibration, and cost- and capacity-constrained precision-recall trade-offs; illustrative sketches of the key calculations follow.
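A minimal sketch of the capacity-oriented offline metrics from task 1. The labels and scores here are synthetic placeholders (an assumption; real scores would come from a model over account behavior): precision@K and recall@K measure quality within the review budget, average precision approximates PR-AUC, and the Brier score checks calibration.

```python
import numpy as np
from sklearn.metrics import average_precision_score, brier_score_loss

def precision_recall_at_k(y_true, y_score, k):
    """Precision and recall among the k highest-scored accounts."""
    order = np.argsort(y_score)[::-1][:k]   # indices of the top-k scores
    tp = y_true[order].sum()                # true fakes inside the review budget
    return tp / k, tp / y_true.sum()

# Toy data: 1 = fake, 0 = genuine, ~1% fake rate, weakly informative scores.
rng = np.random.default_rng(0)
y_true = (rng.random(200_000) < 0.01).astype(int)
y_score = np.clip(y_true * 0.3 + rng.random(200_000) * 0.7, 0, 1)

p_at_k, r_at_k = precision_recall_at_k(y_true, y_score, k=1_000)
print(f"precision@1000={p_at_k:.3f} recall@1000={r_at_k:.3f}")
print(f"PR-AUC (average precision)={average_precision_score(y_true, y_score):.3f}")
print(f"Brier score={brier_score_loss(y_true, y_score):.4f}")
```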
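A quick arithmetic sketch for tasks 2 and 3, assuming the validation precision/recall carry over unchanged to the full 10M-user population. The β heuristic shown (β² equal to the FN/FP cost ratio) is one common convention, not the only defensible choice.

```python
# Assumption: validation precision/recall generalize to the full 10M-user population.
DAU, FAKE_RATE = 10_000_000, 0.01
CAPACITY, COST_FP, COST_FN = 50_000, 2, 100
fakes = DAU * FAKE_RATE                      # ~100,000 true fake accounts per day
beta = (COST_FN / COST_FP) ** 0.5            # heuristic: beta^2 = cost(FN) / cost(FP)

for name, precision, recall in [("Model A", 0.60, 0.20), ("Model B", 0.20, 0.80)]:
    tp = recall * fakes                      # fakes correctly flagged
    flagged = tp / precision                 # accounts sent to human review
    fp = flagged - tp                        # genuine accounts flagged
    fn = fakes - tp                          # fakes missed
    cost = fp * COST_FP + fn * COST_FN       # expected daily cost, ignoring capacity
    f_beta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    fits = "fits" if flagged <= CAPACITY else "exceeds"
    print(f"{name}: TP={tp:,.0f} FP={fp:,.0f} flagged={flagged:,.0f} "
          f"({fits} {CAPACITY:,}/day), cost=${cost:,.0f}, F_beta={f_beta:.3f}")
```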
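A sketch of task 4 with scikit-learn: calibrate probabilities first (Platt scaling via method="sigmoid", or "isotonic"), then pick the lowest threshold on the precision-recall curve that still satisfies precision ≥ 0.7, which maximizes recall subject to the constraint. The synthetic data and the base model are placeholders, not the production setup.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Placeholder data: ~1% positives with a weak signal separating fakes from genuine users.
rng = np.random.default_rng(1)
X = rng.normal(size=(60_000, 5))
y = (rng.random(60_000) < 0.01).astype(int)
X[y == 1] += 1.5
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

# Calibrate scores before thresholding: Platt scaling ("sigmoid") or "isotonic".
clf = CalibratedClassifierCV(GradientBoostingClassifier(), method="sigmoid", cv=3)
clf.fit(X_tr, y_tr)
proba = clf.predict_proba(X_val)[:, 1]

# precision_recall_curve returns one more (precision, recall) point than thresholds,
# so drop the last point; the lowest threshold meeting precision >= 0.7 yields the
# highest recall, because recall is non-increasing in the threshold.
prec, rec, thr = precision_recall_curve(y_val, proba)
meets = prec[:-1] >= 0.7
tau = thr[meets].min() if meets.any() else None
print("chosen tau:", tau)
# (The FP <= 20,000/day constraint works the same way: scale each candidate
#  threshold's expected FP to the full population and keep only feasible taus.)
```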
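For the validation protocol in task 5, a small sketch of time-ordered splitting using scikit-learn's TimeSeriesSplit, assuming rows are sorted by event time: each fold trains only on earlier days and evaluates on later days, which is what prevents feature and label leakage from the future.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Rows are assumed sorted by event time; each fold trains on earlier days only.
n_days, rows_per_day = 30, 1_000
day = np.repeat(np.arange(n_days), rows_per_day)        # toy daily index per row
X = np.random.default_rng(2).normal(size=(n_days * rows_per_day, 4))

for fold, (train_idx, test_idx) in enumerate(TimeSeriesSplit(n_splits=4).split(X)):
    print(f"fold {fold}: train days 0-{day[train_idx].max()}, "
          f"test days {day[test_idx].min()}-{day[test_idx].max()}")
```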