Evaluate a model and choose metrics
Company: Apple
Role: Data Scientist
Category: Analytics & Experimentation
Difficulty: Hard
Interview Round: Onsite
You own a fraud-screening model for e-commerce orders. The fraud base rate is 0.7%. Per-order costs: a flag triggers a $3 manual review; flagging a legitimate order adds $1 of customer friction; a missed fraud costs $120; correctly passing a legitimate order costs $0. On a 100,000-order validation set (700 positives), two candidate models at threshold 0.5 yield:

Model A: TP=490, FP=4,900, FN=210, TN=94,400
Model B: TP=560, FP=8,400, FN=140, TN=90,900

Tasks:
(a) Compute precision, recall, F1, the ROC operating point (TPR/FPR) as an AUC proxy, and expected cost per order for A and B at threshold 0.5. Which model is better under the stated costs?
(b) Derive the cost-optimal threshold in general, in terms of calibrated P(y=1|x) and the costs; apply it here assuming perfect calibration and the stated base rate.
(c) Discuss PR-AUC vs. ROC-AUC under extreme class imbalance, calibration checks (Brier score, ECE), and decision curve analysis / net benefit.
(d) Propose an offline evaluation plan robust to prevalence shift and a safe online A/B test with guardrails (manual-review SLAs, false-accusation rate, a holdout for drift), and explain how you would monitor post-launch for concept drift and fairness across user segments.
Quick Answer: At threshold 0.5, Model A has precision 9.1%, recall 70%, F1 0.161, and an expected cost of about $0.463/order; Model B has precision 6.25%, recall 80%, F1 0.116, and about $0.521/order. Model A wins under the stated costs: B's 70 extra catches save $8,190 in misses, but its 3,500 extra false positives cost $14,000. The cost-optimal rule flags whenever calibrated P(y=1|x) ≥ 4/121 ≈ 3.3%, far below the default 0.5, so both models are mis-thresholded as given. At 0.7% prevalence, PR-AUC, calibration diagnostics (Brier, ECE), and decision-curve net benefit are more informative than ROC-AUC alone. Worked sketches for tasks (a) through (d) follow.
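For task (a), a minimal Python sketch of the arithmetic, under the reading that the $3 review fee applies to every flagged order (TP and FP) and the $1 friction adds only on false positives:

```python
# Confusion-matrix metrics and expected cost for the two candidate models.
# Cost reading (an assumption): $3 review on every flag, +$1 friction on a
# false positive, $120 on a missed fraud, $0 on a correct pass.
models = {
    "A": dict(tp=490, fp=4_900, fn=210, tn=94_400),
    "B": dict(tp=560, fp=8_400, fn=140, tn=90_900),
}

for name, m in models.items():
    tp, fp, fn, tn = m["tp"], m["fp"], m["fn"], m["tn"]
    n = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                      # recall = TPR
    f1 = 2 * precision * recall / (precision + recall)
    fpr = fp / (fp + tn)                         # ROC operating point: (FPR, TPR)
    cost = 3 * tp + (3 + 1) * fp + 120 * fn      # TN contributes $0
    print(f"Model {name}: precision={precision:.3f} recall={recall:.3f} "
          f"F1={f1:.3f} FPR={fpr:.4f} cost/order=${cost / n:.4f}")

# Model A: precision=0.091 recall=0.700 F1=0.161 FPR=0.0493 cost/order=$0.4627
# Model B: precision=0.062 recall=0.800 F1=0.116 FPR=0.0846 cost/order=$0.5208
```

Note that Model A's single ROC point (FPR 0.049, TPR 0.70) and Model B's (FPR 0.085, TPR 0.80) are only a proxy: one threshold per model cannot rank them by full ROC-AUC.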
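For task (b), the same cost reading gives a closed-form threshold. Written as a LaTeX fragment, with p the calibrated P(y=1|x) and c_rev, c_fric, c_miss standing for the $3, $1, and $120 costs:

```latex
% Flag iff the expected cost of flagging is at most the expected cost of passing.
\text{flag} \iff
\underbrace{c_{\mathrm{rev}} + (1-p)\,c_{\mathrm{fric}}}_{\text{review} + \text{friction if legit}}
\le \underbrace{p\,c_{\mathrm{miss}}}_{\text{expected miss cost}}
\iff p \ge \frac{c_{\mathrm{rev}} + c_{\mathrm{fric}}}{c_{\mathrm{miss}} + c_{\mathrm{fric}}}
= \frac{3 + 1}{120 + 1} = \frac{4}{121} \approx 0.033
```

With perfect calibration, the optimal policy flags any order scoring above about 3.3%, roughly 4.7 times the 0.7% base rate; the default 0.5 threshold is far too conservative for these costs.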
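For task (c), a sketch of the two calibration checks the question names, Brier score and expected calibration error (ECE) with equal-width bins; the arrays at the bottom are hypothetical stand-ins, since the prompt supplies no per-order scores:

```python
import numpy as np

def brier_score(y_true: np.ndarray, y_prob: np.ndarray) -> float:
    # Mean squared error between the predicted probability and the 0/1 label.
    return float(np.mean((y_prob - y_true) ** 2))

def expected_calibration_error(y_true, y_prob, n_bins: int = 10) -> float:
    # Weighted average of |observed fraud rate - mean score| over score bins.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(y_prob, edges) - 1, 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            ece += mask.mean() * abs(y_true[mask].mean() - y_prob[mask].mean())
    return float(ece)

# Hypothetical demo data, well calibrated by construction.
rng = np.random.default_rng(0)
y_prob = rng.uniform(size=10_000)
y_true = (rng.uniform(size=10_000) < y_prob).astype(int)
print(brier_score(y_true, y_prob), expected_calibration_error(y_true, y_prob))
```

One caution worth raising in the discussion: at 0.7% prevalence, a model that predicts near zero everywhere already achieves a Brier score around 0.007, so pair these diagnostics with PR-AUC and net benefit rather than relying on ROC-AUC, which is dominated by the 99.3% negative class.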
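For task (d), one offline robustness check is to hold each model's TPR/FPR fixed and replay the expected cost at shifted fraud base rates; an illustrative sketch (the sweep values below are arbitrary, not prescribed by the prompt):

```python
# Expected cost per order at a hypothetical base rate, holding TPR/FPR fixed.
def cost_at_base_rate(tp: int, fp: int, fn: int, tn: int, rate: float) -> float:
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    # Per order: positives cost $3 if caught, $120 if missed; negatives cost
    # $4 if falsely flagged ($3 review + $1 friction), $0 otherwise.
    return rate * (3 * tpr + 120 * (1 - tpr)) + (1 - rate) * 4 * fpr

for rate in (0.007, 0.02, 0.05):
    a = cost_at_base_rate(490, 4_900, 210, 94_400, rate)
    b = cost_at_base_rate(560, 8_400, 140, 90_900, rate)
    print(f"base rate {rate:.1%}: A=${a:.4f}/order  B=${b:.4f}/order")

# base rate 0.7%: A=$0.4627/order  B=$0.5208/order
# base rate 2.0%: A=$0.9554/order  B=$0.8596/order
# base rate 5.0%: A=$2.0925/order  B=$1.6415/order
```

The ranking flips near a 1.2% base rate: Model B's extra recall wins once fraud prevalence roughly doubles. Surfacing that sensitivity offline is exactly what the evaluation plan should do before committing to an online A/B with review-SLA, false-accusation-rate, and drift-holdout guardrails.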