Choose metrics for fake-user classifier
Company: Meta
Role: Data Scientist
Category: Machine Learning
Difficulty: medium
Interview Round: Onsite
You suspect many fake users are inflating comment counts. You will build a classifier to flag fake accounts for review. Propose and justify evaluation metrics and thresholding under two operational constraints, and do the required calculations:
Context: 10,000,000 daily active users; true fake rate ~1%; review capacity 50,000 accounts/day. Two candidate models produce the following validation metrics at chosen thresholds:
- Model A: precision = 0.60, recall = 0.20 at threshold τA.
- Model B: precision = 0.20, recall = 0.80 at threshold τB.
Tasks:
1) Choose offline metrics: Explain when to prefer PR-AUC over ROC-AUC. Specify primary metrics, including precision@K, recall@K, PR-AUC, the Brier score (to check calibration quality), and cost-weighted utility. Justify these choices given the severe class imbalance and limited review capacity.
2) Capacity feasibility: For each model at its given threshold, compute expected true positives and false positives per day if applied to the full population. State whether each fits within the 50,000/day capacity and, if not, how you would set K or raise the threshold to meet capacity while maximizing expected true positives.
3) Business trade-offs: Given costs of FP = $2 (review cost) and FN = $100 (missed abuse), select an Fβ score with an appropriate β and justify the choice. Show the expected daily cost under Model A and Model B at their current thresholds.
4) Thresholding and calibration: Describe how you would choose τ from the precision-recall curve subject to precision ≥ 0.7 or FP ≤ 20,000/day, and explain how you would apply probability calibration (Platt scaling or isotonic regression) before thresholding.
5) Validation protocol: Describe time-based cross-validation to avoid leakage, offline-to-online guardrails (e.g., CUPED or an A/A test), and the online metrics you would monitor: precision among reviewed accounts, review throughput, and downstream abuse reduction.
Quick Answer: With roughly 100,000 true fakes/day (1% of 10M), Model A at τA flags ≈33,333 accounts (≈20,000 true positives, ≈13,333 false positives) and fits the 50,000/day review capacity; Model B flags ≈400,000 accounts (≈80,000 true positives, ≈320,000 false positives), about 8× over capacity. At FP = $2 and FN = $100, the expected daily cost at the stated thresholds (ignoring capacity) is ≈$8.03M for Model A versus ≈$2.64M for Model B, so the high FN cost favors recall and a recall-weighted Fβ (β > 1). The practical approach is to calibrate scores, rank accounts, and review the top K = 50,000 (or raise τ) to maximize expected true positives within capacity. The question tests metric selection under severe class imbalance, thresholding and probability calibration, and cost- and capacity-constrained precision-recall trade-offs; illustrative sketches of the key calculations follow.
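A minimal sketch of the capacity-oriented offline metrics from task 1. The labels and scores here are synthetic placeholders (an assumption; real scores would come from a model over account behavior): precision@K and recall@K measure quality within the review budget, average precision approximates PR-AUC, and the Brier score checks calibration.

```python
import numpy as np
from sklearn.metrics import average_precision_score, brier_score_loss

def precision_recall_at_k(y_true, y_score, k):
    """Precision and recall among the k highest-scored accounts."""
    order = np.argsort(y_score)[::-1][:k]   # indices of the top-k scores
    tp = y_true[order].sum()                # true fakes inside the review budget
    return tp / k, tp / y_true.sum()

# Toy data: 1 = fake, 0 = genuine, ~1% fake rate, weakly informative scores.
rng = np.random.default_rng(0)
y_true = (rng.random(200_000) < 0.01).astype(int)
y_score = np.clip(y_true * 0.3 + rng.random(200_000) * 0.7, 0, 1)

p_at_k, r_at_k = precision_recall_at_k(y_true, y_score, k=1_000)
print(f"precision@1000={p_at_k:.3f} recall@1000={r_at_k:.3f}")
print(f"PR-AUC (average precision)={average_precision_score(y_true, y_score):.3f}")
print(f"Brier score={brier_score_loss(y_true, y_score):.4f}")
```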
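A quick arithmetic sketch for tasks 2 and 3, assuming the validation precision/recall carry over unchanged to the full 10M-user population. The β heuristic shown (β² equal to the FN/FP cost ratio) is one common convention, not the only defensible choice.

```python
# Assumption: validation precision/recall generalize to the full 10M-user population.
DAU, FAKE_RATE = 10_000_000, 0.01
CAPACITY, COST_FP, COST_FN = 50_000, 2, 100
fakes = DAU * FAKE_RATE                      # ~100,000 true fake accounts per day
beta = (COST_FN / COST_FP) ** 0.5            # heuristic: beta^2 = cost(FN) / cost(FP)

for name, precision, recall in [("Model A", 0.60, 0.20), ("Model B", 0.20, 0.80)]:
    tp = recall * fakes                      # fakes correctly flagged
    flagged = tp / precision                 # accounts sent to human review
    fp = flagged - tp                        # genuine accounts flagged
    fn = fakes - tp                          # fakes missed
    cost = fp * COST_FP + fn * COST_FN       # expected daily cost, ignoring capacity
    f_beta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    fits = "fits" if flagged <= CAPACITY else "exceeds"
    print(f"{name}: TP={tp:,.0f} FP={fp:,.0f} flagged={flagged:,.0f} "
          f"({fits} {CAPACITY:,}/day), cost=${cost:,.0f}, F_beta={f_beta:.3f}")
```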
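A sketch of task 4 with scikit-learn: calibrate probabilities first (Platt scaling via method="sigmoid", or "isotonic"), then pick the lowest threshold on the precision-recall curve that still satisfies precision ≥ 0.7, which maximizes recall subject to the constraint. The synthetic data and the base model are placeholders, not the production setup.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Placeholder data: ~1% positives with a weak signal separating fakes from genuine users.
rng = np.random.default_rng(1)
X = rng.normal(size=(60_000, 5))
y = (rng.random(60_000) < 0.01).astype(int)
X[y == 1] += 1.5
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

# Calibrate scores before thresholding: Platt scaling ("sigmoid") or "isotonic".
clf = CalibratedClassifierCV(GradientBoostingClassifier(), method="sigmoid", cv=3)
clf.fit(X_tr, y_tr)
proba = clf.predict_proba(X_val)[:, 1]

# precision_recall_curve returns one more (precision, recall) point than thresholds,
# so drop the last point; the lowest threshold meeting precision >= 0.7 yields the
# highest recall, because recall is non-increasing in the threshold.
prec, rec, thr = precision_recall_curve(y_val, proba)
meets = prec[:-1] >= 0.7
tau = thr[meets].min() if meets.any() else None
print("chosen tau:", tau)
# (The FP <= 20,000/day constraint works the same way: scale each candidate
#  threshold's expected FP to the full population and keep only feasible taus.)
```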
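For the validation protocol in task 5, a small sketch of time-ordered splitting using scikit-learn's TimeSeriesSplit, assuming rows are sorted by event time: each fold trains only on earlier days and evaluates on later days, which is what prevents feature and label leakage from the future.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Rows are assumed sorted by event time; each fold trains on earlier days only.
n_days, rows_per_day = 30, 1_000
day = np.repeat(np.arange(n_days), rows_per_day)        # toy daily index per row
X = np.random.default_rng(2).normal(size=(n_days * rows_per_day, 4))

for fold, (train_idx, test_idx) in enumerate(TimeSeriesSplit(n_splits=4).split(X)):
    print(f"fold {fold}: train days 0-{day[train_idx].max()}, "
          f"test days {day[test_idx].min()}-{day[test_idx].max()}")
```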