Scenario
You are evaluating a new machine-learning model that detects harmful content on a large consumer platform. Leadership needs evidence that the new model outperforms the existing one, and wants to understand the trade-offs between catching more harmful posts and avoiding over-removal of benign posts.
Task
Design a comprehensive evaluation plan to compare the new model against:
- the existing (production) model, and
- optionally, a minimal/no-model baseline (only if it can be run safely behind safeguards).
Address both:
- Offline evaluation using labeled data and confusion-matrix-based metrics.
- Online A/B testing and experiment design.
Make the precision–recall trade-offs explicit, and connect model metrics to business outcomes.
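A minimal sketch of that trade-off, assuming scikit-learn and placeholder labels/scores: sweeping the decision threshold shows how precision (fewer benign removals) and recall (fewer missed harmful posts) move against each other.

```python
# Minimal sketch: sweep decision thresholds to expose the precision-recall
# trade-off. `y_true` (0/1 labels) and `y_score` (model scores) are
# illustrative placeholders for a labeled hold-out set.
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_score = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.90, 0.55, 0.70, 0.60, 0.05])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
# Raising the threshold generally buys precision at the cost of recall.
for p, r, t in zip(precision[:-1], recall[:-1], thresholds):
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```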
Requirements
- Define key metrics: precision, recall, F1/Fβ, ROC-AUC, PR-AUC, calibration (Brier score, reliability diagrams), and threshold-specific metrics (see the metrics sketch after this list).
- Describe dataset design, label quality, and class-imbalance handling (see the dataset sketch after this list).
- Propose thresholding/triage policies, e.g., auto-remove vs. send-to-review (see the triage sketch after this list).
- Outline the online experiment design: unit of randomization, triggering conditions, guardrail metrics, primary/secondary KPIs, and safety and ethical constraints (see the randomization sketch after this list).
- Include validation steps, power/duration considerations (see the power sketch after this list), and guardrails for false positives and user impact.
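Metrics sketch: a minimal illustration of the offline metric suite, assuming scikit-learn. The arrays, the 0.5 operating threshold, and the recall-weighted β=2 are assumptions for illustration, not mandated choices.

```python
# Offline metric suite on a labeled hold-out set (illustrative placeholders).
import numpy as np
from sklearn.metrics import (
    precision_score, recall_score, fbeta_score,
    roc_auc_score, average_precision_score, brier_score_loss,
)

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_score = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.90, 0.55, 0.70, 0.60, 0.05])
y_pred = (y_score >= 0.5).astype(int)  # threshold-specific view at an assumed 0.5

metrics = {
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f2 (recall-weighted)": fbeta_score(y_true, y_pred, beta=2),
    "roc_auc": roc_auc_score(y_true, y_score),
    "pr_auc (average precision)": average_precision_score(y_true, y_score),
    "brier (calibration)": brier_score_loss(y_true, y_score),
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```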
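Dataset sketch: two routine checks, again assuming scikit-learn, with made-up numbers. A stratified split preserves the rare harmful class in the evaluation set, and inter-annotator agreement (Cohen's kappa) serves as a label-quality signal.

```python
# Stratified split under heavy class imbalance, plus a label-quality check.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import cohen_kappa_score

labels = np.array([1] * 50 + [0] * 4950)  # ~1% harmful: heavy imbalance
post_ids = np.arange(labels.size)

train_ids, eval_ids = train_test_split(
    post_ids, test_size=0.2, stratify=labels, random_state=7
)
print("eval prevalence:", labels[eval_ids].mean())  # preserved at ~1%

# Label quality: agreement between two annotators on a double-labeled sample.
annotator_a = np.array([1, 0, 0, 1, 0, 1, 0, 0])
annotator_b = np.array([1, 0, 1, 1, 0, 1, 0, 0])
print("Cohen's kappa:", cohen_kappa_score(annotator_a, annotator_b))
```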
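Triage sketch: one possible two-threshold policy in which high-confidence scores are auto-removed, an uncertain band is sent to human review, and the rest are allowed. The threshold values are hypothetical and would be tuned against precision targets and reviewer capacity.

```python
# Two-threshold triage policy (threshold values are hypothetical).
AUTO_REMOVE_THRESHOLD = 0.95  # demand very high precision before automating removal
REVIEW_THRESHOLD = 0.60       # uncertain band goes to human reviewers

def triage(score: float) -> str:
    """Map a model score to an enforcement action."""
    if score >= AUTO_REMOVE_THRESHOLD:
        return "auto_remove"
    if score >= REVIEW_THRESHOLD:
        return "send_to_review"
    return "allow"

for s in (0.98, 0.72, 0.10):
    print(s, "->", triage(s))
```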
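Randomization sketch: deterministic user-level assignment, where hashing a salted user ID gives each user a stable arm across sessions; that keeps exposure consistent and supports user-level analysis. The salt and the 50/50 split are assumptions.

```python
# Deterministic user-level assignment via hashing (salt and split are assumed).
import hashlib

def assign_arm(user_id: str, salt: str = "harm-model-eval", treatment_share: float = 0.5) -> str:
    """Return a stable experiment arm for a given user."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform value in [0, 1]
    return "treatment" if bucket < treatment_share else "control"

print(assign_arm("user_12345"))  # same arm every time for this user
```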
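Power sketch: sizing a two-proportion comparison with statsmodels. The baseline rate, minimum detectable effect, and daily triggered traffic are illustrative assumptions.

```python
# Rough power/duration check for the A/B test (all inputs are assumptions).
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.010  # assumed rate of the primary bad-outcome KPI per unit
target_rate = 0.009    # smallest improvement worth detecting (10% relative)

effect_size = proportion_effectsize(baseline_rate, target_rate)
n_per_arm = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.8, alternative="two-sided"
)

daily_triggered_units = 200_000  # hypothetical eligible traffic per day
print(f"~{n_per_arm:,.0f} units per arm")
print(f"~{2 * n_per_arm / daily_triggered_units:.1f} days at assumed traffic")
```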