You need to estimate the accuracy of an ML classifier on a population of subjects.
You can only afford K total human reviews. Each human review produces a binary judgment (0/1) for a subject (assume it is intended to represent the “true label,” but reviewers may be noisy).
You must choose how to allocate reviews:
- Option 1: review K different subjects once each (1 review per subject).
- Option 2: review fewer subjects, but assign multiple independent reviews per subject, and use majority vote (or another aggregation).
Question: Which option is better for estimating the model’s accuracy, and under what assumptions? Provide a statistical argument, discuss bias/variance trade-offs, and propose a practical review design (including how you would quantify uncertainty with a confidence interval).
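To make the trade-off concrete, here is a minimal simulation sketch of the two options under a fixed budget K. It is not a definitive answer. Everything in it is an assumption made for illustration: the function names (`majority_error`, `expected_observed`, `simulate`), the true accuracy of 0.9, the reviewer flip rate of 0.1, and the simplification that each review is an independent noisy indicator of whether the model was correct, flipped with the same probability for every subject and reviewer. The interval used is a plain Wald (normal-approximation) interval.

```python
import math
import random


def majority_error(flip, m):
    """P(majority of m independent reviews is wrong); each review is wrong
    with probability `flip`. Assumes m is odd so there are no ties."""
    return sum(
        math.comb(m, j) * flip**j * (1 - flip) ** (m - j)
        for j in range(m // 2 + 1, m + 1)
    )


def expected_observed(true_acc, flip):
    """Expected measured accuracy when the aggregated label is wrong with
    probability `flip`, independently of whether the model was correct."""
    return true_acc * (1 - flip) + (1 - true_acc) * flip


def simulate(k_budget, m, true_acc, flip, rng):
    """Spend k_budget reviews with m reviews per subject.
    Returns (estimated accuracy, 95% Wald half-width)."""
    n = k_budget // m  # number of distinct subjects we can afford
    hits = 0
    for _ in range(n):
        model_correct = rng.random() < true_acc
        # Each review reports model_correct, flipped with probability `flip`.
        yes_votes = sum(model_correct != (rng.random() < flip) for _ in range(m))
        hits += yes_votes * 2 > m  # majority says the model was correct
    p_hat = hits / n
    half = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, half


if __name__ == "__main__":
    rng = random.Random(0)
    for m in (1, 3):  # Option 1 vs. Option 2 with three reviews per subject
        p_hat, half = simulate(3000, m, true_acc=0.9, flip=0.1, rng=rng)
        target = expected_observed(0.9, majority_error(0.1, m))
        print(f"m={m}: estimate={p_hat:.3f} +/- {half:.3f}, expected={target:.4f}")
```

Under these assumed values, the single-review estimate is centered on 0.82 rather than the true 0.9, while the three-review majority vote is centered on about 0.878. Majority voting reduces the bias caused by label noise, but it covers only a third as many subjects, so its interval is wider. Note that the Wald interval reflects sampling variance only; it does not account for the remaining label-noise bias.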