This question evaluates a candidate's competency in statistical experimental design, accuracy estimation under noisy human labels, and reasoning about bias–variance trade-offs when allocating a fixed budget of annotation reviews.
You need to estimate the accuracy of an ML classifier on a population of subjects.
You can afford only K human reviews in total. Each review produces a binary judgment (0/1) for a subject; the judgment is intended to represent the true label, but reviewers may be noisy and can err.
You must choose how to allocate the reviews across subjects, for example:
Option A: review K distinct subjects, one review each (maximum coverage, but each label is a single noisy judgment).
Option B: review K/m subjects, m reviews each (e.g., m = 3), and take the majority vote as the label (less label noise, but fewer subjects covered).
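A minimal Monte Carlo sketch of the trade-off a strong answer might reason through. All numbers here (budget K, true model accuracy, reviewer error rate, the 3-way majority vote) are illustrative assumptions, not part of the question:

```python
import numpy as np

rng = np.random.default_rng(0)

K = 3000          # total review budget (assumed for illustration)
ACC = 0.85        # true model accuracy (assumed)
NOISE = 0.10      # probability a single reviewer flips the true judgment (assumed)
TRIALS = 2000     # Monte Carlo repetitions


def noisy_majority(correct, n_reviews):
    """Majority vote over n_reviews noisy judgments of whether the model is correct."""
    errs = rng.random((len(correct), n_reviews)) < NOISE          # True where a reviewer errs
    judged = np.where(errs, ~correct[:, None], correct[:, None])  # flip judgment on error
    return judged.mean(axis=1) > 0.5


def estimate_accuracy(n_subjects, reviews_per_subject):
    """Estimated accuracy from one simulated review round."""
    correct = rng.random(n_subjects) < ACC   # whether the model is truly correct on each subject
    return noisy_majority(correct, reviews_per_subject).mean()


# Option A: 1 review each for K subjects; Option B: 3-way majority for K/3 subjects.
spread = np.array([estimate_accuracy(K, 1) for _ in range(TRIALS)])
concentrated = np.array([estimate_accuracy(K // 3, 3) for _ in range(TRIALS)])

for name, est in [("1 review x K subjects", spread),
                  ("3 reviews x K/3 subjects", concentrated)]:
    print(f"{name:28s} bias={est.mean() - ACC:+.4f}  sd={est.std():.4f}")
```

Under these assumed numbers, Option A shows a larger downward bias (single noisy labels mislabel about 10% of subjects) but a smaller sampling standard deviation, while Option B cuts the effective label-noise rate to roughly 3% at the cost of a noisier estimate from the smaller subject sample; which effect dominates depends on K and the reviewer error rate.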
Question: Which option is better for estimating the model’s accuracy, and under what assumptions? Provide a statistical argument, discuss bias/variance trade-offs, and propose a practical review design (including how you would quantify uncertainty with a confidence interval).
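For the confidence-interval part, one standard choice is a Wilson score interval on the proportion of reviewed subjects judged "model correct". The sketch below assumes those per-subject judgments behave as i.i.d. Bernoulli draws; the example counts are hypothetical:

```python
from math import sqrt
from scipy.stats import norm

def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval for a binomial proportion (here: estimated accuracy)."""
    z = norm.ppf(1 - (1 - confidence) / 2)
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return center - half, center + half

# e.g. 830 of 1000 reviewed subjects judged "model correct"
print(wilson_interval(830, 1000))   # roughly (0.806, 0.852)
```

Note that this interval captures only sampling variance over subjects; any residual label-noise bias (which the allocation choice above is meant to control) is not reflected in its width.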