Problem
You are given a text dataset for a binary classification task (label in {0,1}). Each example has been labeled by multiple human annotators, and annotators often disagree (i.e., the same item can have conflicting labels).
You need to:
- Perform a dataset/label analysis to understand the disagreement and likely label noise (a minimal analysis sketch follows this list).
- Propose a training and evaluation approach that improves offline metrics (e.g., F1 / AUC / accuracy), given the noisy multi-annotator labels.
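As a concrete starting point for the analysis item, here is a minimal disagreement-analysis sketch in Python/pandas. It assumes the long-format annotation layout sketched under the assumptions below; the column names (`text_id`, `annotator_id`, `label`) are hypothetical, not given by the problem.

```python
import pandas as pd


def disagreement_summary(annotations: pd.DataFrame) -> pd.DataFrame:
    """Per-item statistics: annotation count, positive rate, and a flag for
    items whose annotators disagree."""
    per_item = annotations.groupby("text_id")["label"].agg(
        n_annotations="count",
        positive_rate="mean",
    )
    per_item["disagreement"] = (per_item["positive_rate"] > 0) & (
        per_item["positive_rate"] < 1
    )
    return per_item


def annotator_vs_majority(annotations: pd.DataFrame) -> pd.Series:
    """Fraction of each annotator's labels that differ from the per-item
    majority vote, a rough proxy for annotator reliability."""
    majority = (
        annotations.groupby("text_id")["label"]
        .mean()
        .ge(0.5)
        .astype(int)
        .rename("majority_label")
        .reset_index()
    )
    merged = annotations.merge(majority, on="text_id")
    disagrees = merged["label"] != merged["majority_label"]
    return disagrees.groupby(merged["annotator_id"]).mean().sort_values(ascending=False)
```

Beyond these basics, chance-corrected agreement measures (e.g., Fleiss' kappa or Krippendorff's alpha) and disagreement broken down by annotator, time, and text characteristics would round out the picture.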
Assumptions you may make (state them clearly)
- You have access to raw text, per-annotator labels, annotator IDs, and timestamps (an example record layout is sketched after this list).
- You can retrain models and change the label-aggregation strategy, but you may have limited or no ability to collect new labels.
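For concreteness, the sketches in this document assume the raw annotations sit in a long format, one row per (item, annotator) pair. All column names and values below are hypothetical, chosen only to illustrate the assumed fields.

```python
import pandas as pd

# Hypothetical long-format layout: one row per (item, annotator) pair.
annotations = pd.DataFrame(
    {
        "text_id": [101, 101, 101, 102, 102],
        "text": ["great service, would recommend"] * 3
        + ["delivery was fine I guess"] * 2,
        "annotator_id": ["a1", "a2", "a3", "a1", "a3"],
        "label": [1, 1, 0, 0, 1],  # each annotator's binary judgement
        "timestamp": pd.to_datetime(
            ["2024-03-01", "2024-03-01", "2024-03-02", "2024-03-02", "2024-03-03"]
        ),
    }
)
```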
Deliverables
- What analyses would you run, and what would you look for?
- How would you construct train/validation/test splits to avoid misleading offline metrics? (A grouped-split sketch follows this list.)
- How would you convert multi-annotator labels into training targets? (An aggregation sketch follows this list.)
- What model/loss/thresholding/calibration choices would you try, and why? (A threshold-tuning sketch follows this list.)
- What failure modes and edge cases could cause offline metric gains to be illusory?
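To make the training-target question concrete, one common baseline (not the only option) is to keep both a hard majority-vote label and a soft label equal to the fraction of positive annotations. The sketch below assumes the hypothetical `annotations` layout from the assumptions section.

```python
import pandas as pd


def build_targets(annotations: pd.DataFrame) -> pd.DataFrame:
    """One row per item: soft label, hard majority-vote label, annotation count."""
    targets = annotations.groupby("text_id").agg(
        text=("text", "first"),
        soft_label=("label", "mean"),  # fraction of annotators who said 1
        n_annotations=("label", "count"),
    )
    # Hard label via majority vote; ties (soft_label == 0.5) go to the
    # positive class here purely for illustration.
    targets["hard_label"] = (targets["soft_label"] >= 0.5).astype(int)
    return targets.reset_index()
```

Soft labels can be fed directly to a probabilistic loss; for example, PyTorch's `BCEWithLogitsLoss` accepts targets in [0, 1], so no rounding to hard labels is required.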
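For the split question, a leakage-aware baseline is to split at the item level and group by normalized text, so duplicated texts never straddle train/validation/test. A sketch using scikit-learn's `GroupShuffleSplit`, operating on the hypothetical item-level `targets` table from the previous sketch:

```python
from sklearn.model_selection import GroupShuffleSplit


def grouped_item_split(targets, test_size=0.2, seed=0):
    """Train/test indices over item-level targets with no text shared across splits."""
    # Group by lightly normalized text so duplicates (up to case and
    # whitespace) stay on one side of the split.
    groups = targets["text"].str.strip().str.lower()
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(targets, groups=groups))
    return train_idx, test_idx
```

Time-based or annotator-disjoint splits are other options worth considering when drift or annotator effects are suspected.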
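And for the thresholding part of the model/loss question, one simple and common procedure is to sweep decision thresholds on held-out validation predictions and keep the one that maximizes the target metric. `val_labels` and `val_probs` below are assumed outputs of whatever model is trained.

```python
import numpy as np
from sklearn.metrics import f1_score


def best_f1_threshold(val_labels: np.ndarray, val_probs: np.ndarray) -> float:
    """Return the decision threshold that maximizes F1 on validation data."""
    thresholds = np.linspace(0.05, 0.95, 19)
    scores = [f1_score(val_labels, (val_probs >= t).astype(int)) for t in thresholds]
    return float(thresholds[int(np.argmax(scores))])
```

The chosen threshold (and any probability calibration, e.g. Platt scaling or isotonic regression) should be fit on validation data only and then applied unchanged to the test split.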