Annotation Agreement Rate: Definition, Measurement, Limitations, and LLM-as-Judge Practices
Context
In labeling datasets and evaluating models, we often rely on humans (or LLMs) to assign labels. The annotation agreement rate captures how consistently different annotators assign the same label to the same item, and therefore how reliable those labels are for training and evaluation.
Tasks
- Define annotation agreement rate in the context of labeling and model evaluation.
- Explain how to measure agreement with common metrics, noting what each metric corrects for (see the kappa sketch after this list):
  - Cohen’s kappa (two raters)
  - Fleiss’ kappa (many raters)
  - Krippendorff’s alpha (varied data types, missing labels)
- Discuss limitations and pitfalls, including:
  - Class imbalance and prevalence effects
  - Skewed label distributions
  - Ordinal vs. nominal scales
  - Multi-label settings
  - Annotator bias and expertise variance
  - Ambiguity in labeling guidelines
- Compare humans vs. large language models (LLMs) as judges across:
  - Cost, speed, consistency, bias/fairness, domain expertise, privacy/security, transparency
- Propose concrete methods to improve LLM-as-judge performance:
  - Design clear rubrics and scoring scales
  - Provide few-shot exemplars and reference answers
  - Use structured outputs
  - Calibrate temperature and randomness
  - Apply self-consistency or multi-pass adjudication (see the judge sketch after this list)
  - Add gold checks and disagreement flags
  - Perform periodic human spot-audits
  - Report reliability with confidence intervals (see the bootstrap sketch after this list)
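
For the measurement task, here is a minimal sketch of Cohen’s kappa for two raters, computed directly from its definition kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each rater’s marginal label frequencies. The raters and labels below are toy data for illustration.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items (nominal labels)."""
    assert len(rater_a) == len(rater_b) and rater_a, "need paired, non-empty ratings"
    n = len(rater_a)

    # Observed agreement: fraction of items where the two raters match.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of the raters' marginal label frequencies, summed over labels.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    p_e = sum((freq_a[label] / n) * (freq_b[label] / n) for label in labels)

    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0

# Toy example: two annotators labeling the same 8 items.
a = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg"]
b = ["pos", "neg", "neg", "neg", "pos", "neg", "pos", "pos"]
print(f"kappa = {cohens_kappa(a, b):.3f}")  # 0.500 here: 0.75 observed vs. 0.50 chance agreement
```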
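For structured outputs and self-consistency, here is a sketch of a judge loop that forces a small JSON verdict and majority-votes over several passes. The call_judge_model wrapper, the rubric wording, and the 0.6 disagreement threshold are assumptions for illustration, not part of the tasks above; swap in your own LLM client and tune the threshold.

```python
import json
from collections import Counter

RUBRIC = """Score the answer from 1 (wrong) to 5 (fully correct and well supported).
Return ONLY JSON: {"score": <1-5>, "rationale": "<one sentence>"}"""

def call_judge_model(prompt: str, temperature: float = 0.3) -> str:
    """Hypothetical wrapper around an LLM API; replace with your provider's client."""
    raise NotImplementedError("plug in your LLM client here")

def judge_once(question: str, answer: str) -> dict:
    # Structured output: request a small JSON object so scores are machine-checkable.
    prompt = f"{RUBRIC}\n\nQuestion: {question}\nAnswer: {answer}"
    return json.loads(call_judge_model(prompt))

def judge_with_self_consistency(question: str, answer: str, passes: int = 5) -> dict:
    # Multi-pass adjudication: sample several judgments and keep the majority score.
    scores = [judge_once(question, answer)["score"] for _ in range(passes)]
    majority, count = Counter(scores).most_common(1)[0]
    return {
        "score": majority,
        "agreement": count / passes,            # fraction of passes that agree with the majority
        "needs_review": count / passes < 0.6,   # disagreement flag for human spot-audit (threshold is illustrative)
    }
```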
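For reporting reliability with confidence intervals, here is a percentile-bootstrap sketch that resamples items with replacement. It is shown for the raw percent-agreement rate; passing the cohens_kappa function from the kappa sketch via stat= yields a kappa interval instead. The resample count and toy data are illustrative choices.

```python
import random

def percent_agreement(rater_a, rater_b):
    """Raw agreement rate: fraction of items where the two raters give the same label."""
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)

def bootstrap_ci(rater_a, rater_b, stat=percent_agreement, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for any pairwise agreement statistic (e.g. percent agreement or kappa)."""
    rng = random.Random(seed)
    n = len(rater_a)
    samples = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample item indices with replacement
        samples.append(stat([rater_a[i] for i in idx], [rater_b[i] for i in idx]))
    samples.sort()
    return samples[int(alpha / 2 * n_boot)], samples[int((1 - alpha / 2) * n_boot) - 1]

# Toy example (same shape as the kappa sketch above).
a = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg"]
b = ["pos", "neg", "neg", "neg", "pos", "neg", "pos", "pos"]
lo, hi = bootstrap_ci(a, b)
print(f"95% bootstrap CI for percent agreement: [{lo:.3f}, {hi:.3f}]")
```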