Explain annotation agreement and LLM vs human judges
Company: Apple
Role: Machine Learning Engineer
Category: Machine Learning
Difficulty: hard
Interview Round: Technical Screen
Define annotation agreement rate in the context of labeling and model evaluation. Explain how to measure agreement using common metrics (e.g., Cohen’s kappa for two raters, Fleiss’ kappa for many raters, and Krippendorff’s alpha for varied data types) and what each corrects for. Discuss limitations including class imbalance and prevalence effects, skewed label distributions, ordinal vs nominal scales, multi-label settings, annotator bias and expertise variance, and ambiguity in guidelines. Compare the pros and cons of using humans versus large language models (LLMs) as judges across cost, speed, consistency, bias/fairness, domain expertise, privacy/security, and transparency. Propose concrete methods to improve LLM-as-judge performance: design clear rubrics and scoring scales, provide few-shot exemplars and reference answers, use structured outputs, calibrate temperature and randomness, apply self-consistency or multi-pass adjudication, add gold checks and disagreement flags, perform periodic human spot-audits, and report reliability with confidence intervals.
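A minimal sketch of how the agreement metrics named above are typically computed, assuming scikit-learn and statsmodels are available; the rater labels are made-up toy data for illustration only.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Toy sentiment labels (0 = negative, 1 = neutral, 2 = positive) from three raters on ten items.
rater_a = [2, 0, 2, 1, 2, 0, 2, 2, 1, 0]
rater_b = [2, 0, 1, 1, 2, 0, 2, 1, 1, 0]
rater_c = [2, 2, 2, 1, 2, 0, 0, 2, 1, 0]

# Raw agreement rate: fraction of items two raters label identically.
raw_ab = np.mean(np.array(rater_a) == np.array(rater_b))

# Cohen's kappa corrects raw agreement for the chance agreement implied by
# each rater's marginal label distribution (two raters, nominal labels).
kappa_ab = cohen_kappa_score(rater_a, rater_b)

# For ordinal scales, weighted kappa penalizes near-misses less than far misses.
kappa_ab_w = cohen_kappa_score(rater_a, rater_b, weights="quadratic")

# Fleiss' kappa extends chance correction to many raters; it expects an
# (n_items x n_categories) table of label counts per item.
labels = np.array([rater_a, rater_b, rater_c]).T   # shape: (n_items, n_raters)
counts, _categories = aggregate_raters(labels)     # shape: (n_items, n_categories)
kappa_fleiss = fleiss_kappa(counts)

print(f"raw agreement A-B:   {raw_ab:.2f}")
print(f"Cohen's kappa A-B:   {kappa_ab:.2f}")
print(f"weighted kappa A-B:  {kappa_ab_w:.2f}")
print(f"Fleiss' kappa A-B-C: {kappa_fleiss:.2f}")

# Krippendorff's alpha (third-party `krippendorff` package) additionally handles
# missing labels and nominal/ordinal/interval distance functions.
```

Chance correction is the point of the kappa family: with a heavily skewed label distribution, raw agreement can look high even when raters add little information beyond the majority class, which is the prevalence effect the question asks about.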
Quick Answer: Annotation agreement rate measures how consistently independent annotators assign the same label to the same item. Chance-corrected metrics, Cohen's kappa for two raters, Fleiss' kappa for many raters, and Krippendorff's alpha for varied data types and missing labels, are preferred over raw agreement because skewed label distributions inflate the raw rate. Human judges bring domain expertise, accountability, and nuance at higher cost and latency; LLM judges are cheaper, faster, and more consistent but require clear rubrics, few-shot exemplars, structured outputs, self-consistency or multi-pass adjudication, gold checks with disagreement flags, and periodic human spot-audits, with reliability reported alongside confidence intervals.
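As a rough illustration of the LLM-as-judge improvements listed above, the sketch below assumes a hypothetical `call_judge` placeholder for whatever model client is in use; the 1-3 rubric and the JSON output schema are likewise assumptions, not a specific provider's API.

```python
import json
import random
from collections import Counter

def call_judge(prompt: str) -> str:
    """Hypothetical LLM call -- replace with your provider's client.
    Assumed to return the model's raw text completion."""
    raise NotImplementedError

RUBRIC = """You are grading an answer for factual accuracy.
Score 1 = incorrect, 2 = partially correct, 3 = fully correct.
Return ONLY JSON: {"score": <1|2|3>, "reason": "<one sentence>"}"""

def judge_once(question: str, answer: str) -> int:
    """One structured-output judging pass against the rubric."""
    prompt = f"{RUBRIC}\n\nQuestion: {question}\nAnswer: {answer}"
    return int(json.loads(call_judge(prompt))["score"])

def judge_self_consistent(question: str, answer: str, passes: int = 5) -> dict:
    """Self-consistency: run several passes (with some sampling randomness on
    the provider side), take the majority score, and flag items where the
    judge disagrees with itself so a human can adjudicate them."""
    scores = [judge_once(question, answer) for _ in range(passes)]
    majority, votes = Counter(scores).most_common(1)[0]
    return {"score": majority, "flag_for_human": votes < passes}

def bootstrap_ci(per_item_agreement, n_boot=2000, seed=0):
    """Report reliability with a bootstrap confidence interval over items,
    e.g. for LLM-vs-gold agreement (0/1 per item) on a human-labeled audit set."""
    rng = random.Random(seed)
    n = len(per_item_agreement)
    stats = sorted(
        sum(rng.choices(per_item_agreement, k=n)) / n for _ in range(n_boot)
    )
    return stats[int(0.025 * n_boot)], stats[int(0.975 * n_boot)]
```

In this framing, items flagged by `flag_for_human` and a small random sample of unflagged items feed the periodic spot-audits, and the audit-set agreement with its bootstrap interval is what gets reported as the judge's reliability.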