PracHub

Explain annotation agreement and LLM vs human judges

Last updated: Jun 25, 2026

Quick Overview

This question tests a machine learning engineer's understanding of annotation reliability metrics and the trade-offs between human and LLM-based evaluation. It assesses conceptual and applied knowledge of inter-annotator agreement, metric selection across label types, and the validity of automated judges in model evaluation workflows.


Company: Apple

Role: Machine Learning Engineer

Category: Machine Learning

Difficulty: hard

Interview Round: Technical Screen

Define annotation agreement rate in the context of labeling and model evaluation. Explain how to measure agreement using common metrics (e.g., Cohen’s kappa for two raters, Fleiss’ kappa for many raters, and Krippendorff’s alpha for varied data types) and what each corrects for. Discuss limitations including class imbalance and prevalence effects, skewed label distributions, ordinal vs nominal scales, multi-label settings, annotator bias and expertise variance, and ambiguity in guidelines. Compare the pros and cons of using humans versus large language models (LLMs) as judges across cost, speed, consistency, bias/fairness, domain expertise, privacy/security, and transparency. Propose concrete methods to improve LLM-as-judge performance: design clear rubrics and scoring scales, provide few-shot exemplars and reference answers, use structured outputs, calibrate temperature and randomness, apply self-consistency or multi-pass adjudication, add gold checks and disagreement flags, perform periodic human spot-audits, and report reliability with confidence intervals.

Aug 13, 2025, 12:00 AM

Annotation Agreement and LLM-vs-Human Judges

Context

You are on a model-evaluation team. Datasets are labeled, and model outputs are scored, by humans or by LLMs acting as judges. Before trusting any of those labels for training or evaluation, you need to quantify how consistently annotators agree, understand where those agreement metrics mislead, and decide when an LLM judge is an acceptable substitute for a human one — and how to make the LLM judge more reliable.

This is a conceptual / applied-ML discussion question with several parts. Work through each Part in order.

Constraints & Assumptions

  • Labels span the realistic range of evaluation tasks: binary (e.g. pass/fail), multi-class nominal (e.g. topic), ordinal (e.g. a 1–5 quality rating), and multi-label (a set of tags per item).
  • The number of annotators per item may vary, and some items may be missing labels from some annotators.
  • "Annotator" can mean a human rater or an LLM judge; the agreement machinery is the same in both cases.
  • Assume you can collect a pilot of a few hundred labeled items to estimate reliability before scaling.

Clarifying Questions to Ask

A strong candidate scopes the problem before diving in. Reasonable questions include:

  • What is the label type and scale (binary, nominal multi-class, ordinal, interval, or multi-label set)? This drives the metric choice.
  • How many annotators rate each item, and is it the same set of annotators for every item or a rotating pool?
  • Are labels complete, or are there missing ratings (which rules some metrics in or out)?
  • What is the class prevalence / how skewed is the label distribution? (Affects how to interpret chance-corrected metrics.)
  • What is agreement being used for — gating a training set, comparing two models, or auditing an LLM judge against humans?
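The answers to these questions can be turned into a simple lookup from data shape to a default coefficient. The sketch below is one reasonable mapping, not a rule: the function name `suggest_metric`, its parameters, and the returned strings are all hypothetical, and "varying numbers of raters" is folded into the `has_missing` flag.

```python
def suggest_metric(n_raters, scale, has_missing=False):
    """Map the data shape to a reasonable default agreement coefficient.

    Hypothetical helper for illustration; the mapping is a common convention,
    not the only defensible choice.
    """
    scales = {"binary", "nominal", "ordinal", "interval", "multi-label"}
    if scale not in scales:
        raise ValueError(f"unknown scale: {scale!r}")
    if n_raters < 2:
        raise ValueError("agreement needs at least two raters per item")

    # A set of tags per item needs per-label or set-level treatment.
    if scale == "multi-label":
        return "per-label kappa, or a set-similarity score such as Jaccard"

    # Missing ratings or uneven rater counts rule out the kappa family.
    if has_missing:
        distance = "nominal" if scale == "binary" else scale
        return f"Krippendorff's alpha ({distance} distance)"

    if n_raters == 2:
        if scale in {"ordinal", "interval"}:
            return "weighted Cohen's kappa"
        return "Cohen's kappa"

    # Three or more raters with complete data.
    if scale in {"binary", "nominal"}:
        return "Fleiss' kappa"
    return f"Krippendorff's alpha ({scale} distance)"


print(suggest_metric(2, "binary"))
print(suggest_metric(5, "nominal"))
print(suggest_metric(3, "ordinal", has_missing=True))
```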

Part 1 — Define the annotation agreement rate

Define what "annotation agreement rate" means in the context of labeling datasets and evaluating models. Distinguish raw (observed) agreement from chance-corrected agreement, and explain why we care about the distinction.

What This Part Should Cover

  • A clear definition tied to label reliability (reproducibility of labels), not label validity (correctness).
  • The observed-vs-chance distinction and why raw percent agreement is misleading on its own.
  • What high vs low agreement implies about guidelines, ambiguity, and trustworthiness for downstream training/eval.
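The observed-versus-chance distinction can be written compactly for the two-annotator case. The symbols are assumptions introduced here for illustration: $a_i$ and $b_i$ are the two annotators' labels on item $i$, $N$ is the number of items, and $p_{A,k}$, $p_{B,k}$ are each annotator's marginal rate of using category $k$.

```latex
P_o = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}[a_i = b_i],
\qquad
P_e = \sum_{k} p_{A,k} \, p_{B,k},
\qquad
\kappa = \frac{P_o - P_e}{1 - P_e}
```

Here $P_o$ is the raw agreement rate and $P_e$ is the agreement two independent annotators with those marginals would reach by chance. $\kappa = 1$ means perfect agreement, $\kappa = 0$ means agreement no better than chance, and a high $P_o$ can coexist with a low $\kappa$ when one category dominates.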

Part 2 — Measure agreement with the right metric

Explain how to measure agreement using common metrics, and state precisely what each one corrects for and when to use it:

  • Cohen's kappa — two raters.
  • Fleiss' kappa — many raters.
  • Krippendorff's alpha — varied data types (nominal/ordinal/interval), missing labels, varying numbers of raters.

For each, say what inputs it needs, what it corrects for, and its main limitation. Note how to handle ordinal scales (where a near-miss should be penalized less than a far-miss).
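A minimal pure-Python sketch of the two kappa coefficients follows. The function names and signatures are illustrative choices, not a library API. `cohen_kappa` uses disagreement weights, so `weights=None` gives the usual unweighted kappa and `"linear"` or `"quadratic"` penalize an ordinal near-miss less than a far-miss; `fleiss_kappa` assumes every item was rated by the same number of raters.

```python
from collections import Counter


def cohen_kappa(a, b, categories=None, weights=None):
    """Cohen's kappa for two raters; weights is None, "linear" or "quadratic".

    For weighted kappa, pass `categories` in their ordinal order.
    """
    if weights not in (None, "linear", "quadratic"):
        raise ValueError("weights must be None, 'linear' or 'quadratic'")
    if len(a) != len(b) or not a:
        raise ValueError("need two equal-length, non-empty label lists")
    cats = list(categories) if categories is not None else sorted(set(a) | set(b))
    idx = {c: i for i, c in enumerate(cats)}
    k, n = len(cats), len(a)

    def disagreement(i, j):
        if weights is None:
            return 0.0 if i == j else 1.0
        d = abs(i - j) / (k - 1) if k > 1 else 0.0
        return d if weights == "linear" else d * d

    count_a = Counter(idx[x] for x in a)
    count_b = Counter(idx[x] for x in b)
    observed = sum(disagreement(idx[x], idx[y]) for x, y in zip(a, b)) / n
    expected = sum(
        disagreement(i, j) * count_a[i] * count_b[j]
        for i in range(k) for j in range(k)
    ) / (n * n)
    if expected == 0:
        return float("nan")  # both raters used a single category: undefined
    return 1.0 - observed / expected


def fleiss_kappa(counts):
    """counts[i][j] = number of raters who assigned item i to category j."""
    n_items, n_raters = len(counts), sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    # Mean per-item agreement: share of rater pairs that agree on the item.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items
    # Chance agreement from the pooled category proportions.
    p_e = sum((sum(col) / (n_items * n_raters)) ** 2 for col in zip(*counts))
    return float("nan") if p_e == 1 else (p_bar - p_e) / (1 - p_e)


# Ordinal 1-3 ratings with a single near-miss on the last item.
rater_a, rater_b = [1, 2, 3, 1], [1, 2, 3, 2]
print(cohen_kappa(rater_a, rater_b))                       # about 0.636
print(cohen_kappa(rater_a, rater_b, weights="quadratic"))  # about 0.8
```

The weighted score is higher than the unweighted one here because the only disagreement is between adjacent categories, which is the behavior an ordinal scale calls for.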

What This Part Should Cover

  • Correct mapping of metric → use case (two raters / many raters / mixed-or-missing data).
  • For each metric: what $P_e$ / $D_e$ corrects for and the inputs required.
  • Awareness of weighted kappa / ordinal distance for ordinal labels.
  • Reporting reliability with uncertainty (e.g. bootstrap confidence intervals), not just a point estimate.
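For the mixed-or-missing case, Krippendorff's alpha can be sketched via its coincidence matrix, with $\alpha = 1 - D_o / D_e$. The version below covers only the nominal distance (any mismatch costs 1); the function name and the convention that `None` marks a missing rating are assumptions made for this example.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Nominal Krippendorff's alpha.

    units: one list of ratings per item, with None for a missing rating.
    Items with fewer than two ratings cannot be paired and are skipped.
    """
    coincidences = Counter()
    for ratings in units:
        values = [v for v in ratings if v is not None]
        m = len(values)
        if m < 2:
            continue
        # Every ordered pair of ratings from different raters contributes
        # 1 / (m - 1), so each item carries total weight m.
        for x, y in permutations(values, 2):
            coincidences[(x, y)] += 1.0 / (m - 1)

    totals = Counter()
    for (x, _), weight in coincidences.items():
        totals[x] += weight
    n = sum(totals.values())
    if n <= 1:
        return float("nan")

    observed = sum(w for (x, y), w in coincidences.items() if x != y)
    expected = sum(
        totals[x] * totals[y] for x in totals for y in totals if x != y
    ) / (n - 1)
    if expected == 0:
        return float("nan")  # only one category was ever used
    return 1.0 - observed / expected


# Three raters, four items, two missing ratings.
ratings = [
    ["a", "a", None],
    ["b", "b", "b"],
    ["a", "b", "a"],
    [None, "b", "b"],
]
print(krippendorff_alpha_nominal(ratings))
```

Ordinal or interval data would swap the 0/1 mismatch for a distance between categories inside both sums; that extension is not shown here.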

Part 3 — Limitations and pitfalls

Discuss the limitations and failure modes of agreement metrics, and how to mitigate each. Cover at least:

  • Class imbalance and prevalence effects (the "kappa paradox").
  • Skewed label distributions.
  • Ordinal vs nominal scales.
  • Multi-label settings (a set of labels per item).
  • Annotator bias and expertise variance.
  • Ambiguity in labeling guidelines.
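The prevalence effect is easiest to see with numbers. The sketch below takes a 2x2 agreement table for two raters on a binary task and reports Cohen's kappa next to several supplementary figures (PABAK, Gwet's AC1 in its two-category form, and positive/negative specific agreement). The function name and the example counts are invented for illustration.

```python
def binary_agreement_report(both_pos, a_only, b_only, both_neg):
    """Agreement figures for two raters on a binary task, from a 2x2 table."""
    nan = float("nan")
    n = both_pos + a_only + b_only + both_neg
    p_o = (both_pos + both_neg) / n
    p_a = (both_pos + a_only) / n  # rater A's positive rate
    p_b = (both_pos + b_only) / n  # rater B's positive rate

    # Cohen's chance term uses each rater's own marginals.
    p_e = p_a * p_b + (1 - p_a) * (1 - p_b)
    # Gwet's AC1 chance term (two categories) uses the mean positive rate.
    pi = (p_a + p_b) / 2
    p_e_ac1 = 2 * pi * (1 - pi)

    disagreements = a_only + b_only
    pos_den = 2 * both_pos + disagreements
    neg_den = 2 * both_neg + disagreements
    return {
        "observed": p_o,
        "kappa": (p_o - p_e) / (1 - p_e) if p_e < 1 else nan,
        "pabak": 2 * p_o - 1,
        "gwet_ac1": (p_o - p_e_ac1) / (1 - p_e_ac1),
        "positive_agreement": 2 * both_pos / pos_den if pos_den else nan,
        "negative_agreement": 2 * both_neg / neg_den if neg_den else nan,
    }


# 100 items, heavily negative: the raters agree on 95 negatives and never
# agree on a positive.
for name, value in binary_agreement_report(0, 3, 2, 95).items():
    print(f"{name}: {value:.3f}")
```

With these counts observed agreement is 0.95 while kappa is slightly below zero, because almost all of that agreement is what two raters who nearly always say "negative" would reach by chance. The zero positive-specific agreement is the figure that exposes the real problem: the raters never agree on the rare class.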

What This Part Should Cover

  • Why high prevalence depresses/destabilizes kappa, and prevalence-robust alternatives or supplementary reporting.
  • Correct treatment of ordinal vs nominal, and of multi-label items (per-label aggregation vs set-similarity).
  • How annotator-quality variance is handled (e.g. model-based estimation of true labels and rater reliability).
  • The link between guideline ambiguity and low agreement, and how to fix it at the guideline level.
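For multi-label items, the two treatments named above can be compared side by side. This is a sketch for two annotators only; the function names are hypothetical, and scoring two empty label sets as full agreement is a convention chosen here, not a standard.

```python
def mean_jaccard(sets_a, sets_b):
    """Set-similarity view: average Jaccard overlap of the two label sets."""
    scores = []
    for a, b in zip(sets_a, sets_b):
        union = a | b
        # Assumption: two empty sets count as perfect agreement.
        scores.append(1.0 if not union else len(a & b) / len(union))
    return sum(scores) / len(scores)


def per_label_kappa(sets_a, sets_b, labels):
    """Per-label view: one binary Cohen's kappa per tag (present vs absent)."""
    n = len(sets_a)
    result = {}
    for label in labels:
        in_a = [label in s for s in sets_a]
        in_b = [label in s for s in sets_b]
        p_o = sum(x == y for x, y in zip(in_a, in_b)) / n
        p_a, p_b = sum(in_a) / n, sum(in_b) / n
        p_e = p_a * p_b + (1 - p_a) * (1 - p_b)
        result[label] = float("nan") if p_e == 1 else (p_o - p_e) / (1 - p_e)
    return result


annotator_a = [{"x", "y"}, {"x"}, set(), {"y"}]
annotator_b = [{"x"}, {"x"}, set(), {"x", "y"}]
print(mean_jaccard(annotator_a, annotator_b))                 # 0.75
print(per_label_kappa(annotator_a, annotator_b, ["x", "y"]))
```

The per-label view shows which tags are unreliable but ignores co-occurrence; the set view gives partial credit per item but is not chance-corrected on its own.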

Part 4 — Humans vs LLMs as judges

Compare the pros and cons of using humans versus LLMs as judges. Cover each of these dimensions: cost, speed, consistency, bias/fairness, domain expertise, privacy/security, and transparency. Conclude with when you'd choose each, or a hybrid.

What This Part Should Cover

  • A balanced comparison across all seven dimensions (not just cost/speed).
  • Specific LLM-judge failure modes (prompt brittleness, position/verbosity/self-preference bias, unfaithful rationales) and specific human failure modes (drift, fatigue, individual bias).
  • A defensible recommendation, typically a hybrid (LLM first-pass + human spot-audit/adjudication).
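Position bias in a pairwise LLM judge can be measured directly by asking each comparison twice with the answers swapped. In the sketch below, `judge` is a hypothetical callable that takes `(prompt, first, second)` and returns `"first"` or `"second"`; the two stub judges stand in for a real model call.

```python
def position_flip_rate(judge, pairs):
    """Share of comparisons whose winner changes when answer order is swapped.

    judge(prompt, first, second) -> "first" or "second"  (hypothetical interface)
    pairs: list of (prompt, answer_a, answer_b) tuples.
    """
    flips = 0
    for prompt, answer_a, answer_b in pairs:
        forward = judge(prompt, answer_a, answer_b)
        backward = judge(prompt, answer_b, answer_a)
        winner_forward = "a" if forward == "first" else "b"
        winner_backward = "b" if backward == "first" else "a"
        flips += winner_forward != winner_backward
    return flips / len(pairs)


def always_first(prompt, first, second):
    """Stub judge with pure position bias."""
    return "first"


def longer_wins(prompt, first, second):
    """Stub judge with verbosity bias but no position bias."""
    return "first" if len(first) > len(second) else "second"


pairs = [
    ("Explain kappa.", "Short.", "A much longer and more detailed answer."),
    ("Define alpha.", "A fairly long answer with detail.", "Brief."),
]
print(position_flip_rate(always_first, pairs))  # 1.0
print(position_flip_rate(longer_wins, pairs))   # 0.0
```

A flip rate of zero does not mean the judge is unbiased: the second stub is perfectly order-consistent and still prefers longer answers, so verbosity and self-preference need their own checks.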

Part 5 — Improving LLM-as-judge performance

Propose concrete, actionable methods to improve LLM-as-judge reliability. Address each of:

  • Design clear rubrics and scoring scales.
  • Provide few-shot exemplars and reference answers.
  • Use structured outputs.
  • Calibrate temperature / randomness.
  • Apply self-consistency or multi-pass adjudication.
  • Add gold checks and disagreement flags.
  • Perform periodic human spot-audits.
  • Report reliability with confidence intervals.
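Several of these levers combine naturally: structured outputs, self-consistency, and disagreement flags. The sketch below assumes the judge has already been sampled several times and returned raw strings; the JSON fields `score` and `rationale`, the 1-5 scale, and the `min_share` threshold are illustrative assumptions, not a fixed schema.

```python
import json
from collections import Counter


def parse_verdict(raw, scale=(1, 5)):
    """Return the integer score from a JSON verdict, or None if it is invalid."""
    try:
        obj = json.loads(raw)
    except (ValueError, TypeError):
        return None
    if not isinstance(obj, dict):
        return None
    score = obj.get("score")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int):
        return None
    if not scale[0] <= score <= scale[1]:
        return None
    if not isinstance(obj.get("rationale"), str):
        return None
    return score


def adjudicate(raw_samples, min_share=0.6):
    """Majority vote over repeated judge samples, with a human-review flag."""
    scores = [s for s in map(parse_verdict, raw_samples) if s is not None]
    if not scores:
        return {"score": None, "share": 0.0, "needs_human": True}
    top_score, count = Counter(scores).most_common(1)[0]
    # Invalid outputs stay in the denominator, so they lower the share.
    share = count / len(raw_samples)
    return {"score": top_score, "share": share, "needs_human": share < min_share}


samples = [
    '{"score": 4, "rationale": "Accurate and complete."}',
    '{"score": 4, "rationale": "Mostly correct."}',
    '{"score": 2, "rationale": "Misses the main point."}',
    "The answer deserves a four.",              # not JSON
    '{"score": 9, "rationale": "Off scale."}',  # outside the rubric
]
print(adjudicate(samples))
```

Items flagged `needs_human` are the natural queue for the spot-audits listed above, and known-answer gold items can be run through the same function to check that the judge still reproduces their expected scores.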

What This Part Should Cover

  • Concrete techniques for each listed lever (rubric anchoring, few-shot, JSON schema, temperature, self-consistency/pairwise, gold checks, audits).
  • The insight that you can quantify LLM-judge quality by treating it as a rater and computing agreement vs humans with confidence intervals.
  • Operational guardrails (prompt/model versioning, PII handling, drift monitoring).
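Treating the LLM judge as one more rater, its agreement with human labels can be reported as a point estimate plus a percentile bootstrap interval over items. This is a minimal sketch: the function names, the 1,000 resamples, and the 95% level are assumptions, and resamples where kappa is undefined (only one category drawn) are simply skipped.

```python
import random
from collections import Counter


def cohen_kappa(a, b):
    """Unweighted Cohen's kappa for two equal-length label lists."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return float("nan") if p_e == 1 else (p_o - p_e) / (1 - p_e)


def bootstrap_kappa_ci(human, judge, reps=1000, alpha=0.05, seed=0):
    """Return (kappa, lower, upper) using a percentile bootstrap over items."""
    rng = random.Random(seed)
    n = len(human)
    stats = []
    for _ in range(reps):
        # Resample items with replacement, keeping each pair of labels together.
        idx = [rng.randrange(n) for _ in range(n)]
        k = cohen_kappa([human[i] for i in idx], [judge[i] for i in idx])
        if k == k:  # NaN is the only value not equal to itself
            stats.append(k)
    point = cohen_kappa(human, judge)
    if not stats:
        return point, float("nan"), float("nan")
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return point, lower, upper


# Toy pilot: 40 items, the judge disagrees with the human on every fifth one.
human = ["pass", "fail"] * 20
flip = {"pass": "fail", "fail": "pass"}
judge = [flip[h] if i % 5 == 0 else h for i, h in enumerate(human)]
point, lower, upper = bootstrap_kappa_ci(human, judge)
print(f"kappa = {point:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
```

A useful benchmark for the result is the human-human kappa on the same items with its own interval: a judge whose interval overlaps the human-human range is agreeing with people about as well as they agree with each other.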

What a Strong Answer Covers

Across all parts, a strong answer demonstrates these cross-cutting qualities:

  • Reliability vs validity discipline — never conflates "annotators agree" with "the label is correct," and keeps that distinction visible through the metric, the pitfalls, and the LLM-judge discussion.
  • Metric–task fit — chooses coefficients from the data shape (raters, label scale, missingness) rather than defaulting to one, and always pairs a point estimate with uncertainty.
  • Honest about failure modes — surfaces the kappa paradox, prevalence effects, and LLM-judge biases instead of presenting metrics or LLM judges as turnkey.
  • Operational closure — ends with a workable workflow (pilot → metric choice → guideline/rubric refinement → hybrid QA → report with CIs) that someone could actually run.

Follow-up Questions

  • Your team labels a binary task where 95% of items are the negative class. Cohen's kappa comes out near 0 despite 95% observed agreement. What's happening, and what would you report instead?
  • You want to know whether an LLM judge is "good enough" to replace one of two human raters. Exactly what would you measure, and what threshold would convince you?
  • Pairwise preference judging (A vs B) vs absolute 1–5 scoring: when is each more reliable for evaluating two models, and how would you aggregate pairwise results into a ranking?
  • An LLM judge gives consistent scores but its written rationales don't actually justify those scores. Why is this dangerous, and how would you detect and handle it?
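For the pairwise-aggregation follow-up, one common option is to fit a Bradley-Terry model to the win counts and rank models by fitted strength. The sketch below uses the standard iterative update; the function name, the fixed iteration count, and the example counts are assumptions, and it presumes every model has at least one win and that the comparison graph is connected.

```python
def bradley_terry(wins, iterations=200):
    """Fit Bradley-Terry strengths from pairwise preference counts.

    wins[(i, j)] = number of times model i was preferred over model j.
    Returns strengths normalized to sum to 1.
    """
    models = sorted({m for pair in wins for m in pair})
    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for i in models:
            total_wins = sum(wins.get((i, j), 0) for j in models if j != i)
            # Each opponent contributes (games played) / (sum of strengths).
            denom = sum(
                (wins.get((i, j), 0) + wins.get((j, i), 0))
                / (strength[i] + strength[j])
                for j in models if j != i
            )
            updated[i] = total_wins / denom if denom > 0 else strength[i]
        norm = sum(updated.values())
        strength = {m: value / norm for m, value in updated.items()}
    return strength


# Invented counts from 10 judged comparisons per pair of models.
wins = {
    ("A", "B"): 8, ("B", "A"): 2,
    ("B", "C"): 8, ("C", "B"): 2,
    ("A", "C"): 9, ("C", "A"): 1,
}
strengths = bradley_terry(wins)
ranking = sorted(strengths, key=strengths.get, reverse=True)
print(ranking)  # ['A', 'B', 'C']
```

The fitted strengths also give a predicted preference probability for any pair, `strength[i] / (strength[i] + strength[j])`, which makes it possible to rank models that were never compared head to head.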