Label Quality, Human Annotation, And Statistical Inference
Asked of: ML Engineer
What's being tested
Candidates must show practical mastery of label quality and statistical inference in the ML training loop: modeling noisy multi-annotator labels, choosing aggregation and training targets, designing evaluation that accounts for annotation uncertainty, and reasoning about tradeoffs between label-estimation complexity and model performance. Interviewers probe whether the engineer can operationalize probabilistic label models, build noise-robust training pipelines, and produce audit-ready metrics with appropriate confidence bounds.
Core knowledge
- Multi-annotator generative models (Dawid–Skene): model each annotator a with a confusion matrix π^a_{k,l} = P(response=l | true=k). Fit with EM: the E-step computes P(true=k | responses), and the M-step updates π from normalized expected counts. Watch for non-identifiability when no anchor (gold) labels are available.
- Majority vs. probabilistic aggregation: majority vote is simple but biased when annotator competence varies; soft labels from Dawid–Skene EM or a Bayesian posterior improve training on small-to-moderate datasets.
- Bayesian labeling for binary tasks: with a Beta(α,β) prior and k positives in n noisy observations, the posterior is Beta(α+k, β+n−k). The Bernoulli MLE is \hat{p} = k/n, with asymptotic variance p(1−p)/n.
- Confidence intervals for proportions: prefer Wilson or Agresti–Coull intervals over Wald for small samples or probabilities near 0 or 1; bootstrap CIs are useful when the label-noise model is complex.
- Noise-robust losses & training hacks: use label smoothing, bootstrapped cross-entropy, symmetric losses (e.g., a mean-absolute-error surrogate), or co-teaching to mitigate noisy labels during SGD.
- Train/eval split with annotator noise: compute aggregated labels per example before splitting; create a held-out gold set labeled by trusted experts for unbiased evaluation and calibration.
- Annotator calibration & metadata: track per-annotator stats (accuracy, bias, confusion) with shrinkage priors; use these to weight votes or to route hard examples to experts through a human-in-the-loop workflow.
- Evaluation metrics under uncertainty: report point estimates and confidence intervals for precision/recall/F1; use precision-at-K or stratified PR curves when class skew and cost asymmetry exist.
- Sampling and audit design: compute sample sizes for annotator-accuracy or model false-negative-rate audits using binomial variance; for target margin ε at confidence 1−α, n ≈ z_{1−α/2}^2 p(1−p)/ε^2.
- Active/data-selection strategies: route high-disagreement examples to additional annotators or to experts; consider annotator-aware active learning criteria (entropy weighted by annotator reliability).
- Identifiability and anchoring: EM on annotator models can converge to permuted or degenerate solutions (for example, with class labels swapped); include a small gold anchor set or priors on annotator quality to pin down the label assignment.
- Operational constraints: balance labeling cost, latency, throughput, and model-update frequency; probabilistic labels help with low-volume, high-quality training data, while majority vote may suffice when scale is huge.
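The closed-form quantities in the list above (the Beta posterior update, the Wilson interval, and the audit sample size) are short enough to write out. The following is a minimal Python sketch using only the standard library; the function names are illustrative and not taken from any particular toolkit.

```python
import math
from statistics import NormalDist


def beta_posterior(alpha, beta, k, n):
    """Beta(alpha, beta) prior plus k positives in n trials -> posterior parameters."""
    return alpha + k, beta + n - k


def wilson_interval(k, n, confidence=0.95):
    """Wilson score interval for a binomial proportion with k successes in n trials."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = k / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


def audit_sample_size(p, margin, confidence=0.95):
    """n ~= z^2 * p * (1 - p) / margin^2, rounded up to a whole example."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z**2 * p * (1 - p) / margin**2)
```

For instance, `audit_sample_size(0.5, 0.05)` returns 385, the familiar worst-case sample size for a ±5% margin at 95% confidence. The Wilson interval stays inside [0, 1] even when k = 0, which is where the Wald interval collapses to zero width.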
Worked example — Improve classifier with noisy multi-annotator labels
First 30s: ask how many annotators label each example, whether annotator IDs and confidences are available, what the class balance is, and whether a gold set exists.
Skeleton answer pillars: (1) estimate true labels via probabilistic aggregation (Dawid–Skene EM or a Bayesian model), producing a per-example posterior P(true=1); (2) design dataset splits using aggregated labels and reserve a trusted, held-out gold set for final evaluation; (3) choose a training target (soft probabilities vs. hardened labels) and a noise-robust loss (bootstrapped cross-entropy or label smoothing); (4) monitor per-annotator performance and disagreement to route ambiguous examples for relabeling or expert adjudication.
Explicit tradeoff: probabilistic aggregation gives better signal for small datasets but increases complexity, latency, and the risk of model mis-specification; majority vote is cheap and often adequate at very large scale.
Close: with more time, propose active learning to selectively re-annotate high-uncertainty examples, and run an ablation comparing a model trained on majority labels against one trained on probabilistic soft labels, evaluated on the gold set.
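Pillar (1) can be sketched as a compact Dawid–Skene EM in NumPy. This is an illustrative sketch, not a reference implementation: the input format (a list of `(item, annotator, label)` integer triples), the function name, and the `smoothing` pseudo-count are assumptions, and every item is assumed to have at least one label.

```python
import numpy as np


def dawid_skene(triples, n_items, n_annotators, n_classes, n_iter=50, smoothing=0.01):
    """Return (posteriors, confusion matrices, class prior) fitted by EM."""
    items, annots, labels = (np.asarray(col) for col in zip(*triples))

    # Initialise posteriors with per-item vote fractions (a soft majority vote).
    post = np.zeros((n_items, n_classes))
    np.add.at(post, (items, labels), 1.0)
    post /= post.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # M-step: class prior and confusion matrices from expected counts.
        prior = post.mean(axis=0)
        conf = np.full((n_annotators, n_classes, n_classes), smoothing)
        for k in range(n_classes):
            # conf[a, k, l] accumulates P(true=k | item) over responses l by annotator a.
            np.add.at(conf[:, k, :], (annots, labels), post[items, k])
        conf /= conf.sum(axis=2, keepdims=True)

        # E-step: P(true=k | responses) is proportional to prior_k times the
        # product of conf[a, k, l] over the item's responses (done in log space).
        log_post = np.tile(np.log(prior + 1e-12), (n_items, 1))
        np.add.at(log_post, items, np.log(conf[annots, :, labels]))
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)

    return post, conf, prior
```

For a binary task, `post[:, 1]` can serve directly as the soft training target, and `post.argmax(axis=1)` gives hardened labels. Initialising from vote fractions tends to keep the class indices aligned with the annotators' label names, but it does not replace gold anchors when most annotators are unreliable.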
A second angle — Design a harmful video content moderation system
Same label-quality principles apply but different constraints: multi-modal inputs (video+audio+text) increase ambiguity, so disagreement and annotator calibration become more important. Use hierarchical taxonomy and per-example multi-stage labeling: lightweight crowd labeling for surface features, then expert adjudication for borderline or high-risk cases. Aggregation should be annotator-aware; store per-annotator confusion matrices by content subtype (e.g., violent vs sexual). Operational decisions—latency, throughput, and auditability—force hybrid policies: an automated model with conservative thresholds routes uncertain or high-severity content to human reviewers, and all removals have traceable labels and gold anchors for legal/audit needs.
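The hybrid policy described above could look roughly like the following Python sketch. The threshold values, the subtype names treated as high severity, and the `RoutingPolicy` and `route` names are all hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingPolicy:
    # Hypothetical values; in practice these would be tuned on a gold set.
    remove_threshold: float = 0.98   # conservative bar for automated removal
    review_threshold: float = 0.30   # below this the content is allowed
    high_severity: frozenset = frozenset({"violent", "sexual"})


def route(policy, score, subtype):
    """Map a model score and content subtype to an action."""
    # High-severity content is never removed automatically: a human decides.
    if subtype in policy.high_severity and score >= policy.review_threshold:
        return "human_review"
    if score >= policy.remove_threshold:
        return "auto_remove"
    if score >= policy.review_threshold:
        return "human_review"  # uncertain band
    return "allow"
```

Keeping the policy in an immutable object makes it easy to log the exact thresholds alongside each decision, which helps with the traceability requirement.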
Common pitfalls
Pitfall: Analytic mistake — aggregating labels after splitting or leaking information from the evaluation set.
If annotator weights or confusion matrices are fitted on data that includes examples from the test partition, the resulting evaluation estimates are optimistic. Fit the aggregation model on training examples only, and evaluate against a separate gold test set.
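One way to keep evaluation data out of the aggregation step is to estimate annotator weights from training items only. The Python sketch below assumes votes are stored as a hypothetical `{item_id: {annotator_id: label}}` mapping and uses agreement with the per-item majority as a simple stand-in for a full annotator model.

```python
from collections import Counter, defaultdict


def annotator_weights(votes, train_ids):
    """Agreement rate with the per-item majority, computed on training items only."""
    agree, total = defaultdict(int), defaultdict(int)
    for item in train_ids:
        majority = Counter(votes[item].values()).most_common(1)[0][0]
        for annotator, label in votes[item].items():
            agree[annotator] += (label == majority)
            total[annotator] += 1
    # Add-one smoothing shrinks annotators with few labels toward 0.5.
    return {a: (agree[a] + 1) / (total[a] + 2) for a in total}


def weighted_labels(votes, item_ids, weights, default_weight=0.5):
    """Weighted vote per item; annotators unseen in training get default_weight."""
    out = {}
    for item in item_ids:
        scores = defaultdict(float)
        for annotator, label in votes[item].items():
            scores[label] += weights.get(annotator, default_weight)
        out[item] = max(scores, key=scores.get)
    return out
```

Because `annotator_weights` only ever reads `train_ids`, changing or adding votes on held-out items cannot alter the training labels, which is the property the pitfall is about.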
Pitfall: Communication mistake — not clarifying annotation process, label instructions, or cost/latency constraints.
Interviewers expect you to ask about annotator counts, expertise levels, labeling interfaces, and acceptable error tradeoffs; these shape aggregation and routing decisions.
Pitfall: Depth mistake — proposing only majority vote or only training-time fixes without evaluation/anchoring.
Majority vote ignores systematic annotator bias; likewise, training on noisy labels without a held-out gold set gives no reliable estimate of deployed performance.
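To make the gold-set estimate concrete, here is a hedged sketch of a percentile bootstrap confidence interval for precision. The function name and defaults are illustrative, labels are assumed to be 0/1, and the gold set is assumed to contain at least one predicted positive.

```python
import numpy as np


def bootstrap_precision_ci(y_true, y_pred, n_boot=2000, confidence=0.95, seed=0):
    """Return (point estimate, lower, upper) for precision on a gold set."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    rng = np.random.default_rng(seed)
    n = len(y_true)

    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)      # resample examples with replacement
        pred_pos = y_pred[idx] == 1
        if pred_pos.sum() == 0:
            continue                          # precision undefined for this resample
        stats.append((y_true[idx][pred_pos] == 1).mean())

    alpha = 1 - confidence
    lower, upper = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    point = (y_true[y_pred == 1] == 1).mean()
    return point, lower, upper
```

The same resampling loop works for recall or F1 by swapping the statistic, and it can be run once per model to compare a majority-vote baseline against a soft-label model on identical gold examples.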
Connections
Interviewers may pivot to active learning/annotation routing, calibration and uncertainty estimation for model outputs, or A/B testing of labeling schemes and threshold policies. Knowledge of robust optimization and model monitoring for distributional drift is also often relevant.
Further reading
- Dawid & Skene (1979): foundational EM method for modeling annotator confusion.
- Raykar et al., "Learning From Crowds" (2010): practical Bayesian approaches for combining labels and learning classifiers.
- Patrini et al., "Making Deep Neural Networks Robust to Label Noise" (2017): practical loss-correction techniques for noisy labels.
Practice questions
- Improve classifier with noisy multi-annotator labels · OpenAI · Machine Learning Engineer · Technical Screen · hard
- Design a harmful video content moderation system · OpenAI · Machine Learning Engineer · Onsite · hard
- Derive MLE and Bayesian posterior for Bernoulli · OpenAI · Machine Learning Engineer · Onsite · medium
- Build and troubleshoot image classification and backprop · OpenAI · Machine Learning Engineer · Technical Screen · hard