Human Feedback Data Quality

What's being tested

Interviewers look for the candidate's ability to diagnose, quantify, and mitigate data-quality issues in the human feedback used to train or evaluate models (e.g., RLHF, preference datasets, human evaluation labels). Expect probing on experimental design, bias identification, reliability measurement, the effect of label quality on downstream model metrics, and statistical tradeoffs when cleaning, filtering, or reweighting labels. Anthropic cares because low-quality or biased human signals produce misleading metrics, unstable training, and unsafe model behavior; the interviewer is checking that you can convert messy human labels into defensible, reproducible analytic decisions.

Core knowledge

Annotation schema design: Clear, mutually exclusive classes, examples for edge cases, and a decision tree reduce labeler variance; measure effects on downstream loss and evaluation metrics.
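Inter-annotator agreement: Use Cohen's kappa for two coders and Krippendorff's alpha for more than two coders or when some ratings are missing. Interpret values against a stated convention rather than a bare cutoff, e.g., the Landis and Koch bands (0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial), and remember that kappa depends on label prevalence.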
Label bias vs. label noise: Distinguish systematic bias (skewed by instruction, population, or framing) from random noise; bias shifts expected value, noise inflates variance and harms calibration.
Sampling & representation: Stratify by important covariates (e.g., prompt type, user cohort, timestamp) to avoid sample selection bias. When reweighting an observed sample toward a target distribution, use importance (inverse-probability) weights: $w_i \propto \frac{P_{\text{target}}(x_i)}{P_{\text{obs}}(x_i)}$.
Gold-set and adjudication: Maintain a gold-labeled subset (expert adjudication) to estimate per-labeler accuracy, confusion matrices, and to calibrate labeler reliability scores for reweighting or filtering.
Labeler reliability modeling: Implement item-response or Dawid–Skene models to infer true labels and per-annotator confusion; useful when labels are sparse or annotator skill varies.
Calibration & confidence: Measure calibration of aggregated labels vs. model probabilities (e.g., reliability diagrams, Brier score); poor calibration indicates mis-specified aggregation or labeler miscalibration.
A/B and causal checks on label changes: When changing instructions or labeler pool, treat as an experiment: pre-specify primary metric, run power analysis, monitor secondary metrics and potential confounders.
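Power analysis for annotation studies: Compute the sample size needed to detect a label-distribution shift or an improvement in a downstream metric. For comparing two proportions $p_1$ and $p_2$ with a two-sided test at significance level $\alpha$ and power $1-\beta$, the required sample size per group is $n = \frac{(z_{1-\alpha/2}+z_{1-\beta})^2 (p_1(1-p_1)+p_2(1-p_2))}{(p_1-p_2)^2}$.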
Downstream impact evaluation: Evaluate interventions by retraining or offline-simulating model changes and measuring held-out performance, fairness metrics, and robustness to adversarial prompts.
Anomaly & drift detection: Monitor label distributions, inter-annotator agreement, and labeler-specific confusion over time; use statistical tests matched to the data type (chi-squared for categorical label counts, Kolmogorov–Smirnov for continuous scores) with correction for multiple comparisons.
Multiple testing & p-hacking: Apply FDR control or Bonferroni when scanning many items/segments; pre-register primary hypotheses where feasible.
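The two-proportion sample-size formula in the power-analysis item above can be evaluated directly. The following is a minimal sketch using only the Python standard library; the function name `n_per_group` and the example proportions are illustrative assumptions, not values from any real study.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(p1: float, p2: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Per-group sample size for a two-sided two-proportion z-test.

    Implements n = (z_{1-alpha/2} + z_{1-beta})^2 * (p1(1-p1) + p2(1-p2)) / (p1 - p2)^2,
    where power = 1 - beta. Rounds up to the next whole label.
    """
    if p1 == p2:
        raise ValueError("p1 and p2 must differ; a zero effect needs infinite samples")
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = std_normal.inv_cdf(power)           # quantile for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)     # unpooled variance term
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)


# Hypothetical scenario: the share of "preferred" labels is expected to move
# from 30% to 35% after an instruction change.
labels_needed = n_per_group(0.30, 0.35)
```

Because the required size scales with the inverse square of the effect, halving the detectable shift roughly quadruples the number of labels needed per arm, which is usually the number an interviewer wants you to reason about aloud.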

Worked example

Example task: "Diagnose a drop in model alignment metric after a labeling instruction change." First 30 seconds: ask which metric dropped, when the instruction change was applied, and whether a gold set or historical labeler pool exists. Frame the analysis around (1) timeline correlation, (2) distributional comparison of labels and prompts pre/post, (3) annotator-level behavior, and (4) whether the downstream model was retrained on the new labels or only evaluated against them, so that pre/post numbers are comparable. Skeleton: compute a time series of the metric with confidence intervals; stratify by prompt type and user segment; compare inter-annotator agreement and confusion matrices pre/post; if the rollout was randomized, analyze it as an A/B test. Flag a key tradeoff: a quick rollback gives faster recovery but forfeits the data needed to assess any longer-term benefit of the new instructions. Close by saying: with more time, I'd run a causal-impact analysis (synthetic control) and retrain a short-running model on pre/post data to quantify the expected end-user impact.
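One way to make the "compare inter-annotator agreement" step concrete is to compute Cohen's kappa on doubly-labeled items before and after the instruction change. The sketch below is self-contained; the function name `cohens_kappa` and the toy label lists are illustrative assumptions, not real data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    categories = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / n**2
    if expected == 1.0:
        return float("nan")  # undefined when both annotators use one identical label
    return (observed - expected) / (1 - expected)


# Toy data (hypothetical): the same two annotators, before and after the change.
pre_a = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
pre_b = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
post_a = [1, 0, 0, 1, 1, 0, 1, 1, 0, 0]
post_b = [1, 1, 0, 0, 1, 0, 0, 1, 1, 0]

kappa_pre = cohens_kappa(pre_a, pre_b)
kappa_post = cohens_kappa(post_a, post_b)
```

On these toy labels kappa falls from 0.8 to 0.2, the kind of drop that would point toward ambiguous new instructions rather than a model regression. With ten items the estimate is very noisy, so in practice you would report it with a bootstrap confidence interval.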

A second angle

Different task: "Design a labeling campaign to collect human preferences for toxic response ranking." Here the emphasis shifts to upfront design: define safe-by-design annotation guidelines, a sampling plan that oversamples rare toxic cases, and explicit negative and positive examples. You'd choose between labeling fewer items with high redundancy (better per-item reliability) and labeling more items with low redundancy (broader coverage), balancing budget against variance. For evaluation, prefer held-out adversarial prompts and measure both aggregate preference rates and subgroup performance. The same tools (agreement metrics, adjudication, Dawid–Skene) apply, but constraints such as safety triage, annotator training, and ethics review influence sample sizes and stopping rules.
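The redundancy-versus-coverage choice can be roughed out numerically before any labels are bought. The sketch below assumes binary pairwise preferences, annotators who err independently, and a single shared per-annotator accuracy `p`; all three are simplifying assumptions, and the function names and budget are hypothetical.

```python
from math import comb


def majority_vote_accuracy(p: float, k: int) -> float:
    """Probability that a majority of k independent annotators, each correct
    with probability p, picks the right answer on a binary item."""
    if k % 2 == 0:
        raise ValueError("use an odd k so a majority always exists")
    return sum(
        comb(k, j) * p**j * (1 - p) ** (k - j)
        for j in range(k // 2 + 1, k + 1)
    )


def redundancy_plan(budget: int, p: float, ks=(1, 3, 5)):
    """For each redundancy level k, return (k, items covered, per-item accuracy)."""
    return [(k, budget // k, majority_vote_accuracy(p, k)) for k in ks]


# Hypothetical budget of 3,000 labels with annotators who are right 80% of the time.
plan = redundancy_plan(budget=3000, p=0.8)
```

Under these assumptions, moving from one to three labels per item raises per-item accuracy from 0.80 to 0.896 while cutting coverage to a third. Correlated annotator errors, which are common on ambiguous toxic content, would make the real gain smaller.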

Common pitfalls

Pitfall: Over-aggregating labels without checking labeler bias. Majority-vote aggregation can hide systematic skew when a biased annotator contributes a large share of the labels. Instead, compute per-annotator confusion matrices and apply weighted aggregation or adjudication.
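A small sketch of per-annotator confusion and weighted aggregation follows. It scores each annotator against a gold set and then weights votes by the log-odds of their smoothed accuracy. The log-odds weighting is one common choice that I am assuming here for illustration, not a method the text prescribes, and all names and data are hypothetical.

```python
from collections import Counter, defaultdict
from math import log


def per_annotator_confusion(gold, labels):
    """gold: {item: true_label}; labels: iterable of (annotator, item, label).
    Returns {annotator: Counter mapping (true_label, given_label) -> count}."""
    confusion = defaultdict(Counter)
    for annotator, item, label in labels:
        if item in gold:  # only gold items can be scored
            confusion[annotator][(gold[item], label)] += 1
    return confusion


def smoothed_accuracy(confusion, smoothing=1.0):
    """Accuracy with add-one smoothing so the log-odds below stay finite."""
    correct = sum(n for (true, given), n in confusion.items() if true == given)
    total = sum(confusion.values())
    return (correct + smoothing) / (total + 2 * smoothing)


def weighted_vote(votes, accuracy):
    """votes: list of (annotator, label) for one item. Unknown annotators get weight 0."""
    scores = defaultdict(float)
    for annotator, label in votes:
        p = accuracy.get(annotator, 0.5)
        scores[label] += log(p / (1 - p))
    return max(scores, key=scores.get)


# Toy gold set: annotator "a" is accurate; "b" and "c" always answer 0.
gold = {"g1": 1, "g2": 0, "g3": 1, "g4": 0}
labels = [("a", "g1", 1), ("a", "g2", 0), ("a", "g3", 1), ("a", "g4", 0)]
labels += [(name, item, 0) for name in ("b", "c") for item in gold]

confusion = per_annotator_confusion(gold, labels)
accuracy = {name: smoothed_accuracy(c) for name, c in confusion.items()}

# On a new item, a plain majority vote would return 0; the weighted vote follows "a".
decision = weighted_vote([("a", 1), ("b", 0), ("c", 0)], accuracy)
```

The confusion counts for "b" and "c" show every true 1 recorded as 0, which is the systematic skew a raw majority vote would hide. A Dawid–Skene model generalizes this idea when no gold set is available.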

Pitfall: Treating label distribution shift as model drift — a change in labels due to instruction or population does not necessarily mean model degradation; quantify whether downstream model predictions changed relative to a stable gold set before retraining.

Pitfall: Ignoring multiple comparisons when scanning segments. It is tempting to chase the smallest p-value across many segments; control the false discovery rate and pre-specify primary checks to avoid chasing noise.
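False discovery rate control can be sketched with the Benjamini–Hochberg step-up procedure, shown here in plain Python. The per-segment p-values are made up for illustration, and the function name is an assumption.

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Return a list of booleans, True where the hypothesis is rejected
    while controlling the false discovery rate at level q."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p-value
    # Find the largest rank k such that p_(k) <= (k / m) * q.
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / m * q:
            cutoff = rank
    # Reject every hypothesis ranked at or below that cutoff.
    rejected = [False] * m
    for i in order[:cutoff]:
        rejected[i] = True
    return rejected


# Hypothetical p-values from testing six prompt segments for a label shift.
segment_pvalues = [0.001, 0.008, 0.039, 0.041, 0.20, 0.60]
flags = benjamini_hochberg(segment_pvalues, q=0.05)
```

A naive p < 0.05 rule would flag four of these six segments, while the procedure keeps only the first two. In production code a maintained implementation, such as the one in statsmodels, is usually preferable to a hand-rolled version.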

Connections

Label-quality discussions often pivot to causal inference (when you must estimate the effect of instruction changes), model evaluation (how label noise affects metrics like AUC or calibration), and data-centric ML (prioritizing data fixes over model changes). Interviewers may transition to experiment design, reliability engineering of labeling pipelines, or fairness audits.