## Problem
You are given a **text dataset** for a **binary classification** task (label in \{0,1\}). Each example has been labeled by **multiple human annotators**, and annotators often disagree (i.e., the same item can have conflicting labels).
You need to:
1. Perform a **dataset/label analysis** to understand the disagreement and likely label noise.
2. Propose a **training and evaluation approach** that improves **offline metrics** (e.g., F1 / AUC / accuracy), given the noisy multi-annotator labels.
### Assumptions you may make (state them clearly)
- You have access to: raw text, per-annotator labels, annotator IDs, and timestamps.
- You can retrain models and change the labeling aggregation strategy, but you may have limited or no ability to collect new labels.
### Deliverables
- What analyses would you run and what would you look for?
- How would you construct train/validation/test splits to avoid misleading offline metrics?
- How would you convert multi-annotator labels into training targets?
- What model/loss/thresholding/calibration choices would you try, and why?
- What failure modes and edge cases could cause offline metric gains to be illusory?
Quick Answer: This question evaluates a candidate's ability to analyze noisy multi-annotator labels, design label-aggregation and dataset-splitting strategies, and choose modeling and evaluation approaches for binary text classification.
## Solution
This is an applied-ML design question: the hard part is not the classifier but **treating the labels as a noisy, structured signal rather than ground truth**. A strong answer separates *intrinsic* disagreement (genuine label ambiguity in the data) from *annotator* noise (sloppy or biased labelers), builds an evaluation harness that can't be gamed by that noise, and only then talks about losses and calibration.
### 0. Frame the problem and state assumptions
- **Task:** binary text classification, $y \in \{0,1\}$. Each item $i$ has labels from a set of annotators; the same item can receive conflicting labels.
- **Available signal:** raw text, per-annotator labels $\{l_{ij}\}$, annotator IDs $j$, timestamps. I can retrain and change aggregation, but **cannot cheaply buy new labels** — so I must extract maximum value from existing redundancy.
- **Key reframing** (this propagates through every part): disagreement has (at least) two sources, and the whole approach hinges on telling them apart.
  1. **Aleatoric / intrinsic ambiguity**: the item is genuinely borderline (sarcasm, mixed sentiment, vague policy boundary). No amount of annotator quality fully removes this. The *right* target here is a soft label / probability, not a hard $0/1$.
  2. **Annotator noise / bias**: a specific labeler is careless, rushed, adversarial, or systematically interprets the guideline differently. This *should* be down-weighted or corrected.
Conflating the two is the central failure mode: you either over-clean (delete the hard, informative examples) or you trust junk labels.
---
### Part 1 — Dataset & label-noise analysis
**A. Quantify agreement (overall and per item).**
- **Per-item agreement / entropy.** For item $i$ with $n_i$ annotator labels, compute the empirical positive rate $\hat{p}_i = \frac{1}{n_i}\sum_j l_{ij}$ and its label entropy
$$H_i = -\hat p_i\log \hat p_i-(1-\hat p_i)\log(1-\hat p_i).$$
High $H_i$ flags ambiguous items; this gives a *ranking* of items by how borderline they are.
- **Chance-corrected agreement.** Raw % agreement is misleading under class imbalance, so report **Cohen's $\kappa$** (pairwise) or **Fleiss' $\kappa$** (multi-rater):
$$\kappa = \frac{p_o - p_e}{1 - p_e}$$
where $p_o$ is observed agreement and $p_e$ is agreement expected by chance from the marginal class rates. A high % agreement with *low* $\kappa$ means raters are mostly agreeing on the majority class and giving you little signal on the minority class.
- **Krippendorff's $\alpha$** when redundancy is uneven (variable annotators per item, missing labels): it handles incomplete designs gracefully, which matters here because the number of labels $n_i$ varies across items.
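The per-item and pairwise statistics above can be sketched with the standard library alone. The `(item_id, annotator_id, label)` tuple layout and the function names are assumptions made for illustration; the formulas are the $\hat p_i$, $H_i$, and $\kappa$ definitions given above.

```python
import math
from collections import defaultdict

def item_stats(labels):
    """labels: iterable of (item_id, annotator_id, label), label in {0, 1}.
    Returns {item_id: (n_i, p_hat_i, H_i)} with entropy in nats."""
    votes = defaultdict(list)
    for item_id, _annotator_id, label in labels:
        votes[item_id].append(label)
    stats = {}
    for item_id, v in votes.items():
        n = len(v)
        p = sum(v) / n
        # 0 * log(0) is treated as 0, so unanimous items get H = 0.
        h = -sum(q * math.log(q) for q in (p, 1 - p) if q > 0)
        stats[item_id] = (n, p, h)
    return stats

def cohen_kappa(a, b):
    """Cohen's kappa for two raters' binary labels on the same items."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n   # observed agreement
    pa, pb = sum(a) / n, sum(b) / n               # each rater's positive rate
    p_e = pa * pb + (1 - pa) * (1 - pb)           # agreement expected by chance
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else float("nan")
```

For example, two raters who each mark one of ten items positive, but never the same one, agree on 80% of items yet get a slightly negative $\kappa$ from `cohen_kappa([0]*9 + [1], [0]*8 + [1, 0])`. That is the "high agreement, low $\kappa$" pattern described above.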
**B. Profile the annotators (this is where the IDs and timestamps earn their keep).**
- **Per-annotator confusion vs. consensus.** For each annotator $j$, compute their agreement with the per-item majority, and estimate a $2\times2$ confusion matrix (their TPR/TNR vs. consensus). Look for high-bias raters (always-positive / always-negative), low-skill raters (near-random), and raters whose behavior changed over time.
- **Volume & speed.** Join labels to timestamps: labels-per-minute, burst patterns, day-of-week drift. Implausibly fast labeling correlates with noise; a sudden behavior change can flag a guideline update or a new, untrained cohort.
- **Annotator overlap / coverage.** How many items have $\ge 2$, $\ge 3$ labels? Items labeled by a *single* annotator carry that annotator's bias with no way to detect it — flag them as higher-risk.
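A minimal sketch of the confusion-vs-consensus profile follows. It makes two assumptions that are not in the text above: it compares each annotator to the *leave-one-out* majority of their peers (so their own vote does not count toward the consensus they are judged against), and it skips items where peers are tied or absent. It also assumes each annotator labels an item at most once. The function name and tuple layout are hypothetical.

```python
from collections import Counter, defaultdict

def annotator_profiles(labels):
    """labels: list of (item_id, annotator_id, label), label in {0, 1}.
    Returns per-annotator agreement, TPR and TNR versus peer consensus."""
    by_item = defaultdict(list)
    for item_id, annotator_id, label in labels:
        by_item[item_id].append((annotator_id, label))

    counts = defaultdict(Counter)  # annotator -> Counter[(consensus, own_label)]
    for votes in by_item.values():
        total_pos = sum(label for _, label in votes)
        for annotator_id, label in votes:
            others_n = len(votes) - 1
            others_pos = total_pos - label
            if others_n == 0 or 2 * others_pos == others_n:
                continue  # no peers, or peers tied: no consensus to compare to
            consensus = int(2 * others_pos > others_n)
            counts[annotator_id][(consensus, label)] += 1

    profiles = {}
    for annotator_id, c in counts.items():
        pos = c[(1, 1)] + c[(1, 0)]  # items whose peer consensus is positive
        neg = c[(0, 0)] + c[(0, 1)]  # items whose peer consensus is negative
        profiles[annotator_id] = {
            "n_compared": pos + neg,
            "agreement": (c[(1, 1)] + c[(0, 0)]) / (pos + neg),
            "tpr": c[(1, 1)] / pos if pos else None,
            "tnr": c[(0, 0)] / neg if neg else None,
        }
    return profiles
```

An always-positive rater shows up as `tpr` near 1 with `tnr` near 0, and a near-random rater as both near 0.5. Annotators with a small `n_compared` should be read with caution, since their estimates rest on few overlapping items.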
**C. Separate real ambiguity from a bad labeler.** The discriminating test: pull a *stratified sample of high-entropy items and read them*. If a high-$H_i$ item is genuinely borderline on the guideline, that's intrinsic ambiguity → keep it as a soft label. If high disagreement instead concentrates on a few raters (the ones the annotator profiles in step B flag), that's annotator noise → down-weight. This qualitative pass also surfaces **systematic edge cases** (negation, code-switching, links/markup) that the model is likely to get wrong too.
**D. Check for confounds before modeling.**
- **Class imbalance** — drives metric choice (favor F1 / PR-AUC over accuracy).
- **Leakage signals** — does annotator ID, timestamp, or batch correlate with the label? If so, a model can "cheat" by learning the annotation process, and splits must isolate that (Part 2).
- **Near-duplicates** — exact and near-duplicate texts that land in different splits inflate metrics; dedup first.
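One way to sketch the near-duplicate check is character-shingle Jaccard similarity with union-find grouping. The shingle size `k=5` and the `0.8` threshold are illustrative assumptions that would need tuning on real data. The pairwise loop is quadratic, so a large corpus would need an approximate method such as MinHash instead.

```python
import re

def shingles(text, k=5):
    """Set of character k-grams after lowercasing and collapsing whitespace."""
    norm = re.sub(r"\s+", " ", text.lower()).strip()
    return {norm[i:i + k] for i in range(max(1, len(norm) - k + 1))}

def near_duplicate_groups(texts, threshold=0.8, k=5):
    """Return one group id per text; texts linked (transitively) by
    Jaccard similarity >= threshold share the same id."""
    parent = list(range(len(texts)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    sets = [shingles(t, k) for t in texts]
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            union = len(sets[i] | sets[j])
            if union and len(sets[i] & sets[j]) / union >= threshold:
                parent[find(j)] = find(i)
    return [find(i) for i in range(len(texts))]
```

The returned group ids can then serve as the grouping key when building splits, so that every member of a duplicate cluster lands in the same fold.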
---
### Part 2 — Splits that don't lie
The evaluation harness is the most important design decision, because every downstream gain is measured through it.
- **Build the cleanest possible test set, and freeze it.** Restrict the headline test set to items with **high redundancy and high agreement** (e.g., $\ge 3$ annotators, unanimous or near-unanimous), or to an adjudicated/gold subset if one exists. On items where humans genuinely can't agree, a hard **accuracy/F1** number is largely uninterpretable — there is no defensible $0/1$ ground truth — so headline metrics live on the least-ambiguous items.
- **Don't discard the ambiguous items — diagnose on them.** Keep a **separate "ambiguous" slice** and score the model there with a *soft* metric (log-loss or Brier against the empirical $\hat p_i$), which is meaningful even without a hard label. This slice never drives model selection on hard metrics, but it's how you check the model isn't getting worse on exactly the borderline cases that matter in production.
- **Group by item and by near-duplicate** so the *same or near-identical text never appears in two splits*. Otherwise you measure memorization.
- **Be deliberate about annotator overlap** — two regimes, and be explicit about which question you're answering:
  - *Generalize to new text from known annotators*: a random item-level split is fine.
  - *Generalize to new annotators / production*: hold out **entire annotators** (group-by-annotator), so the model can't exploit annotator-specific quirks. This is the honest test when production data won't carry annotator IDs.
- **Respect time.** Split temporally (train on older, test on newer) to catch concept/guideline drift and to mirror deployment, where you always predict forward.
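- **Fit aggregation on the train fold only.** Anything estimated across items, such as EM annotator reliabilities and class priors (Part 3), must be fit within the training fold; otherwise the test items' votes shape the training targets and the metric leaks. Plain per-item majority vote has no cross-item parameters, so it does not leak this way.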
Pin a **stratified** split (by class and, ideally, by agreement level) so minority-class and hard items are represented in val/test.
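A grouped split can be sketched by hashing a group key, so that everything sharing the key lands in the same fold and the assignment is reproducible. The function name, the `salt` parameter, and the 10% fractions are assumptions for illustration. This sketch does not stratify or respect time on its own, so those constraints would have to be layered on top.

```python
import hashlib

def assign_split(group_key, val_frac=0.1, test_frac=0.1, salt="v1"):
    """Deterministically map a group key to 'train', 'val', or 'test'.
    All rows sharing a key get the same split, across runs and machines."""
    digest = hashlib.sha256(f"{salt}:{group_key}".encode("utf-8")).hexdigest()
    u = int(digest[:8], 16) / 0x100000000  # roughly uniform in [0, 1)
    if u < test_frac:
        return "test"
    if u < test_frac + val_frac:
        return "val"
    return "train"
```

For the item-level regime the key would be the near-duplicate group id. For the leave-annotators-out regime the key would be the annotator ID, applied to individual labels rather than items. Changing `salt` produces a fresh split, which is why it should be fixed once the test set is frozen.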
---
### Part 3 — Turning multi-annotator labels into training targets
Treat this as a ladder and A/B the rungs against the frozen test set:
**1. Majority vote (baseline).** Simple, but discards uncertainty and is fragile with even rater counts / sparse redundancy.
**2. Soft labels (recommended default for ambiguous data).** Use the empirical agreement $\hat p_i$ (optionally smoothed) as a **probabilistic target** and train with soft-target cross-entropy:
$$\mathcal{L}_i = -\big[\hat p_i \log f_\theta(x_i) + (1-\hat p_i)\log\big(1-f_\theta(x_i)\big)\big].$$
This stops the model from being forced to commit on genuinely 50/50 items and acts as label smoothing where it's warranted. It directly encodes the §0 insight that ambiguity is real signal. It also *tends* to improve calibration relative to hard $0/1$ training (the target itself carries uncertainty) — a tendency, not a guarantee, so I still run the explicit calibration step in Part 4 and verify with ECE.
**3. Confidence / agreement weighting.** Weight each example's loss by its agreement (e.g., proportional to $|2\hat p_i - 1|$ or to redundancy $n_i$), so confident items dominate the decision boundary and 50/50 items contribute less to the hard cutoff.
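Rungs 2 and 3 can be sketched together in plain Python. The `floor` on the agreement weight is an assumption added here so that 50/50 items are down-weighted rather than dropped, and the function names are hypothetical.

```python
import math

def soft_bce(p_target, f_pred, eps=1e-12):
    """Binary cross-entropy against a soft target p_target in [0, 1]."""
    f = min(max(f_pred, eps), 1 - eps)  # clip so log() stays finite
    return -(p_target * math.log(f) + (1 - p_target) * math.log(1 - f))

def agreement_weight(p_target, floor=0.1):
    """|2p - 1| weighting; the floor (an assumption) keeps 50/50 items in play."""
    return max(abs(2 * p_target - 1), floor)

def weighted_soft_loss(p_targets, f_preds, floor=0.1):
    """Agreement-weighted mean of the per-item soft-target losses."""
    weights = [agreement_weight(p, floor) for p in p_targets]
    losses = [soft_bce(p, f) for p, f in zip(p_targets, f_preds)]
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)
```

For a fixed target $\hat p_i$ the loss is smallest when the prediction equals $\hat p_i$, where it equals the item entropy $H_i$. That is why a model trained this way is not pushed toward 0 or 1 on genuinely split items. In a deep-learning framework the same loss is usually available directly; for instance, PyTorch's `binary_cross_entropy_with_logits` accepts float targets and a per-element `weight`.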
**4. Annotator-quality–aware aggregation (EM / Dawid–Skene).** When you suspect specific bad raters, model each annotator's reliability with a latent-true-label model. Dawid–Skene posits a true label $z_i$ and per-annotator confusion matrices $\pi^{(j)}$ where $\pi^{(j)}_{km}=P(l_{ij}=m \mid z_i=k)$, and runs EM:
- **E-step:** posterior over the true label, $\;P(z_i = k \mid \cdot) \propto \rho_k \prod_j \prod_m \big(\pi^{(j)}_{km}\big)^{\mathbb{1}[l_{ij}=m]}$ (with class priors $\rho_k$; the product over $j$ runs only over the annotators who labeled item $i$).
- **M-step:** re-estimate $\pi^{(j)}$ and $\rho_k$ from the soft posteriors.
The output is a **per-item posterior probability** (a principled soft label) *and* estimated reliability per annotator — which lets you down-weight or exclude bad raters and is far more robust than majority vote when redundancy is uneven. The posteriors can feed rungs 2/3.
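A compact binary version of the EM loop described above might look like the following. It is a sketch under assumptions: posteriors are initialised from the vote fraction, add-one pseudo-counts (`smoothing`) stand in for a prior on the confusion matrices, the iteration count is fixed rather than tested for convergence, and likelihoods are multiplied directly, which would need log-space arithmetic if items had many votes.

```python
from collections import defaultdict

def dawid_skene_binary(labels, n_iter=50, smoothing=1.0):
    """labels: list of (item_id, annotator_id, label), label in {0, 1}.
    Returns (post, conf, prior): post[item] = P(z=1 | votes),
    conf[annotator][k][m] = P(label=m | z=k), prior[k] = P(z=k)."""
    by_item = defaultdict(list)
    for item, ann, label in labels:
        by_item[item].append((ann, label))
    annotators = {ann for _, ann, _ in labels}

    # Start from the per-item vote fraction (a soft majority vote).
    post = {i: sum(l for _, l in v) / len(v) for i, v in by_item.items()}

    for _ in range(n_iter):
        # M-step: class prior and confusion matrices from soft counts.
        rho1 = sum(post.values()) / len(post)
        prior = [1 - rho1, rho1]
        counts = {a: [[smoothing, smoothing], [smoothing, smoothing]]
                  for a in annotators}
        for i, votes in by_item.items():
            q = (1 - post[i], post[i])
            for ann, label in votes:
                for k in (0, 1):
                    counts[ann][k][label] += q[k]
        conf = {a: [[c[k][m] / (c[k][0] + c[k][1]) for m in (0, 1)]
                    for k in (0, 1)]
                for a, c in counts.items()}

        # E-step: posterior over the true label of each item.
        for i, votes in by_item.items():
            like = list(prior)
            for ann, label in votes:
                for k in (0, 1):
                    like[k] *= conf[ann][k][label]
            post[i] = like[1] / (like[0] + like[1])
    return post, conf, prior
```

The returned `post` values can be used directly as soft targets, and `conf` gives each annotator's estimated error rates: for example, `conf[a][0][1]` is the estimated false-positive rate of annotator `a`.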
**5. Joint / end-to-end annotator modeling.** Learn the classifier and an annotator-noise layer together (a per-annotator confusion transform on top of the shared classifier logits, optimized over the *raw* per-annotator labels rather than aggregated ones). This avoids a lossy two-stage pipeline and uses every individual vote, at the cost of more moving parts and tuning.
**Justified default:** ship **EM/Dawid–Skene soft posteriors + agreement weighting**, with majority vote kept as the baseline to beat. Reserve the joint model for when redundancy is high and the two-stage approach plateaus. The default is principled (down-weights bad raters), keeps uncertainty (soft target), and degrades gracefully where the joint model wouldn't converge.
---
### Part 4 — Model, loss, calibration, thresholding
**Model.** For text, a fine-tuned pretrained transformer encoder (a BERT-family model) is the strong default; a TF-IDF + linear/GBM baseline is worth keeping for speed, interpretability, and as a sanity floor. The label strategy (Part 3) matters more than squeezing the architecture.
**Loss.**
- **Soft-target cross-entropy** when using soft labels (Part 3.2) — the loss must match the target.
- Under heavy class imbalance, **class-weighted** loss or **focal loss** stops the majority class from swamping gradients — but tune carefully, since focal loss interacts with calibration.
- Optional **noise-robust losses** (symmetric or generalized cross-entropy) if hard majority-vote labels are unavoidable and you want robustness to flipped labels.
**Regularization against noise.** Early stopping on a *clean* validation fold, plus dropout/weight decay. Flexible nets fit the *clean* structure first and **memorize noisy labels late in training**, so the best generalization is often *before* convergence on the noisy training loss — which is exactly why early stopping (not just more epochs) is the lever here.
**Calibration.** Training on soft or re-weighted targets does not guarantee that predicted probabilities match observed frequencies at inference, so **explicitly calibrate** predicted probabilities on a held-out clean fold (temperature scaling for a neural net; isotonic/Platt for others) and report **ECE / reliability diagrams**. Calibration matters because the thresholding decision, and any downstream consumer of the score, depends on probabilities meaning what they say.
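As an illustration, temperature scaling for a binary model and a binned ECE can be sketched as below. The grid search over `T` (0.25 to 5.0) and the ten equal-width bins are assumptions chosen for brevity; a real implementation would more likely fit `T` with a gradient-based optimiser.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1 / (1 + math.exp(-z))
    e = math.exp(z)
    return e / (1 + e)

def nll(logits, labels, temperature, eps=1e-12):
    """Mean negative log-likelihood of labels under sigmoid(logit / T)."""
    total = 0.0
    for z, y in zip(logits, labels):
        p = min(max(sigmoid(z / temperature), eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(logits)

def fit_temperature(logits, labels, grid=None):
    """Pick the temperature on the grid that minimises held-out NLL."""
    grid = grid or [0.25 + 0.05 * i for i in range(96)]  # 0.25 .. 5.0
    return min(grid, key=lambda t: nll(logits, labels, t))

def ece(probs, labels, n_bins=10):
    """Expected calibration error: bin by predicted positive probability,
    then average |mean prediction - positive rate| weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    n = len(probs)
    return sum(
        len(b) / n * abs(sum(p for p, _ in b) / len(b) - sum(y for _, y in b) / len(b))
        for b in bins if b
    )
```

The temperature should be fit on a clean validation fold and then applied unchanged to the test set. Because dividing logits by a positive constant preserves their ranking, this step changes ECE and log-loss but leaves ROC-AUC untouched.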
**Thresholding.** Don't default to $0.5$.
- Choose the operating threshold by **maximizing the target metric on validation** (F1, or a cost-weighted objective if FP/FN costs differ), then report it on the frozen test set.
- Report **threshold-free** metrics (ROC-AUC, and PR-AUC for imbalance) so model quality is judged independently of any one threshold.
- If decisions are downstream-cost-sensitive, pick the threshold on an explicit cost matrix, not a symmetric metric.
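The threshold sweep described above can be sketched as follows. The candidate set (every distinct predicted score) and the example costs `fp_cost=1.0`, `fn_cost=5.0` are illustrative assumptions.

```python
def confusion_at(probs, labels, threshold):
    """Return (tp, fp, fn, tn) when predicting positive for p >= threshold."""
    tp = fp = fn = tn = 0
    for p, y in zip(probs, labels):
        pred = p >= threshold
        if pred and y:
            tp += 1
        elif pred:
            fp += 1
        elif y:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def f1_at(probs, labels, threshold):
    tp, fp, fn, _ = confusion_at(probs, labels, threshold)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def neg_cost_at(probs, labels, threshold, fp_cost=1.0, fn_cost=5.0):
    """Negative total cost, so that 'higher is better' like F1."""
    _, fp, fn, _ = confusion_at(probs, labels, threshold)
    return -(fp_cost * fp + fn_cost * fn)

def best_threshold(probs, labels, score=f1_at):
    """Sweep every distinct score as a candidate; ties go to the lowest."""
    candidates = sorted(set(probs))
    return max(candidates, key=lambda t: score(probs, labels, t))
```

The sweep should run on validation predictions only. The chosen value is then fixed and applied once to the frozen test set, since re-tuning it on test data would turn the test metric into an optimistic estimate.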
---
### Part 5 — Why offline gains can be illusory (and guards)
For every "+1 point," name what would make it disappear and the checkable guard against it, across five axes:
- **Labels:** the test set inherits the same annotator noise, so you're measuring agreement-with-noise, not correctness. *Guard:* high-redundancy / high-agreement (ideally adjudicated) test set; diagnose separately on the ambiguous slice.
- **Splits:** near-duplicate texts or annotator-ID/batch correlations let the model cheat. *Guard:* dedup; group-by-item and (when appropriate) group-by-annotator splits; temporal split.
- **The metric itself:** a tiny F1 gain can be inside test-set noise, or a threshold tuned to one metric is brittle. *Guard:* bootstrap CIs on the test metric; require the gain to clear the CI and to hold on PR-AUC and across subgroups, not one headline number. Also: under imbalance, accuracy is fooled by a majority-class predictor → lead with F1 / PR-AUC / per-class recall.
- **Data you cleaned:** aggressively dropping high-disagreement items inflates test scores while making the model worse on the cases that matter. *Guard:* keep ambiguous items as soft labels and report the ambiguous-slice soft metric (log-loss / Brier vs. $\hat p_i$).
- **Offline-to-production gap:** offline labels come from a curated pool; production text/users differ, and calibration can drift (a model better by AUC but miscalibrated hurts any thresholded downstream system). *Guard:* temporal + group-by-annotator generalization tests, report ECE alongside discrimination, and confirm with live metrics after deploy.
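The bootstrap guard on the metric can be sketched as a paired bootstrap over test items: resample the same indices for both models and look at the distribution of the metric difference. The function names, `n_boot=2000`, and the 95% percentile interval are assumptions for illustration.

```python
import random

def f1(preds, labels):
    """F1 for hard 0/1 predictions."""
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def paired_bootstrap_diff(labels, preds_a, preds_b, metric=f1,
                          n_boot=2000, seed=0):
    """95% percentile interval for metric(B) - metric(A), resampling
    test items with replacement and scoring both models on each resample."""
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        a = metric([preds_a[i] for i in idx], y)
        b = metric([preds_b[i] for i in idx], y)
        diffs.append(b - a)
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot) - 1]
```

If the interval contains zero, the observed gain is not distinguishable from resampling noise on this test set. If near-duplicate groups or annotators were the unit of the split, it is safer to resample those groups rather than individual items.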
---
### Recommended end-to-end recipe (what I'd actually do)
1. **Analyze:** Fleiss' $\kappa$ / Krippendorff's $\alpha$ overall, per-item entropy, per-annotator confusion-vs-consensus and speed profiles; read a stratified sample of high-entropy items; dedup.
2. **Freeze evaluation:** high-redundancy / high-agreement test set, grouped + temporal split, plus a separate ambiguous diagnostic slice (log-loss / Brier vs. $\hat p_i$); pick metrics (F1 / PR-AUC / per-class recall) up front.
3. **Targets:** majority-vote baseline → EM/Dawid–Skene soft posteriors with agreement weighting; down-weight/exclude raters EM flags as unreliable.
4. **Train:** fine-tuned transformer (TF-IDF+linear baseline alongside), soft-target CE, class weighting if needed, early stop on clean val.
5. **Calibrate** (temperature scaling) and **tune the threshold** on val for the target/cost metric; report threshold-free metrics too.
6. **Validate the win:** bootstrap CIs, subgroup and ambiguous-slice breakdowns, leakage audit — only then call it an improvement, and confirm with live metrics post-deploy.
---
### Follow-up Questions
**1. The annotator-reliability (EM) model degenerates at low redundancy or extreme class priors. Why, and what's the fallback?**
EM/Dawid–Skene needs each annotator to overlap with enough commonly-labeled items to *identify* their confusion matrix. When redundancy is low (many items with one or two labels), the per-annotator $\pi^{(j)}$ are estimated from almost no overlap and become arbitrary: the model can swap the meaning of the two latent classes (label-switching / non-identifiability) or fit reliabilities to noise. Extreme class priors make it worse: if nearly everyone labels one class, "always predict the majority class" is a near-optimal but useless fixed point, and a confidently-wrong-everywhere annotator can be mistaken for a flipped-but-reliable one. *Fallbacks:* add priors / smoothing (a Dirichlet/Beta prior on $\pi^{(j)}$ and $\rho_k$), pool sparse annotators into a shared "background" reliability, restrict EM to annotators with sufficient overlap, run multiple random restarts with a sanity check that estimated reliabilities are plausible, and **fall back to majority vote / agreement-weighted soft labels** on the low-redundancy subset rather than trusting a collapsed EM solution.
**2. Offline F1 improved by a point, but online metrics didn't move. Likely explanations, and how to have caught it earlier?**
- The +1 was **inside test-set noise** → catch with bootstrap CIs before reporting.
- **Distribution shift**: offline curated text ≠ production traffic → catch with a temporal and group-by-annotator generalization test offline, not just a random split.
- **Metric mismatch**: offline F1 isn't what the product optimizes (the online KPI may be a downstream conversion/cost), or the chosen threshold doesn't match the live operating point → align the offline objective and threshold to the online cost up front.
- **Calibration drift / consumer mismatch**: a downstream system thresholds the probability differently, so a discrimination gain doesn't translate → monitor ECE and the live operating point.
- **Online noise / underpowered A/B**: the experiment can't detect a one-point change → power the test or measure a proxy.
The general way to catch this earlier is to treat offline F1 as a *screen*, not proof: gate on CIs, a shift-aware split, and a metric/threshold that mirrors the live objective.
**3. Production won't carry annotator IDs at inference. How does that shape the split and the generalization question?**
If IDs vanish at inference, the model must **not** depend on annotator-specific signal, and the honest offline test is the **group-by-annotator (leave-annotators-out)** regime — train on some annotators, evaluate on entirely held-out ones — which estimates generalization to text labeled by people the model never "saw." A random item split would let the model exploit annotator quirks present in both train and test and overstate production performance. Concretely: annotator IDs may be used *during training* (e.g., the joint annotator-noise layer in Part 3.5) but must be **marginalized out at inference** (predict the consensus/true-label head, not a per-annotator head), and the reported headline metric should come from the leave-annotators-out + temporal split that matches deployment.