
Improve classifier with noisy multi-annotator labels

Last updated: Jun 21, 2026

Quick Overview

This question evaluates a candidate's ability to analyze noisy multi-annotator labels, design label-aggregation and dataset-splitting strategies, and choose modeling and evaluation approaches for binary text classification.

  • hard
  • OpenAI
  • Machine Learning
  • Machine Learning Engineer


Company: OpenAI

Role: Machine Learning Engineer

Category: Machine Learning

Difficulty: hard

Interview Round: Technical Screen

## Problem

You are given a **text dataset** for a **binary classification** task (label in $\{0,1\}$). Each example has been labeled by **multiple human annotators**, and annotators often disagree (i.e., the same item can have conflicting labels).

You need to:

1. Perform a **dataset/label analysis** to understand the disagreement and likely label noise.
2. Propose a **training and evaluation approach** that improves **offline metrics** (e.g., F1 / AUC / accuracy), given the noisy multi-annotator labels.

### Assumptions you may make (state them clearly)

- You have access to: raw text, per-annotator labels, annotator IDs, and timestamps.
- You can retrain models and change the labeling aggregation strategy, but you may have limited or no ability to collect new labels.

### Deliverables

- What analyses would you run and what would you look for?
- How would you construct train/validation/test splits to avoid misleading offline metrics?
- How would you convert multi-annotator labels into training targets?
- What model/loss/thresholding/calibration choices would you try, and why?
- What failure modes and edge cases could cause offline metric gains to be illusory?


Solution

This is an applied-ML design question: the hard part is not the classifier but **treating the labels as a noisy, structured signal rather than ground truth**. A strong answer separates *intrinsic* disagreement (genuine label ambiguity in the data) from *annotator* noise (sloppy or biased labelers), builds an evaluation harness that can't be gamed by that noise, and only then talks about losses and calibration.

### 0. Frame the problem and state assumptions

- **Task:** binary text classification, $y \in \{0,1\}$. Each item $i$ has labels from a set of annotators; the same item can receive conflicting labels.
- **Available signal:** raw text, per-annotator labels $\{l_{ij}\}$, annotator IDs $j$, timestamps. I can retrain and change aggregation, but **cannot cheaply buy new labels**, so I must extract maximum value from existing redundancy.
- **Key reframing** (this propagates through every part): disagreement has (at least) two sources, and the whole approach hinges on telling them apart.
  1. **Aleatoric / intrinsic ambiguity**: the item is genuinely borderline (sarcasm, mixed sentiment, vague policy boundary). No amount of annotator quality fully removes this. The *right* target here is a soft label / probability, not a hard $0/1$.
  2. **Annotator noise / bias**: a specific labeler is careless, rushed, adversarial, or systematically interprets the guideline differently. This *should* be down-weighted or corrected.

Conflating the two is the central failure mode: you either over-clean (delete the hard, informative examples) or you trust junk labels.

---

### Part 1 — Dataset & label-noise analysis

**A. Quantify agreement (overall and per item).**

- **Per-item agreement / entropy.** For item $i$ with $n_i$ annotator labels, compute the empirical positive rate $\hat{p}_i = \frac{1}{n_i}\sum_j l_{ij}$ and its label entropy

  $$H_i = -\hat p_i\log \hat p_i-(1-\hat p_i)\log(1-\hat p_i).$$

  High $H_i$ flags ambiguous items; this gives a *ranking* of items by how borderline they are.
- **Chance-corrected agreement.** Raw % agreement is misleading under class imbalance, so report **Cohen's $\kappa$** (pairwise) or **Fleiss' $\kappa$** (multi-rater):

  $$\kappa = \frac{p_o - p_e}{1 - p_e}$$

  where $p_o$ is observed agreement and $p_e$ is agreement expected by chance from the marginal class rates. A high % agreement with *low* $\kappa$ means raters are mostly agreeing on the majority class and giving you little signal on the minority class.
- **Krippendorff's $\alpha$** when redundancy is uneven (variable annotators per item, missing labels): it handles incomplete designs gracefully, which directly addresses the "variable $n_i$" constraint.

**B. Profile the annotators (this is where the IDs and timestamps earn their keep).**

- **Per-annotator confusion vs. consensus.** For each annotator $j$, compute their agreement with the per-item majority, and estimate a $2\times2$ confusion (their TPR/TNR vs. consensus). Spot: high-bias raters (always-positive / always-negative), low-skill raters (near-random), and raters who flipped behavior over time.
- **Volume & speed.** Join labels to timestamps: labels-per-minute, burst patterns, day-of-week drift. Implausibly fast labeling correlates with noise; a sudden behavior change can flag a guideline update or a new, untrained cohort.
- **Annotator overlap / coverage.** How many items have $\ge 2$, $\ge 3$ labels? Items labeled by a *single* annotator carry that annotator's bias with no way to detect it, so flag them as higher-risk.

**C. Separate real ambiguity from a bad labeler.**

The discriminating test: pull a *stratified sample of high-entropy items and read them*. If a high-$H_i$ item is genuinely borderline on the guideline, that's intrinsic ambiguity → keep it as a soft label. If high disagreement instead concentrates on a few raters (your Part-1B profiles light up), that's annotator noise → down-weight. This qualitative pass also surfaces **systematic edge cases** (negation, code-switching, links/markup) the model will also get wrong.

**D. Check for confounds before modeling.**

- **Class imbalance**: drives metric choice (favor F1 / PR-AUC over accuracy).
- **Leakage signals**: does annotator ID, timestamp, or batch correlate with the label? If so, a model can "cheat" by learning the annotation process, and splits must isolate that (Part 2).
- **Near-duplicates**: exact and near-duplicate texts that land in different splits inflate metrics; dedup first.

---

### Part 2 — Splits that don't lie

The evaluation harness is the most important design decision, because every downstream gain is measured through it.

- **Build the cleanest possible test set, and freeze it.** Restrict the headline test set to items with **high redundancy and high agreement** (e.g., $\ge 3$ annotators, unanimous or near-unanimous), or to an adjudicated/gold subset if one exists. On items where humans genuinely can't agree, a hard **accuracy/F1** number is largely uninterpretable (there is no defensible $0/1$ ground truth), so headline metrics live on the least-ambiguous items.
- **Don't discard the ambiguous items; diagnose on them.** Keep a **separate "ambiguous" slice** and score the model there with a *soft* metric (log-loss or Brier against the empirical $\hat p_i$), which is meaningful even without a hard label. This slice never drives model selection on hard metrics, but it's how you check the model isn't getting worse on exactly the borderline cases that matter in production.
- **Group by item and by near-duplicate** so the *same or near-identical text never appears in two splits*. Otherwise you measure memorization.
- **Be deliberate about annotator overlap**: there are two regimes, so be explicit about which question you're answering:
  - *Generalize to new text from known annotators*: a random item-level split is fine.
  - *Generalize to new annotators / production*: hold out **entire annotators** (group-by-annotator), so the model can't exploit annotator-specific quirks. This is the honest test when production data won't carry annotator IDs.
- **Respect time.** Split temporally (train on older, test on newer) to catch concept/guideline drift and to mirror deployment, where you always predict forward.
- **Fit aggregation on the train fold only.** Majority/EM aggregation (Part 3) must be estimated within the training fold; otherwise the test items' votes inform the training targets and the metric leaks.

Pin a **stratified** split (by class and, ideally, by agreement level) so minority-class and hard items are represented in val/test.

---

### Part 3 — Turning multi-annotator labels into training targets

Treat this as a ladder and A/B the rungs against the frozen test set:

**1. Majority vote (baseline).** Simple, but discards uncertainty and is fragile with even rater counts / sparse redundancy.

**2. Soft labels (recommended default for ambiguous data).** Use the empirical agreement $\hat p_i$ (optionally smoothed) as a **probabilistic target** and train with soft-target cross-entropy:

$$\mathcal{L}_i = -\big[\hat p_i \log f_\theta(x_i) + (1-\hat p_i)\log\big(1-f_\theta(x_i)\big)\big].$$

This stops the model from being forced to commit on genuinely 50/50 items and acts as label smoothing where it's warranted. It directly encodes the §0 insight that ambiguity is real signal.
It also *tends* to improve calibration relative to hard $0/1$ training (the target itself carries uncertainty); this is a tendency, not a guarantee, so I still run the explicit calibration step in Part 4 and verify with ECE.

**3. Confidence / agreement weighting.** Weight each example's loss by its agreement (e.g., proportional to $|2\hat p_i - 1|$ or to redundancy $n_i$), so confident items dominate the decision boundary and 50/50 items contribute less to the hard cutoff.

**4. Annotator-quality–aware aggregation (EM / Dawid–Skene).** When you suspect specific bad raters, model each annotator's reliability with a latent-true-label model. Dawid–Skene posits a true label $z_i$ and per-annotator confusion matrices $\pi^{(j)}$ where $\pi^{(j)}_{km}=P(l_{ij}=m \mid z_i=k)$, and runs EM:

- **E-step:** posterior over the true label, $P(z_i = k \mid \cdot) \propto \rho_k \prod_j \prod_m \big(\pi^{(j)}_{km}\big)^{\mathbb{1}[l_{ij}=m]}$ (with class priors $\rho_k$).
- **M-step:** re-estimate $\pi^{(j)}$ and $\rho_k$ from the soft posteriors.

The output is a **per-item posterior probability** (a principled soft label) *and* estimated reliability per annotator, which lets you down-weight or exclude bad raters and is far more robust than majority vote when redundancy is uneven. The posteriors can feed rungs 2/3.

**5. Joint / end-to-end annotator modeling.** Learn the classifier and an annotator-noise layer together (a per-annotator confusion transform on top of the shared classifier logits, optimized over the *raw* per-annotator labels rather than aggregated ones). This avoids a lossy two-stage pipeline and uses every individual vote, at the cost of more moving parts and tuning.

**Justified default:** ship **EM/Dawid–Skene soft posteriors + agreement weighting**, with majority vote kept as the baseline to beat. Reserve the joint model for when redundancy is high and the two-stage approach plateaus.
The default is principled (down-weights bad raters), keeps uncertainty (soft target), and degrades gracefully where the joint model wouldn't converge.

---

### Part 4 — Model, loss, calibration, thresholding

**Model.** For text, a fine-tuned pretrained transformer encoder (a BERT-family model) is the strong default; a TF-IDF + linear/GBM baseline is worth keeping for speed, interpretability, and as a sanity floor. The label strategy (Part 3) matters more than squeezing the architecture.

**Loss.**

- **Soft-target cross-entropy** when using soft labels (Part 3.2); the loss must match the target.
- Under heavy class imbalance, **class-weighted** loss or **focal loss** stops the majority class from swamping gradients, but tune carefully, since focal loss interacts with calibration.
- Optional **noise-robust losses** (symmetric or generalized cross-entropy) if hard majority-vote labels are unavoidable and you want robustness to flipped labels.

**Regularization against noise.** Early stopping on a *clean* validation fold, plus dropout/weight decay. Flexible nets fit the *clean* structure first and **memorize noisy labels late in training**, so the best generalization is often *before* convergence on the noisy training loss, which is exactly why early stopping (not just more epochs) is the lever here.

**Calibration.** Train and inference differ in label-confidence distribution, so **explicitly calibrate** predicted probabilities on a held-out clean fold (temperature scaling for a neural net; isotonic/Platt for others) and report **ECE / reliability diagrams**. Calibration matters because both the soft-label training and the thresholding decision depend on probabilities meaning what they say.

**Thresholding.** Don't default to $0.5$.

- Choose the operating threshold by **maximizing the target metric on validation** (F1, or a cost-weighted objective if FP/FN costs differ), then report it on the frozen test set.
- Report **threshold-free** metrics (ROC-AUC, and PR-AUC for imbalance) so model quality is judged independently of any one threshold.
- If decisions are downstream-cost-sensitive, pick the threshold on an explicit cost matrix, not a symmetric metric.

---

### Part 5 — Why offline gains can be illusory (and guards)

For every "+1 point," name what would make it disappear and the checkable guard, spanning the five axes the question calls out:

- **Labels:** the test set inherits the same annotator noise, so you're measuring agreement-with-noise, not correctness. *Guard:* high-redundancy / high-agreement (ideally adjudicated) test set; diagnose separately on the ambiguous slice.
- **Splits:** near-duplicate texts or annotator-ID/batch correlations let the model cheat. *Guard:* dedup; group-by-item and (when appropriate) group-by-annotator splits; temporal split.
- **The metric itself:** a tiny F1 gain can be inside test-set noise, or a threshold tuned to one metric is brittle. *Guard:* bootstrap CIs on the test metric; require the gain to clear the CI and to hold on PR-AUC and across subgroups, not one headline number. Also: under imbalance, accuracy is fooled by a majority-class predictor → lead with F1 / PR-AUC / per-class recall.
- **Data you cleaned:** aggressively dropping high-disagreement items inflates test scores while making the model worse on the cases that matter. *Guard:* keep ambiguous items as soft labels and report the ambiguous-slice soft metric (log-loss / Brier vs. $\hat p_i$).
- **Offline-to-production gap:** offline labels come from a curated pool; production text/users differ, and calibration can drift (a model better by AUC but miscalibrated hurts any thresholded downstream system). *Guard:* temporal + group-by-annotator generalization tests, report ECE alongside discrimination, and confirm with live metrics after deploy.

---

### Recommended end-to-end recipe (what I'd actually do)

1. **Analyze:** Fleiss' $\kappa$ / Krippendorff's $\alpha$ overall, per-item entropy, per-annotator confusion-vs-consensus and speed profiles; read a stratified sample of high-entropy items; dedup.
2. **Freeze evaluation:** high-redundancy / high-agreement test set, grouped + temporal split, plus a separate ambiguous diagnostic slice (log-loss / Brier vs. $\hat p_i$); pick metrics (F1 / PR-AUC / per-class recall) up front.
3. **Targets:** majority-vote baseline → EM/Dawid–Skene soft posteriors with agreement weighting; down-weight/exclude raters EM flags as unreliable.
4. **Train:** fine-tuned transformer (TF-IDF+linear baseline alongside), soft-target CE, class weighting if needed, early stop on clean val.
5. **Calibrate** (temperature scaling) and **tune the threshold** on val for the target/cost metric; report threshold-free metrics too.
6. **Validate the win:** bootstrap CIs, subgroup and ambiguous-slice breakdowns, leakage audit; only then call it an improvement, and confirm with live metrics post-deploy.

---

### Follow-up Questions

**1. The annotator-reliability (EM) model degenerates at low redundancy or extreme class priors. Why, and what's the fallback?**

EM/Dawid–Skene needs each annotator to overlap with enough commonly-labeled items to *identify* their confusion matrix. When redundancy is low (many items with one or two labels), the per-annotator $\pi^{(j)}$ are estimated from almost no overlap and become arbitrary: the model can swap the meaning of the two latent classes (label-switching / non-identifiability) or fit reliabilities to noise. Extreme class priors make it worse: if nearly everyone labels one class, "always predict the majority class" is a near-optimal but useless fixed point, and a confidently-wrong-everywhere annotator can be mistaken for a flipped-but-reliable one.

*Fallbacks:* add priors / smoothing (a Dirichlet/Beta prior on $\pi^{(j)}$ and $\rho_k$), pool sparse annotators into a shared "background" reliability, restrict EM to annotators with sufficient overlap, multiple random restarts with a sanity check that estimated reliabilities are plausible, and **fall back to majority vote / agreement-weighted soft labels** on the low-redundancy subset rather than trusting a collapsed EM solution.

**2. Offline F1 improved by a point, but online metrics didn't move. Likely explanations, and how to have caught it earlier?**

- The +1 was **inside test-set noise** → catch with bootstrap CIs before reporting.
- **Distribution shift**: offline curated text ≠ production traffic → catch with a temporal and group-by-annotator generalization test offline, not just a random split.
- **Metric mismatch**: offline F1 isn't what the product optimizes (the online KPI may be a downstream conversion/cost), or the chosen threshold doesn't match the live operating point → align the offline objective and threshold to the online cost up front.
- **Calibration drift / consumer mismatch**: a downstream system thresholds the probability differently, so a discrimination gain doesn't translate → monitor ECE and the live operating point.
- **Online noise / underpowered A/B**: the experiment can't detect a one-point change → power the test or measure a proxy.

The general earlier-catch is treating offline F1 as a *screen*, not proof: gate on CIs, a shift-aware split, and a metric/threshold that mirrors the live objective.

**3. Production won't carry annotator IDs at inference. How does that shape the split and the generalization question?**

If IDs vanish at inference, the model must **not** depend on annotator-specific signal, and the honest offline test is the **group-by-annotator (leave-annotators-out)** regime (train on some annotators, evaluate on entirely held-out ones), which estimates generalization to text labeled by people the model never "saw."
A random item split would let the model exploit annotator quirks present in both train and test and overstate production performance. Concretely: annotator IDs may be used *during training* (e.g., the joint annotator-noise layer in Part 3.5) but must be **marginalized out at inference** (predict the consensus/true-label head, not a per-annotator head), and the reported headline metric should come from the leave-annotators-out + temporal split that matches deployment.
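
The leave-annotators-out regime described in follow-up 3 can be sketched roughly as follows. This is one possible reading, not the only one: it assumes votes arrive as `(item_id, annotator_id, label)` triples, and the function name `leave_annotators_out` and the rule for mixed items (keep the item in train, but drop the held-out annotators' votes) are illustrative choices rather than anything the question prescribes.

```python
from collections import defaultdict


def leave_annotators_out(votes, held_out):
    """Split votes so that held-out annotators never influence training targets.

    votes: iterable of (item_id, annotator_id, label) triples (assumed layout).
    held_out: set of annotator ids reserved for evaluation.
    """
    by_item = defaultdict(list)
    for item_id, annotator_id, label in votes:
        by_item[item_id].append((annotator_id, label))

    train, test = {}, {}
    for item_id, item_votes in by_item.items():
        kept = [(a, l) for a, l in item_votes if a not in held_out]
        if kept:
            # At least one training annotator saw this item: train on their votes only.
            train[item_id] = kept
        else:
            # Only held-out annotators labeled this item: it becomes a test item.
            test[item_id] = item_votes
    return train, test


votes = [
    ("x", "a", 1), ("x", "b", 0),
    ("y", "c", 1), ("y", "d", 1),
    ("z", "a", 0), ("z", "c", 1),
]
train, test = leave_annotators_out(votes, held_out={"c", "d"})
```

Train and test items are disjoint by construction, so this composes with the item-level and temporal grouping from Part 2. The cost is that the test set shrinks to items labeled exclusively by held-out annotators, which may be small when annotator overlap is dense.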

Related Interview Questions

  • Implement 1NN with NumPy - OpenAI (medium)
  • Compute entropy and implement 1-NN - OpenAI (medium)
  • Defend a Research Direction and Experiment Design - OpenAI (medium)
  • Implement Backprop for a Tiny Network - OpenAI (hard)
  • Debug MiniGPT and Backpropagate Matmul - OpenAI (medium)

Feb 11, 2026, 12:00 AM

Problem

You are given a text dataset for a binary classification task (label in $\{0,1\}$). Each example has been labeled by multiple human annotators, and annotators often disagree: the same item can receive conflicting labels.

Your job has two halves:

  1. Perform a dataset / label analysis to understand the disagreement and the likely sources of label noise.
  2. Propose a training and evaluation approach that improves offline metrics (e.g., F1 / AUC / accuracy), given the noisy multi-annotator labels.

This is an open-ended applied-ML design discussion: there is no single "correct" pipeline. The interviewer is looking for how you reason about treating labels as a noisy, structured signal rather than as ground truth, and how you keep your offline evaluation honest.

Constraints & Assumptions

State these explicitly (and any others you add) as you go:

  • Available signal: raw text, per-annotator labels, annotator IDs, and label timestamps.
  • Levers you control: you can retrain models and change the label-aggregation strategy.
  • Hard limitation: you have limited or no ability to collect new labels, so you must extract maximum value from the existing annotation redundancy.
  • The class distribution may be imbalanced, and the number of annotators per item may vary (some items have one label, others many).

Clarifying Questions to Ask

A strong candidate scopes the whole problem before designing. Reasonable questions for the interviewer:

  • How much redundancy is there — what's the distribution of annotators-per-item, and how many items have only a single label?
  • What is the class balance, and which error type (false positive vs. false negative) is more costly downstream?
  • Is the disagreement believed to stem more from genuinely ambiguous items or from a few unreliable annotators — or is that exactly what we're trying to find out?
  • Will the production input distribution carry annotator IDs, or do we need to generalize to brand-new annotators / unseen text?
  • Is there any adjudicated / gold subset we can trust as ground truth?
  • Are there known guideline changes over the labeling period that could explain temporal drift?

Part 1 — Dataset & label-noise analysis

What analyses would you run, and what would you look for? Specifically, how would you (a) quantify how much annotators disagree, (b) characterize individual annotators, and (c) decide whether a given disagreement reflects real ambiguity vs. a bad labeler?

What This Part Should Cover

  • Chance-corrected agreement rather than raw % agreement, and an awareness of why imbalance inflates the naive number.
  • Per-item uncertainty (an empirical positive rate / entropy) used to rank items by ambiguity.
  • Per-annotator reliability derived from IDs (agreement-vs-consensus, bias toward one class, labeling speed/volume from timestamps).
  • A concrete test for separating intrinsic ambiguity from annotator noise (e.g., qualitatively reading high-entropy items, checking whether disagreement concentrates on a few raters).
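
As a rough illustration of the first two bullets above, here is a minimal sketch of per-item uncertainty and a chance-corrected agreement score. It assumes votes arrive as `(item_id, annotator_id, label)` triples; the function names are hypothetical. The kappa here is a Fleiss-style variant that tolerates a variable number of votes per item by averaging pairwise agreement over items with at least two votes, which is an assumption of this sketch and not the textbook fixed-rater definition.

```python
import math
from collections import defaultdict


def vote_counts(votes):
    """Tally [n_zeros, n_ones] per item from (item_id, annotator_id, label) triples."""
    counts = defaultdict(lambda: [0, 0])
    for item_id, _annotator_id, label in votes:
        counts[item_id][label] += 1
    return dict(counts)


def item_uncertainty(votes):
    """Per item: number of votes, empirical positive rate, binary entropy in nats."""
    stats = {}
    for item_id, (n0, n1) in vote_counts(votes).items():
        n = n0 + n1
        p = n1 / n
        # Convention: 0 * log(0) = 0, so unanimous items have zero entropy.
        h = 0.0 if p in (0.0, 1.0) else -(p * math.log(p) + (1 - p) * math.log(1 - p))
        stats[item_id] = {"n": n, "p_hat": p, "entropy": h}
    return stats


def fleiss_style_kappa(votes):
    """Chance-corrected agreement over items that have at least two votes."""
    rows = [c for c in vote_counts(votes).values() if sum(c) >= 2]
    if not rows:
        return float("nan")
    total = sum(sum(c) for c in rows)
    # Expected agreement from the pooled class marginals.
    p_e = sum((sum(c[k] for c in rows) / total) ** 2 for k in (0, 1))
    # Observed agreement: share of agreeing annotator pairs, averaged over items.
    p_o = sum(
        sum(x * (x - 1) for x in c) / (sum(c) * (sum(c) - 1)) for c in rows
    ) / len(rows)
    return float("nan") if p_e == 1.0 else (p_o - p_e) / (1 - p_e)


votes = [
    ("a", "r1", 1), ("a", "r2", 1), ("a", "r3", 1),
    ("b", "r1", 0), ("b", "r2", 0), ("b", "r3", 0),
    ("c", "r1", 1), ("c", "r2", 0), ("c", "r3", 1),
]
stats = item_uncertainty(votes)
most_ambiguous = max(stats, key=lambda i: stats[i]["entropy"])
kappa = fleiss_style_kappa(votes)
```

Sorting items by `entropy` gives the reading list for the qualitative pass; single-vote items always show zero entropy, so they should be tracked by `n` rather than read as confident.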

Part 2 — Splits that don't lie

How would you construct train / validation / test splits so that your offline metrics are not misleading? What is your "ground truth" for the test set when humans themselves disagree?

What This Part Should Cover

  • A defensible test ground truth — restricting the headline test set to high-redundancy, high-agreement (or adjudicated) items.
  • Leakage control — grouping by item / near-duplicate text, and isolating annotator-ID or batch artifacts.
  • Two explicit generalization regimes — known-annotator vs. new-annotator (group-by-annotator), tied to whether production carries annotator IDs.
  • Temporal discipline — splitting forward in time, and fitting any aggregation on the train fold only so test labels don't bleed into training targets.
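
The leakage-control and temporal bullets above can be combined in a small sketch. It assumes each item is a dict with hypothetical `id`, `text`, and `ts` fields, and it uses a deliberately crude near-duplicate key (lowercase, strip punctuation, collapse whitespace). A real pipeline would likely use something stronger such as shingling or embeddings, and the ASCII-only normalization here would not suit non-English text.

```python
import hashlib
import re
from collections import defaultdict


def near_duplicate_key(text):
    """Crude near-duplicate key: lowercase, drop punctuation, collapse whitespace."""
    normalized = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    normalized = " ".join(normalized.split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def grouped_temporal_split(items, cutoff):
    """Assign whole near-duplicate groups to train (before cutoff) or test (at/after)."""
    groups = defaultdict(list)
    for item in items:
        groups[near_duplicate_key(item["text"])].append(item)

    train, test, dropped = [], [], []
    for members in groups.values():
        timestamps = [m["ts"] for m in members]
        if max(timestamps) < cutoff:
            train.extend(members)
        elif min(timestamps) >= cutoff:
            test.extend(members)
        else:
            # The group straddles the cutoff; dropping it avoids train/test leakage.
            dropped.extend(members)
    return train, test, dropped


items = [
    {"id": 1, "text": "Great product!!", "ts": 1},
    {"id": 2, "text": "great   product", "ts": 5},
    {"id": 3, "text": "Okay, I guess.", "ts": 2},
    {"id": 4, "text": "Terrible support", "ts": 6},
]
train, test, dropped = grouped_temporal_split(items, cutoff=3)
```

Dropping straddling groups is the conservative choice; an alternative is to send them all to train, which keeps more data but makes the test set slightly less representative of recurring text.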

Part 3 — Turning multi-annotator labels into training targets

How would you convert the multiple per-annotator labels into the targets you actually train on? Lay out the options and give a justified default.

What This Part Should Cover

  • A spectrum of options — majority vote → soft/probabilistic labels → confidence weighting → annotator-quality–aware aggregation (e.g., Dawid–Skene/EM) → joint end-to-end annotator modeling.
  • Soft labels as the reframing for ambiguous items, motivated by what the "right" target is for a 50/50 item.
  • A justified default rather than a list, with the trade-offs (redundancy required, complexity, robustness) made explicit.
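
For the annotator-quality-aware option in the list above, a binary Dawid–Skene-style EM loop can be sketched in plain Python. This is a simplified illustration under stated assumptions: votes are `(item_id, annotator_id, label)` triples, posteriors are initialized from the empirical positive rate, and the `smoothing` pseudo-count (an addition of this sketch) keeps confusion estimates away from 0 and 1. It multiplies raw probabilities, so items with very many votes would need a log-space version.

```python
from collections import defaultdict


def dawid_skene_binary(votes, n_iter=50, smoothing=1.0):
    """Return (posterior, confusion) for binary labels.

    posterior[item] = P(z = 1 | votes)
    confusion[annotator][k][m] = P(label = m | z = k)
    """
    by_item = defaultdict(list)
    for item_id, annotator_id, label in votes:
        by_item[item_id].append((annotator_id, label))
    annotators = {a for _, a, _ in votes}

    # Initialize with the empirical positive rate (a soft majority vote).
    post = {i: sum(l for _, l in v) / len(v) for i, v in by_item.items()}
    conf = {}
    for _ in range(n_iter):
        # M-step: class prior and per-annotator confusion from soft posteriors.
        prior_pos = sum(post.values()) / len(post)
        prior = [1.0 - prior_pos, prior_pos]
        counts = {a: [[smoothing, smoothing], [smoothing, smoothing]] for a in annotators}
        for i, v in by_item.items():
            weight = [1.0 - post[i], post[i]]
            for a, l in v:
                for k in (0, 1):
                    counts[a][k][l] += weight[k]
        conf = {
            a: [[c[k][m] / (c[k][0] + c[k][1]) for m in (0, 1)] for k in (0, 1)]
            for a, c in counts.items()
        }
        # E-step: posterior over the latent true label for each item.
        for i, v in by_item.items():
            like = [prior[0], prior[1]]
            for a, l in v:
                for k in (0, 1):
                    like[k] *= conf[a][k][l]
            post[i] = like[1] / (like[0] + like[1])
    return post, conf


# Toy data: three careful raters plus one who always answers 1.
truth = [1, 1, 1, 0, 0, 0]
votes = []
for item_id, z in enumerate(truth):
    votes += [(item_id, g, z) for g in ("g1", "g2", "g3")]
    votes.append((item_id, "bad", 1))
posterior, confusion = dawid_skene_binary(votes)
```

The returned `posterior` values can serve directly as soft targets, and `confusion` gives the per-annotator reliability profile used to down-weight or exclude raters. Per the split discipline in Part 2, this would be fit on training-fold votes only.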

Part 4 — Model, loss, calibration, and thresholding

What model family, loss, calibration, and thresholding choices would you try, and why?

What This Part Should Cover

  • Model choice with a baseline — a fine-tuned text encoder plus a cheap linear/GBM floor, with the acknowledgment that the label strategy matters more than the architecture.
  • Loss matched to the target and the imbalance — soft-target CE for soft labels, class-weighting/focal under skew, noise-robust losses if hard labels are forced.
  • Noise-aware regularization — early stopping on a clean fold, recognizing that flexible models memorize noisy labels late.
  • Explicit calibration and objective-driven thresholding — temperature/Platt/isotonic + ECE, threshold tuned on a cost or F1 objective, and threshold-free metrics reported alongside.
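
Three of the pieces named above are small enough to sketch without a framework: a soft-target cross-entropy, an F1-driven threshold sweep, and a binary expected calibration error. All function names are hypothetical, and the ECE here is the common binary variant that bins on the predicted positive probability, which is an assumption of this sketch rather than the only definition in use.

```python
import math


def soft_bce(p_hat, q, eps=1e-12):
    """Cross-entropy of predicted probability q against soft target p_hat."""
    q = min(max(q, eps), 1 - eps)
    return -(p_hat * math.log(q) + (1 - p_hat) * math.log(1 - q))


def f1_at_threshold(probs, labels, threshold):
    """F1 of the positive class when predicting 1 for probs >= threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for yhat, y in zip(preds, labels) if yhat == 1 and y == 1)
    fp = sum(1 for yhat, y in zip(preds, labels) if yhat == 1 and y == 0)
    fn = sum(1 for yhat, y in zip(preds, labels) if yhat == 0 and y == 1)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def best_f1_threshold(probs, labels):
    """Sweep the observed scores; ties resolve to the lowest threshold."""
    return max(sorted(set(probs)), key=lambda t: f1_at_threshold(probs, labels, t))


def expected_calibration_error(probs, labels, n_bins=10):
    """Binary ECE: bin by predicted P(y=1), compare mean prediction to positive rate."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    ece = 0.0
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            pos_rate = sum(y for _, y in members) / len(members)
            ece += len(members) / len(probs) * abs(mean_p - pos_rate)
    return ece


val_probs = [0.10, 0.40, 0.35, 0.80]
val_labels = [0, 0, 1, 1]
threshold = best_f1_threshold(val_probs, val_labels)
```

The threshold is chosen on validation data and then held fixed for the test set; re-tuning it on test would fold threshold selection into the reported metric.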

Part 5 — Why offline gains can be illusory

What failure modes and edge cases could make an offline metric improvement misleading or non-reproducible — and what's your guard for each? Aim for breadth: think across the labels, the splits, the metric itself, the data you may have cleaned, and the gap between offline data and production.

What This Part Should Cover

  • Breadth across the named axes — at least one concrete failure mode each for the labels, the splits, the metric, over-cleaning, and the offline-to-production gap.
  • A checkable guard per failure mode, not just naming the risk.
  • Statistical honesty — bootstrap/CIs on the test metric, requiring a gain to clear the noise and to hold on more than one metric and on subgroups.
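
The statistical-honesty bullet can be made concrete with a paired bootstrap over test items. This is a minimal sketch, assuming hard 0/1 predictions from an old and a new model on the same frozen test set; the function names, the 2,000 resamples, and the percentile interval are illustrative defaults, not prescribed values.

```python
import random


def f1_score(preds, labels):
    """F1 of the positive class for hard 0/1 predictions."""
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def paired_bootstrap_f1_gain(preds_old, preds_new, labels, n_boot=2000, seed=0):
    """95% percentile interval for F1(new) - F1(old), resampling items in pairs."""
    rng = random.Random(seed)
    n = len(labels)
    gains = []
    for _ in range(n_boot):
        # Resample item indices once and score both models on the same resample.
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        new_f1 = f1_score([preds_new[i] for i in idx], y)
        old_f1 = f1_score([preds_old[i] for i in idx], y)
        gains.append(new_f1 - old_f1)
    gains.sort()
    return gains[int(0.025 * n_boot)], gains[int(0.975 * n_boot) - 1]


labels = [1, 0] * 20
preds_old = [0] * 40          # degenerate majority-style predictor
preds_new = list(labels)      # perfect predictor, for illustration only
low, high = paired_bootstrap_f1_gain(preds_old, preds_new, labels)
```

A gain would count as real only if the lower bound clears zero; pairing the resamples matters because both models are scored on the same items, which removes item-difficulty variance from the comparison.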

What a Strong Answer Covers

These dimensions span all parts — an interviewer listens for them throughout, not in any single part:

  • Separating sources of disagreement — distinguishing intrinsic item ambiguity from annotator noise/bias, and treating them differently in every stage (analysis, targets, evaluation), rather than collapsing labels to ground truth.
  • Internal consistency across the pipeline — the target chosen in Part 3 propagates coherently into the loss (Part 4) and the way ambiguous items are scored (Parts 2 and 5), instead of five disconnected answers.
  • Evaluation-first discipline — treating the test harness as the load-bearing decision and being skeptical that any offline gain is real.

Follow-up Questions

Be ready to go deeper:

  • Your annotator-reliability model gives strange estimates (or won't converge) when redundancy is low or one class dominates. Why does it degenerate, and what do you fall back to?
  • The offline F1 improves by a point, but online metrics don't move after deployment. What are the likely explanations, and how would you have caught it earlier?
  • The production system won't carry annotator IDs at inference time. How does that constraint shape your split design and your choice of which generalization question to test?