How do I approach Machine Learning interview questions?

Machine Learning questions require understanding of core concepts and practice. PracHub provides solutions with explanations to help you master machine learning interviews.

What difficulty level is this interview question?

This is a medium-difficulty Machine Learning question, commonly asked during Onsite rounds at OpenAI.

What role is this question designed for?

This question is commonly asked for Machine Learning Engineer candidates at OpenAI during technical interviews.

Filter Bad Human Annotations | OpenAI Interview Question

Company: OpenAI

Role: Machine Learning Engineer

Category: Machine Learning

Difficulty: Medium

Interview Round: Onsite

You are given a large training dataset labeled by human annotators. Some of those annotations are low quality: inconsistent, rushed, the result of misunderstood instructions, systematically biased, adversarial, or simply wrong. If you train on the data as-is, the noise will cap your model's ceiling and may bake in harmful biases.

Design a practical, production-grade method to **identify and filter bad annotations before training**. The hard part is not detecting obvious mistakes; it is doing so without destroying *genuinely difficult* examples (which also tend to have low agreement) and without unfairly penalizing minority annotators or minority-data subgroups. Your design should work on real, messy, large-scale data, not only in a clean academic setup.

Concretely, your answer should address:

- The signals you would use at the **example level** and the **annotator level**.
- How you distinguish a **hard / ambiguous example** from a **bad label**; they look similar (both produce disagreement).
- Whether you would **remove, relabel, or down-weight** suspicious data, and how you decide per item.
- How you would **build a scoring pipeline** that assigns a quality score to each annotation and routes it to an action.
- How you would **evaluate** the filtering system (both label quality and downstream model impact).
- The **failure modes and fairness risks** you would watch for.

```hint Where to start
Resist treating this as one big "is this label bad?" classifier. There are really two unknowns tangled together (something about *who* is labeling and something about *which labels* are wrong), and they inform each other. Also ask what small, expensive resource could calibrate your thresholds so they're principled rather than guessed.
```

```hint The core tension: hard vs. bad
Disagreement alone cannot separate a hard example from a bad label, so a raw agreement threshold will betray you. Look for an *orthogonal* signal. A useful prompt: among annotators you have reason to trust, does genuine difficulty leave a different fingerprint than carelessness or error does? Think about what else (besides whether people agreed) you could estimate per example or per annotator to break the tie.
```

```hint Annotator reliability without ground truth
You usually don't have ground truth, so "accuracy on a gold set" is both scarce and confounded by example difficulty. Consider whether the redundant labels you already have could let you estimate each annotator's quality *and* the likely true labels at the same time, without grading against answers, and what kind of statistical machinery does that kind of joint, chicken-and-egg estimation. Watch the cost: items with only one label give such an approach little to work with.
```

```hint Action, not just a score
A scalar quality score is only useful if it drives a *decision*, so think about the menu of actions available beyond a simple keep/delete switch, and which one each kind of item should map to. Push yourself on what to do with the ambiguous middle in particular, given that throwing data away is irreversible and your hardest examples live there.
```

```hint Pitfalls to pre-empt
Strong answers name their own failure modes before the interviewer does. Brainstorm where this pipeline could quietly hurt you: think about anything you trust as a stand-in for "truth," what happens to under-represented data under any agreement-based rule, and why a "cleaner-looking" dataset might actually be worse. For each one you raise, be ready to attach a concrete guardrail rather than just naming the risk.
```

### Constraints & Assumptions

State your own where the problem is silent, but design against roughly this scale and setting:

- ~10M annotations, ~1M unique examples, produced by ~2,000–5,000 annotators over time (each example may have 1–5 labels).
- Tasks are a mix of **classification** (single/multi-label) and **structured outputs** (spans, bounding boxes, JSON with a schema). Your signals should generalize across both.
- You can afford to expert-review only a small fraction (≪1%) of items; expert relabeling is the scarcest resource.
- A reasonably good baseline model exists (or can be trained), but it is **imperfect and possibly biased**; it must not be the sole arbiter of truth.
- The data spans multiple languages, demographic segments, and product surfaces; some are under-represented.
- The pipeline should run continuously as new data arrives, not just once.

### Clarifying Questions to Ask

- What is the **label redundancy**: how many independent annotations per example, and is it uniform or only on a sampled subset?
- Is there an existing **trusted/gold set**, or do we need to bootstrap one? How much expert review budget do we have per week?
- What is the **task type and output schema** (single-label, multi-label, spans, structured JSON)? Are there hard validity rules we can check programmatically?
- What does "bad" cost us: is the downstream model more sensitive to **false rejects** (losing hard data) or **false accepts** (keeping noise)?
- Are annotator identities, timestamps, and interaction metadata (time-on-task, edits, skips) available, and are there privacy/policy limits on using them?
- Which **subgroups, languages, or product surfaces** must we protect from disparate filtering, and do we have segment labels to measure that?

### What a Strong Answer Covers

- A clear **taxonomy of bad-annotation causes** (random error, low effort, instruction misunderstanding, systematic bias, adversarial, intrinsic ambiguity) and why each needs a different response.
- **Two coupled estimators**: annotator reliability (ideally confusion-matrix / EM-style, not just gold accuracy) and per-example label-correctness, anchored to a gold set.
- A concrete **multi-signal scoring pipeline** with both example-level and annotator-level signals, and an explicit story for how signals are normalized and combined.
- A principled **hard-vs-bad disambiguation** that doesn't reduce to "low agreement = bad."
- **Graded actions** (keep / down-weight / soft-label / expert-review / reject) rather than binary delete, with the reasoning for choosing per band.
- **Two-track evaluation**: detection precision/recall on audited samples *and* downstream model lift (held-out trusted set, calibration, hard-slice and per-subgroup metrics).
- Explicit **failure modes and fairness guardrails**, each paired with a mitigation, plus the **continuous/online** operation of the pipeline.

### Follow-up Questions

- The baseline model you use to flag "model disagreement" was itself trained on this noisy data. How do you prevent the filter from amplifying the model's existing biases (a self-confirming feedback loop)?
- An annotator has *high* agreement with everyone but is systematically wrong on one rare class. How does your system catch a *confidently consistent* error, not just a random or low-effort one?
- You filter the data, retrain, and overall accuracy goes up but performance on one low-resource language drops. Walk through how you'd diagnose whether the filter caused it and how you'd fix it without reverting the whole pipeline.
- How would you make the pipeline **robust to adversarial annotators** who learn the quality checks (e.g. pass the gold items but corrupt real items)?
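
The "two coupled estimators" idea (annotator reliability and per-example label correctness, estimated jointly from redundant labels) can be illustrated with a confusion-matrix EM in the style of Dawid and Skene. The source names only "confusion-matrix / EM-style" estimation, so the choice of this particular model is an assumption. The function name `dawid_skene`, the `(example_id, annotator_id, label)` input layout, and the `smoothing` value are likewise hypothetical. This is a minimal single-label classification sketch, not a production implementation: it ignores gold anchoring, structured outputs, and scale.

```python
import numpy as np


def dawid_skene(labels, n_classes, n_iter=50, smoothing=0.01):
    """Jointly estimate true-label posteriors and annotator confusion matrices.

    labels: iterable of (example_id, annotator_id, label) integer triples,
            with ids numbered from 0 and at least one label per example.
    Returns (post, conf):
      post[i, k]    = estimated P(true label of example i is k)
      conf[a, k, l] = estimated P(annotator a says l | true label is k)
    """
    labels = np.asarray(labels, dtype=int)
    ex, ann, lab = labels[:, 0], labels[:, 1], labels[:, 2]
    n_ex, n_ann = ex.max() + 1, ann.max() + 1

    # Initialise posteriors with per-example vote fractions (soft majority vote).
    post = np.zeros((n_ex, n_classes))
    np.add.at(post, (ex, lab), 1.0)
    post /= post.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # M-step: class prior and per-annotator confusion matrices,
        # weighting each observed label by the current posterior.
        prior = post.mean(axis=0)
        conf = np.full((n_ann, n_classes, n_classes), smoothing)
        for k in range(n_classes):
            np.add.at(conf[:, k, :], (ann, lab), post[ex, k])
        conf /= conf.sum(axis=2, keepdims=True)

        # E-step: recompute posteriors from the prior and every annotator's
        # likelihood of producing the label they actually gave.
        log_post = np.tile(np.log(prior + 1e-12), (n_ex, 1))
        np.add.at(log_post, ex, np.log(conf[ann, :, lab]))
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)

    return post, conf


# Toy usage: annotators 0 and 1 are careful, annotator 2 always answers class 0.
votes = []
for i in range(20):
    truth = i % 2
    votes += [(i, 0, truth), (i, 1, truth), (i, 2, 0)]
post, conf = dawid_skene(votes, n_classes=2)
print(np.round(conf[2], 3))  # confusion matrix of the annotator who always says 0
```

A confusion matrix per annotator, rather than a single accuracy number, is what lets this kind of model surface an annotator who is consistently wrong on one class while agreeing with everyone elsewhere. As the hint warns, examples with a single label contribute almost nothing to the estimate.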

Quick Answer: This question evaluates expertise in data quality and annotation filtering for machine learning, including annotator reliability modeling, noisy-label detection, fairness-aware filtering, and assessing downstream model impact.
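
To make "fairness-aware filtering" concrete, the sketch below routes each annotation to one of the graded actions named in the problem (keep, down-weight, soft-label, expert-review, reject) and then audits rejection rates per subgroup. Everything specific here is an assumption for illustration: the `Annotation` fields, the thresholds `KEEP`, `REJECT` and `HARD`, the `review_budget` parameter, and the 2x disparity rule. In practice the thresholds would be calibrated on an audited gold set rather than hard-coded.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Annotation:
    example_id: str
    subgroup: str      # e.g. language or product surface
    p_correct: float   # estimated probability that the label is right
    difficulty: float  # 0..1, e.g. label entropy among trusted annotators


# Hypothetical thresholds, chosen only for this example.
KEEP, REJECT, HARD = 0.9, 0.2, 0.6


def route_all(annotations, review_budget):
    """Map each annotation to a graded action instead of keep/delete."""
    actions = []
    for a in annotations:
        if a.p_correct >= KEEP:
            actions.append("keep")
        elif a.difficulty >= HARD:
            # Trusted annotators disagree too: treat the item as genuinely
            # ambiguous and keep the label distribution rather than dropping it.
            actions.append("soft_label")
        elif a.p_correct <= REJECT:
            actions.append("reject")
        elif review_budget > 0:
            # Ambiguous middle band: spend scarce expert review while it lasts.
            actions.append("expert_review")
            review_budget -= 1
        else:
            actions.append("down_weight")
    return actions


def rejection_rates(annotations, actions):
    """Fraction of annotations rejected within each subgroup."""
    total, rejected = defaultdict(int), defaultdict(int)
    for a, act in zip(annotations, actions):
        total[a.subgroup] += 1
        rejected[a.subgroup] += act == "reject"
    return {g: rejected[g] / total[g] for g in total}


def flag_disparate(rates, overall_rate, max_ratio=2.0):
    """Subgroups rejected more than max_ratio times as often as overall."""
    return sorted(g for g, r in rates.items() if r > max_ratio * overall_rate)


# Toy usage with made-up scores.
anns = [
    Annotation("e1", "en", 0.95, 0.1),
    Annotation("e2", "en", 0.50, 0.8),
    Annotation("e3", "en", 0.50, 0.2),
    Annotation("e4", "en", 0.60, 0.1),
    Annotation("e5", "sw", 0.10, 0.1),
    Annotation("e6", "sw", 0.15, 0.3),
]
acts = route_all(anns, review_budget=1)
rates = rejection_rates(anns, acts)
overall = acts.count("reject") / len(acts)
print(acts, rates, flag_disparate(rates, overall))
```

A flagged subgroup is a prompt for investigation rather than proof of bias: the subgroup may truly have noisier labels, or the scores feeding `p_correct` may be less reliable on under-represented data.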
