How do I approach Machine Learning interview questions?

Machine Learning questions require understanding of core concepts and practice. PracHub provides solutions with explanations to help you master machine learning interviews.

What difficulty level is this interview question?

This is a hard difficulty Machine Learning question, commonly asked during Onsite rounds at OpenAI.

What role is this question designed for?

This question is commonly asked for Machine Learning Engineer candidates at OpenAI during technical interviews.

Improve Training With Noisy Annotators | OpenAI Interview Question

Company: OpenAI

Role: Machine Learning Engineer

Category: Machine Learning

Difficulty: hard

Interview Round: Onsite

You are given a labeled training dataset as a Pandas DataFrame. Each row contains feature columns, an observed `label`, and an `annotator_id` identifying who produced that label. The annotators vary in quality, so some labels are noisy or wrong. You are also given baseline model-training code that trains a classifier on the raw dataset and reports a validation metric.

Your task: **design and implement a data-cleaning, relabeling, or reweighting approach that improves the model's validation performance over the baseline**, using only the training data, without consulting the validation/test labels to decide what to clean.

This is a hands-on, open-ended applied-ML exercise: you write code against the DataFrame, run experiments, and defend your choices. Expect the interviewer to interleave fundamentals questions (precision, recall, F1) while you work. In your solution, walk through and implement how you would:

1. Establish and interpret the baseline.
2. Measure label quality and annotator reliability.
3. Clean, relabel, remove, or reweight examples based on those estimates.
4. Retrain the model and decide whether the change genuinely improved performance.
5. Explain the basic classification metrics that come up (precision, recall, F1, and how class imbalance changes their interpretation).

You may operate on the DataFrame directly or convert to NumPy; demonstrate comfort with `groupby`, `merge`, and vectorized/boolean indexing.

```hint Where to start
Resist changing anything until you have characterized the baseline beyond the headline metric: look per-class, look at the confusion matrix, look at class balance, and ask whether the errors cluster by `annotator_id`. Also nail down the data regime first, because the whole strategy turns on it: is each item labeled by one annotator or several, and are any labels trusted/expert?
```

```hint Find the noisy signal without leaking
You must flag likely-mislabeled rows without ever looking at the validation labels. Watch for the obvious trap: a model's loss on the exact rows it trained on tells you little, because it can drive that loss to zero on the very examples it memorized, including the wrong ones. What kind of prediction would let you judge a row *as if the model had never seen it*?
```

```hint Aggregating multiple annotators
If several annotators label the same item, ask whether you can beat plain majority vote. What would you need to estimate about each individual annotator to weight their vote, and could you estimate the items' true labels and each annotator's tendencies jointly, even with no trusted labels at all? If trusted labels do exist, think about why an annotator with very few labels shouldn't be ranked on raw accuracy.
```

```hint Act on the signal, gently
Deleting suspect rows is the most destructive move and shrinks your data; reach for it last. Order your options from gentle to drastic and prefer the gentlest that survives the held-out check: can you keep every row but make a suspect one *count less*, or keep it and *soften its target*, before you delete anything? Whatever you choose, re-validate on the untouched split across multiple seeds.
```

### Constraints & Assumptions

- Input is a Pandas DataFrame: feature columns + `label` + `annotator_id`. The label space is small (binary or low-cardinality multiclass).
- A frozen validation split and baseline training code are provided; you optimize a single primary validation metric. Retraining is cheap enough to run several times (multiple seeds / folds).
- The data regime is **not specified up front**: you may or may not have repeated annotations per item, a small set of gold labels, or only one annotation per row. Your method must adapt to what you find.
- Label noise is **annotator-driven** (quality varies by who labeled), not purely random per row.
- The validation set is held out: cleaning decisions must never consult validation/test labels.
- Classes may be imbalanced.

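
One way to act on the hints about leakage-free detection and gentle intervention is sketched below. It is a minimal illustration, not the expected solution: the synthetic DataFrame, the feature names `f0` and `f1`, the annotator names, the flip rates, the choice of logistic regression, and the 0.1 weight floor are all assumptions made up for the example. Out-of-fold probabilities score each row with a model that never trained on it, and the probability assigned to the observed label (`self_conf`, a hypothetical column name) becomes a per-row sample weight.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

rng = np.random.default_rng(0)

# Synthetic stand-in for the interview DataFrame (all values are made up).
n = 600
y_true = rng.integers(0, 2, size=n)
X = rng.normal(loc=np.where(y_true[:, None] == 1, 2.0, -2.0), scale=1.0, size=(n, 2))
annotator = np.where(np.arange(n) % 2 == 0, "a_good", "a_bad")
flip_rate = np.where(annotator == "a_bad", 0.35, 0.02)  # assumed noise levels
flipped = rng.random(n) < flip_rate
df = pd.DataFrame({
    "f0": X[:, 0],
    "f1": X[:, 1],
    "label": np.where(flipped, 1 - y_true, y_true),
    "annotator_id": annotator,
})

features = ["f0", "f1"]
y = df["label"].to_numpy()

# Out-of-fold probabilities: every row is scored by a model that did not train on it.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
oof = cross_val_predict(LogisticRegression(), df[features], y, cv=cv, method="predict_proba")

# Probability the held-out model assigns to the label the annotator actually gave.
classes = np.unique(y)
df["self_conf"] = oof[np.arange(len(df)), np.searchsorted(classes, y)]

# Per-annotator summary: low mean self-confidence hints at a noisier annotator.
per_annotator = df.groupby("annotator_id")["self_conf"].agg(["mean", "count"])
print(per_annotator)

# Gentle intervention: keep every row, but let suspect rows count less.
df["weight"] = df["self_conf"].clip(lower=0.1)
model = LogisticRegression().fit(df[features], y, sample_weight=df["weight"])
```

Only training rows are touched here; the frozen validation split would be used afterwards, across several seeds, to check whether the reweighted model actually beats the baseline. A low mean `self_conf` for an annotator is a signal to inspect, not proof of poor quality, since a hard region of feature space can produce the same pattern.
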
### Clarifying Questions to Ask

- Do some examples have trusted gold/expert labels, or is everything annotator-provided?
- Is each example labeled by one annotator or by multiple annotators (do annotations overlap)?
- What is the primary validation metric I'm optimizing (accuracy, macro-F1, AUROC), and is there a per-class requirement?
- How imbalanced are the classes, and is one class more costly to get wrong?
- Is the noise believed to be random (label flips) or systematic (specific annotators or specific feature regions)?
- Does the training code accept sample weights or soft/probabilistic targets, or only hard labels?

### What a Strong Answer Covers

- **Baseline rigor**: trains the unchanged model first and records the right diagnostics (the primary metric, class balance, confusion matrix, per-class and per-annotator error breakdowns) before touching the data.
- **Leakage-free noise estimation**: uses out-of-fold predictions, not same-row training loss, and never touches the eval split.
- **Principled reliability estimation** matched to the regime (gold-calibrated, multi-annotator agreement/EM, or single-annotation distributional/model-based signals), with smoothing for low-count annotators.
- **A graded intervention** (reweight → soft relabel → filter), chosen for the situation rather than a one-size deletion.
- **Honest evaluation**: fixed validation split, stability across seeds/folds, attention to minority-class harm, and willingness to report a null result with a plausible causal story.
- **Metric fluency**: precise definitions of precision/recall/F1 and how imbalance and macro vs. weighted averaging change the interpretation.

### Follow-up Questions

- Walk through the E and M steps of a Dawid–Skene estimator. What is its failure mode when most annotators agree on the *wrong* answer?
- One annotator labeled only 12 examples and looks "100% accurate." Why is that misleading, and how do you handle low-count annotators?
- The noise is **systematic**: one annotator consistently confuses two specific classes. How would your estimate and your fix differ from the random-flip case?
- Your cleaning improves macro-F1 but lowers overall accuracy. How do you decide whether to ship it? And if the validation set was labeled by the same noisy annotators, how does that change how much you trust the measured gain?
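
For the first follow-up, a compact sketch of the Dawid–Skene EM loop may help make the two steps concrete. It assumes the multi-annotator regime (several annotators per item), integer-coded `item`, `annotator`, and `label` columns in long format, and an additive `smoothing` pseudo-count. The function name `dawid_skene`, these column names, and the simulated annotator accuracies are illustrative assumptions, not part of the question.

```python
import numpy as np
import pandas as pd

def dawid_skene(ann, n_classes, n_iter=50, smoothing=1.0):
    """Minimal Dawid-Skene EM on a long-format table of (item, annotator, label)."""
    items = ann["item"].to_numpy()
    annos = ann["annotator"].to_numpy()
    labels = ann["label"].to_numpy()
    n_items, n_annos = items.max() + 1, annos.max() + 1

    # Initialise item posteriors from per-item vote fractions (soft majority vote).
    post = np.zeros((n_items, n_classes))
    np.add.at(post, (items, labels), 1.0)
    post /= post.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # M-step: class prior and one confusion matrix per annotator,
        # conf[j, k, l] = P(annotator j says l | true class k).
        prior = post.mean(axis=0)
        conf = np.full((n_annos, n_classes, n_classes), smoothing)
        for k in range(n_classes):
            np.add.at(conf[:, k, :], (annos, labels), post[items, k])
        conf /= conf.sum(axis=2, keepdims=True)

        # E-step: posterior over each item's true class given all of its votes.
        log_post = np.tile(np.log(prior + 1e-12), (n_items, 1))
        np.add.at(log_post, items, np.log(conf[annos, :, labels]))
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)

    return post, conf, prior

# Simulated data (assumed accuracies): three reliable annotators, two weak ones.
rng = np.random.default_rng(0)
n_items, acc = 400, np.array([0.9, 0.9, 0.85, 0.6, 0.55])
truth = rng.integers(0, 2, size=n_items)
rows = []
for j, a in enumerate(acc):
    correct = rng.random(n_items) < a
    rows.append(pd.DataFrame({"item": np.arange(n_items), "annotator": j,
                              "label": np.where(correct, truth, 1 - truth)}))
ann = pd.concat(rows, ignore_index=True)

post, conf, prior = dawid_skene(ann, n_classes=2)
est_label = post.argmax(axis=1)
# Mean of each annotator's confusion-matrix diagonal as a rough reliability score.
est_accuracy = np.diagonal(conf, axis1=1, axis2=2).mean(axis=1)
print(est_accuracy.round(2))
```

Because the posteriors are initialised from vote fractions and no trusted labels enter the loop, the estimator can only recover the consensus: if most annotators share the same systematic error, EM tends to reinforce it and rate the correct minority as unreliable. The pseudo-count keeps an annotator with only a handful of labels from receiving an extreme confusion matrix.
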

Quick Answer: This question evaluates a candidate's ability to handle label noise by estimating annotator reliability and applying data-cleaning, relabeling, or reweighting methods while measuring model performance and understanding classification metrics.
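
The classification metrics named above can be made concrete with a small worked example. The sketch below computes precision, recall, and F1 per class from raw counts on a made-up 95/5 imbalanced label vector; the helper name `per_class_report` and the data are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def per_class_report(y_true, y_pred):
    """Per-class precision, recall, F1, and support, computed from raw counts."""
    rows = []
    for c in np.unique(np.concatenate([y_true, y_pred])):
        tp = np.sum((y_pred == c) & (y_true == c))  # predicted c, truly c
        fp = np.sum((y_pred == c) & (y_true != c))  # predicted c, truly something else
        fn = np.sum((y_pred != c) & (y_true == c))  # truly c, predicted something else
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        rows.append({"class": c, "precision": precision, "recall": recall,
                     "f1": f1, "support": tp + fn})
    return pd.DataFrame(rows).set_index("class")

# 95 negatives and 5 positives; the predictions contain 1 false positive,
# 1 true positive, and 4 false negatives.
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.array([0] * 94 + [1] + [1] + [0] * 4)

report = per_class_report(y_true, y_pred)
accuracy = np.mean(y_true == y_pred)
macro_f1 = report["f1"].mean()                                       # every class counts equally
weighted_f1 = np.average(report["f1"], weights=report["support"])    # dominated by the majority class
print(report)
print(accuracy, macro_f1, weighted_f1)
```

Here accuracy is 0.95 while positive-class recall is only 0.2, and macro-F1 (about 0.63) sits far below weighted F1 (about 0.94), which is why the headline metric alone can hide minority-class harm.
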

