What difficulty level is this interview question?

This is a medium difficulty Machine Learning question, commonly asked during Technical Screen rounds at OpenAI.

What role is this question designed for?

This question is commonly asked for Machine Learning Engineer candidates at OpenAI during technical interviews.

Debug a Broken Transformer | OpenAI Interview Question

Q: Debug a Broken Transformer

This question evaluates proficiency in debugging and reconfiguring Transformer-based deep learning models. It covers model internals (attention mechanisms, positional encodings), data and label pipeline alignment, tensor shaping, gradient flow, and optimizer interactions. It is commonly asked because it reveals practical troubleshooting ability and architectural understanding: it primarily tests practical application, backed by conceptual reasoning about attention-specific failure modes and training-versus-evaluation behavior.

You are handed a Transformer model implementation (PyTorch-style) that does not train correctly — the loss is not behaving as expected. The model runs a forward pass without crashing, so the bug is logical rather than a syntax error. Walk through how you would debug it systematically, from the data input all the way to the optimizer, naming the concrete checks you would run at each stage and why — then repurpose the model for classification and prove the change is wired up correctly.

This question has two parts.

Constraints & Assumptions

You have the full training script — data loader, model definition (embeddings, multi-head attention, FFN, normalization, positional encoding), loss, and training loop — and can add prints/breakpoints and run small experiments.
"Does not train correctly" means it executes but the loss misbehaves (flat / NaN / chance-level / eval-only failure); it is not a hard crash with a stack trace.
Assume a typical config (d_model, num_heads, num_layers, vocab size $V$, max length $L$) on a small dataset with a single GPU/CPU, so favor cheap, high-signal checks over long runs.
The base model may be encoder-style or decoder/LM-style; call out where that distinction changes masking or the head.
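
Where the base model is decoder/LM-style, one inexpensive way to confirm that the causal mask really blocks future positions is a perturbation test: change the input at one position and check that outputs at earlier positions do not move. The sketch below is illustrative only. `self_attention` and `leaks_future` are hypothetical names, and the single-head attention is a stand-in for whatever attention module the script under test actually defines.

```python
import math
import torch

def self_attention(x, w_q, w_k, w_v, causal=True):
    """Single-head scaled dot-product self-attention. x: (B, T, D)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # (B, T, T)
    if causal:
        T = x.size(1)
        # True above the diagonal marks key positions that lie in the future.
        future = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(future, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v  # softmax over the key axis

def leaks_future(attn_fn, x, position):
    """Perturb one position and report whether any earlier output changes."""
    x_perturbed = x.clone()
    x_perturbed[:, position] += 1.0
    before, after = attn_fn(x), attn_fn(x_perturbed)
    return not torch.allclose(before[:, :position], after[:, :position], atol=1e-6)

torch.manual_seed(0)
B, T, D = 2, 6, 8
x = torch.randn(B, T, D)
w_q, w_k, w_v = (torch.randn(D, D) / math.sqrt(D) for _ in range(3))

causal_leak = leaks_future(
    lambda t: self_attention(t, w_q, w_k, w_v, causal=True), x, position=4)
bidirectional_leak = leaks_future(
    lambda t: self_attention(t, w_q, w_k, w_v, causal=False), x, position=4)
print("causal leaks:", causal_leak, "| bidirectional leaks:", bidirectional_leak)
```

For an encoder-style model the second result is the expected one, so the same helper only flags a bug when the task is supposed to be causal.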

Clarifying Questions to Ask

What exactly is the symptom (flat loss, NaN/Inf, chance-level accuracy, or only the validation/eval path failing), and at which step does it appear?
Is this a from-scratch implementation or a fine-tune of a pretrained model, and was it ever known to work?
Is the model encoder-only (bidirectional), decoder-only (causal, next-token), or encoder-decoder, and what is the current task/loss?
Are mixed precision, gradient accumulation, or LR warmup/schedulers in play?
For the conversion: how many classes, single-label or multi-label, and is there a [CLS]-style token or pooling convention I should follow?

Part 1 — Debug the broken Transformer

The model executes but does not learn: the loss is flat (e.g. stuck near $\log V$, the cross-entropy of a uniform guess over the vocabulary), goes to NaN or diverges, accuracy never leaves chance level, or training works but the eval path is broken. Lay out an ordered, hypothesis-driven debugging plan from data input to optimization. Explicitly address the common Transformer failure points (tokenization and label alignment, tensor shapes, attention masking, positional encoding, train-vs-eval mode, loss-function setup, gradient flow, and optimizer configuration), and for each say what you would inspect and what a bug there looks like.
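
As a reference point for the "stuck near $\log V$" symptom: a model that predicts a uniform distribution over $V$ classes has cross-entropy exactly $\ln V$ nats, so a loss that hovers there suggests the model is learning nothing from its inputs. The small helper below is a sketch of that arithmetic. The function names, the vocabulary size, and the tolerance are illustrative assumptions, not values from the question.

```python
import math

def chance_level_loss(num_classes: int) -> float:
    """Cross-entropy (in nats) of a uniform prediction over num_classes outcomes."""
    return math.log(num_classes)

def looks_stuck_at_chance(loss: float, num_classes: int, rel_tol: float = 0.02) -> bool:
    """True if the observed loss is within rel_tol of the uniform-prediction baseline."""
    return math.isclose(loss, chance_level_loss(num_classes), rel_tol=rel_tol)

V = 32_000  # hypothetical vocabulary size
baseline = chance_level_loss(V)
print(f"uniform-prediction loss for V={V}: {baseline:.3f} nats")
print("loss 10.4 looks stuck:", looks_stuck_at_chance(10.4, V))
print("loss 5.0 looks stuck:", looks_stuck_at_chance(5.0, V))
```

The same baseline applies after a conversion to $C$ classes, with $\ln C$ in place of $\ln V$.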

What This Part Should Cover

A prioritized, hypothesis-driven plan (reproduce → classify the symptom → cheapest high-signal checks first), not a flat checklist read top to bottom.
Data-pipeline validation before model internals: token-id validity, padding/special-token consistency across loader/mask/loss, label range and dtype, and correct input↔label alignment for the task type.
The overfit-a-tiny-batch test used as the decisive early experiment, with a clear reading of what each outcome rules in or out.
Shape discipline end to end: from embeddings through head split/merge to scores and back, with attention to transpose and reshape order.
Attention correctness: an audit of score scaling, mask shape/convention/fill value, and the softmax axis.
Positional encoding, residuals, normalization placement, and train/eval mode all handled correctly.
Loss/optimizer wiring: raw-logits-vs-probabilities expectation, target shape/dtype, padding handling, update-step ordering, finite nonzero grads, LR sanity, gradient clipping.
Numerical-stability instincts for NaN/divergence: what to inspect and in what order.
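
The overfit-a-tiny-batch test from the list above can be sketched as follows. This is a minimal illustration under stated assumptions: `overfit_one_batch` and `TinyStandIn` are hypothetical names, the stand-in model (embedding, mean pool, linear head) takes the place of the real Transformer, and the step count and learning rate are arbitrary choices for a toy problem.

```python
import torch
import torch.nn as nn

def overfit_one_batch(model, inputs, targets, loss_fn, steps=200, lr=5e-2):
    """Train repeatedly on one fixed batch and return the loss at every step."""
    model.train()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    losses = []
    for _ in range(steps):
        optimizer.zero_grad()            # clear gradients from the previous step
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return losses

class TinyStandIn(nn.Module):
    """Hypothetical stand-in for the model under test (no dropout, no attention)."""
    def __init__(self, vocab_size=32, d_model=16, num_classes=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, token_ids):
        return self.head(self.embed(token_ids).mean(dim=1))  # (B, num_classes)

torch.manual_seed(0)
inputs = torch.arange(32).view(4, 8)     # 4 sequences with disjoint token ids
targets = torch.tensor([0, 1, 2, 0])     # class indices, dtype int64
losses = overfit_one_batch(TinyStandIn(), inputs, targets, nn.CrossEntropyLoss())
print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

If the real model cannot drive the loss toward zero on a batch this small (with dropout and other regularization disabled), the fault is likely in the wiring (labels, masking, loss, gradient flow, or the update step) rather than in capacity or hyperparameters.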

Part 2 — Convert it to a classification model and verify

As a follow-up, the interviewer asks you to repurpose this Transformer as a sequence classification model over $C$ classes. Describe the changes you would make, then describe a minimal verification — a single forward and backward pass on one batch — that proves the modified model is correctly wired before any real training run.
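
One possible shape for the conversion is a wrapper that pools the backbone's hidden states and applies a linear head. The sketch below uses masked mean pooling, which is only one of the options (pooling on a [CLS]-style token is another). `SequenceClassifier` and `TinyBackbone` are hypothetical names, the backbone interface `(token_ids, pad_mask) -> (B, T, d_model)` is an assumption, and `TinyBackbone` is a stand-in built from `nn.TransformerEncoder` rather than the model in the question.

```python
import torch
import torch.nn as nn

class SequenceClassifier(nn.Module):
    """Wraps a backbone that returns (B, T, d_model) hidden states with a pooled head."""
    def __init__(self, backbone, d_model, num_classes):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, token_ids, pad_mask):
        # pad_mask: bool tensor (B, T), True where the position is padding.
        hidden = self.backbone(token_ids, pad_mask)            # (B, T, d_model)
        keep = (~pad_mask).unsqueeze(-1).to(hidden.dtype)      # (B, T, 1)
        # Average only over real tokens so padding cannot dilute the summary.
        pooled = (hidden * keep).sum(dim=1) / keep.sum(dim=1).clamp(min=1.0)
        return self.head(pooled)                               # (B, C) raw logits

class TinyBackbone(nn.Module):
    """Hypothetical encoder-style stand-in with learned positional embeddings."""
    def __init__(self, vocab_size=100, d_model=32, nhead=4, num_layers=2, max_len=16):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, nhead, dim_feedforward=64, dropout=0.0, batch_first=True)
        self.encoder = nn.TransformerEncoder(
            layer, num_layers=num_layers, enable_nested_tensor=False)

    def forward(self, token_ids, pad_mask):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.tok(token_ids) + self.pos(positions)
        return self.encoder(x, src_key_padding_mask=pad_mask)

torch.manual_seed(0)
model = SequenceClassifier(TinyBackbone(), d_model=32, num_classes=5)
token_ids = torch.randint(1, 100, (3, 10))
pad_mask = torch.zeros(3, 10, dtype=torch.bool)
pad_mask[0, 6:] = True                     # first sequence has 4 padded positions
logits = model(token_ids, pad_mask)

# Padding pitfall check: changing token ids under the mask should not move logits.
altered = token_ids.clone()
altered[0, 6:] = 0
padding_invariant = torch.allclose(logits, model(altered, pad_mask), atol=1e-5)
print(tuple(logits.shape), "padding-invariant:", padding_invariant)
```

The head emits raw logits, which is what `nn.CrossEntropyLoss` expects for single-label targets of dtype int64; a multi-label task would instead pair the same logits with `nn.BCEWithLogitsLoss` and float targets.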

Clarifying Questions for this Part

How many classes $C$, and is the task single-label (mutually exclusive) or multi-label?
Is there an existing [CLS]/[BOS] token to pool on, or should I add one or use masked mean-pooling?
Should the backbone be fine-tuned end to end, or frozen with only the new head trained?

What This Part Should Cover

A sensible pooling choice appropriate to the architecture, with awareness of the padding pitfall.
A correct classification head replacing or bypassing the token-level output, emitting per-sequence logits.
The matching loss for the label type (single-label vs. binary/multi-label), including correct target dtype.
A rigorous single-pass verification with explicit assertions on shape, finiteness, gradient flow into the backbone (not just the head), and parameter change, designed to catch a head that is connected to nothing upstream.
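
The single-pass verification described in the list above might look like the sketch below. It is an illustration, not a definitive harness: `verify_one_step` and `TinyClassifier` are hypothetical names, the `backbone.` parameter-name prefix is an assumption about how the model is organized, and the `detach_backbone` flag exists only to plant the "head connected to nothing upstream" bug on purpose.

```python
import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    """Hypothetical stand-in: a small backbone plus a classification head."""
    def __init__(self, vocab_size=50, d_model=16, num_classes=4, detach_backbone=False):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Embedding(vocab_size, d_model), nn.Linear(d_model, d_model), nn.Tanh())
        self.head = nn.Linear(d_model, num_classes)
        self.detach_backbone = detach_backbone

    def forward(self, token_ids):
        hidden = self.backbone(token_ids)          # (B, T, d_model)
        if self.detach_backbone:
            hidden = hidden.detach()               # planted bug: cuts the graph
        return self.head(hidden.mean(dim=1))       # (B, C)

def verify_one_step(model, token_ids, labels, num_classes, backbone_prefix="backbone."):
    """One forward/backward/step with positive checks; returns (loss, changed names)."""
    model.train()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    before = {n: p.detach().clone() for n, p in model.named_parameters()}

    logits = model(token_ids)
    assert logits.shape == (labels.size(0), num_classes), "unexpected logits shape"
    loss = nn.functional.cross_entropy(logits, labels)
    assert torch.isfinite(loss), "loss is not finite"

    optimizer.zero_grad()
    loss.backward()
    grads = {n: p.grad for n, p in model.named_parameters() if p.grad is not None}
    assert all(torch.isfinite(g).all() for g in grads.values()), "non-finite gradient"
    reaches_backbone = any(
        n.startswith(backbone_prefix) and g.abs().sum() > 0 for n, g in grads.items())
    assert reaches_backbone, "no gradient reaches the backbone"

    optimizer.step()
    changed = [n for n, p in model.named_parameters()
               if not torch.equal(p.detach(), before[n])]
    return loss.item(), changed

torch.manual_seed(0)
token_ids = torch.randint(0, 50, (4, 6))
labels = torch.tensor([0, 1, 2, 3])                # class indices, dtype int64
loss, changed = verify_one_step(TinyClassifier(), token_ids, labels, num_classes=4)
print(f"loss={loss:.3f}, parameters changed: {changed}")

try:
    verify_one_step(TinyClassifier(detach_backbone=True), token_ids, labels, num_classes=4)
    broken_detected = False
except AssertionError as err:
    broken_detected = True
    print("caught planted bug:", err)
```

The planted-bug variant still runs a forward pass and produces a finite loss, which is why the gradient and parameter-delta assertions are needed in addition to the shape check.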

What a Strong Answer Covers

These dimensions span both parts and distinguish a strong candidate:

Engineering judgment over enumeration: cheap, decisive experiments first; expensive hyperparameter tuning last.
"Ran without crashing" is never treated as evidence of correctness: the candidate insists on positive checks (shapes, finite loss, nonzero grads through the whole graph, an actual parameter delta after one step) in both the debug and the conversion.
Reasoning about silent failure: for each suspect component, naming what a wrong choice would look like rather than only what the right one is.

Follow-up Questions

Your tiny-batch overfit test fails to reach near-zero loss. How do you localize whether the fault is in masking (Part 1), the loss, or gradient flow?
The loss sits exactly at $\log V$ for many steps and never moves. What does that number tell you, and where do you look first?
The loss is finite for ~50 steps then suddenly goes NaN — how does an intermittent failure change your approach versus an immediate one?
After the Part 2 conversion, training loss drops but validation accuracy is stuck at $1/C$. What classification-specific bugs would you suspect, and how would an automated regression test prevent this bug from silently reappearing after a refactor?
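
For the intermittent-NaN follow-up, one approach a candidate might describe is a per-step guard that names the first non-finite quantity, so the failing step can be captured and replayed instead of rerunning blindly. The helper below is a sketch under that assumption. `nonfinite_report` is a hypothetical name, and the linear model and the infinite input are toy stand-ins used only to trigger the condition.

```python
import torch
import torch.nn as nn

def nonfinite_report(model, loss):
    """Return labels for the loss and any parameter gradients containing NaN or Inf."""
    bad = []
    if not torch.isfinite(loss).all():
        bad.append("loss")
    for name, param in model.named_parameters():
        if param.grad is not None and not torch.isfinite(param.grad).all():
            bad.append(f"grad:{name}")
    return bad

torch.manual_seed(0)
model = nn.Linear(4, 2)

# Healthy step: finite input, so nothing should be reported.
loss = model(torch.randn(3, 4)).sum()
loss.backward()
clean_report = nonfinite_report(model, loss)

# Corrupted step: an infinite input poisons both the loss and the weight gradient.
model.zero_grad()
bad_loss = model(torch.full((1, 4), float("inf"))).sum()
bad_loss.backward()
bad_report = nonfinite_report(model, bad_loss)
print("clean:", clean_report, "| corrupted:", bad_report)
```

Called after `backward()` and before `optimizer.step()`, a guard like this lets the training loop stop and save the offending batch at the first bad step.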
