PracHub

Debug a Broken Transformer

Last updated: Jun 21, 2026

Quick Overview

This question tests whether a candidate can debug and reconfigure a Transformer-based deep learning model. It covers model internals (attention mechanisms, positional encodings), data and label pipeline alignment, tensor shaping, gradient flow, and optimizer interactions. It is commonly asked because it reveals practical troubleshooting ability and architectural understanding: it is mostly hands-on application, backed by the conceptual reasoning needed for attention-specific failure modes and for training-versus-evaluation behavior.

  • medium
  • OpenAI
  • Machine Learning
  • Machine Learning Engineer

Company: OpenAI

Role: Machine Learning Engineer

Category: Machine Learning

Difficulty: medium

Interview Round: Technical Screen

Hints

Hints for Part 1 (debug the broken Transformer)

  • Triage the symptom first. Before reading code, classify the failure mode. "Loss flat at $\log V$", "loss is `NaN`", "trains but validation is garbage", and "wrong output shape" each implicate a different cluster of bugs. Let the symptom prune the search space.
  • The single highest-signal experiment. There is one cheap experiment that cleanly separates an architecture / loss / optimizer bug from a data / generalization problem, and tells you which half of the pipeline to stop investigating. What is it, and what should a correctly wired model do on a handful of examples? Name the experiment and reason through what each outcome rules in or out.
  • Where Transformer-specific bugs hide. The bugs that "run but learn nonsense" tend to live in the parts unique to attention. Interrogate each: is the score scaling present, and what breaks without it? Which positions does the mask suppress, with what fill value, and along which axis does softmax normalize? Is positional information added the right number of times? Are the train/eval modes set so dropout and normalization behave? For each, decide what a wrong choice would silently look like.
  • Prove gradients actually flow. Confirm the training loop steps are executed in the correct order and frequency, then inspect gradient norms at several points in the network. Zero gradients upstream of a certain layer mean something has broken the computation graph; verify that parameters actually change after one update step.

Hints for Part 2 (convert it to a classification model and verify)

  • The pooling gap. The backbone emits one hidden vector per token position but a classifier needs exactly one vector per sequence. What creates this mismatch, and what approaches exist to bridge it? Think about how the architecture (encoder vs. decoder) and the presence or absence of special tokens might constrain your choice, and what can go wrong with padding.
  • What "verified" actually means. A forward pass that doesn't crash is not verification. Push past "it ran": what specific shape, finiteness, gradient-flow, and parameter-change assertions, on one forward + backward + step, would actually prove the new head is wired to the rest of the model rather than dangling? Think about what a disconnected head would still pass, and design the checks to catch it.

Related Interview Questions

  • Implement 1NN with NumPy - OpenAI (medium)
  • Compute entropy and implement 1-NN - OpenAI (medium)
  • Defend a Research Direction and Experiment Design - OpenAI (medium)
  • Debug MiniGPT and Backpropagate Matmul - OpenAI (medium)
  • Implement Backprop for a Tiny Network - OpenAI (hard)
Mar 3, 2026, 12:00 AM

You are handed a Transformer model implementation (PyTorch-style) that does not train correctly — the loss is not behaving as expected. The model runs a forward pass without crashing, so the bug is logical rather than a syntax error. Walk through how you would debug it systematically, from the data input all the way to the optimizer, naming the concrete checks you would run at each stage and why — then repurpose the model for classification and prove the change is wired up correctly.

This question has two parts.

Constraints & Assumptions

  • You have the full training script — data loader, model definition (embeddings, multi-head attention, FFN, normalization, positional encoding), loss, and training loop — and can add prints/breakpoints and run small experiments.
  • "Does not train correctly" means it executes but the loss misbehaves (flat / NaN / chance-level / eval-only failure); it is not a hard crash with a stack trace.
  • Assume a typical config (`d_model`, `num_heads`, `num_layers`, vocab size $V$, max length $L$) on a small dataset with a single GPU/CPU; favor cheap, high-signal checks over long runs.
  • The base model may be encoder-style or decoder/LM-style; call out where that distinction changes masking or the head.
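The encoder/decoder distinction in the last constraint shows up most directly in the attention mask. The following is a minimal sketch, not the implementation under test: `attention_weights`, the tensor sizes, and the padding layout are all assumptions made for illustration. It shows scaled scores, a `-inf` fill for disallowed keys, softmax over the key axis, and the invariants one would assert for a causal (decoder-style) versus bidirectional (encoder-style) mask.

```python
import math
import torch

def attention_weights(q, k, mask):
    """Scaled dot-product attention weights (illustrative helper, not from the source).

    q, k: (batch, heads, seq, d_head). mask: bool, broadcastable to
    (batch, heads, seq_q, seq_k), where True means "may attend".
    """
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # scale by sqrt(d_head)
    scores = scores.masked_fill(~mask, float("-inf"))         # suppress disallowed keys
    return torch.softmax(scores, dim=-1)                      # normalize over the key axis

torch.manual_seed(0)
B, H, T, D = 2, 2, 4, 8                                       # hypothetical toy sizes
q, k = torch.randn(B, H, T, D), torch.randn(B, H, T, D)

# Decoder-style: lower-triangular causal mask (query i sees keys <= i).
causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
# Padding mask: assume the last position of sequence 1 is padding.
not_pad = torch.tensor([[True, True, True, True],
                        [True, True, True, False]])
key_mask = not_pad[:, None, None, :]                          # (B, 1, 1, T): masks keys

w_dec = attention_weights(q, k, causal & key_mask)            # causal + padding
w_enc = attention_weights(q, k, key_mask)                     # bidirectional + padding

# Invariants worth asserting on the real model's attention weights.
assert torch.allclose(w_dec.sum(-1), torch.ones(B, H, T))     # each query row sums to 1
assert torch.all(w_dec.triu(1) == 0)                          # no weight on future keys
assert torch.all(w_enc[1, :, :, 3] == 0)                      # no weight on the padded key
assert torch.all(w_enc[0, :, 0, 1:] > 0)                      # encoder does see later keys
```

Note that a query row whose keys are all masked would produce `NaN` after softmax, which is one way a masking bug can surface as a `NaN` loss.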

Clarifying Questions to Ask

  • What exactly is the symptom (flat loss, `NaN`/`Inf`, chance-level accuracy, or only the validation/eval path failing), and at which step does it appear?
  • Is this a from-scratch implementation or a fine-tune of a pretrained model, and was it ever known to work?
  • Is the model encoder-only (bidirectional), decoder-only (causal, next-token), or encoder-decoder, and what is the current task/loss?
  • Are mixed precision, gradient accumulation, or LR warmup/schedulers in play?
  • For the conversion: how many classes, single-label or multi-label, and is there a `[CLS]`-style token or pooling convention I should follow?

Part 1 — Debug the broken Transformer

The model executes but does not learn: the loss is flat (e.g. stuck near $\log V$), goes to `NaN`/diverges, accuracy never leaves chance level, or it trains but the eval path is broken. Lay out an ordered, hypothesis-driven debugging plan from data input to optimization. Explicitly address the common Transformer failure points (tokenization and label alignment, tensor shapes, attention masking, positional encoding, train-vs-eval mode, loss-function setup, gradient flow, and optimizer configuration), and for each say what you would inspect and what a bug there looks like.
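The $\log V$ figure is the cross-entropy (in nats) of a uniform guess over $V$ classes, so a loss pinned there suggests the model's predictions carry no information about the input. A small sketch of that arithmetic follows; the helper names and vocabulary sizes are illustrative assumptions.

```python
import math

def chance_level_loss(vocab_size: int) -> float:
    """Cross-entropy, in nats, of a uniform prediction over vocab_size classes."""
    return math.log(vocab_size)

def cross_entropy(logits, target):
    """Cross-entropy of one example from raw logits, via a stable log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]

# Hypothetical vocabulary sizes, for illustration only.
for v in (2, 1000, 50_000):
    uniform_logits = [0.0] * v                 # identical logits give a uniform softmax
    loss = cross_entropy(uniform_logits, target=0)
    print(f"V={v:>6}: log V = {chance_level_loss(v):.4f}, uniform-logit loss = {loss:.4f}")
```

Comparing the first logged training loss against `chance_level_loss(V)` is a cheap first check: a value far above it at initialization, or one that stays exactly on it, each points somewhere different.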

What This Part Should Cover

  • A prioritized, hypothesis-driven plan (reproduce → classify the symptom → cheapest high-signal checks first), not a flat checklist read top to bottom.
  • Data-pipeline validation before model internals: token-id validity, padding/special-token consistency across loader/mask/loss, label range and dtype, and correct input↔label alignment for the task type.
  • The overfit-a-tiny-batch test used as the decisive early experiment, with a clear reading of what each outcome rules in or out.
  • Shape discipline end to end: from embeddings through head split/merge to scores and back, with attention to transpose and reshape order.
  • Attention correctness: whether the candidate audits score scaling, mask shape/convention/fill value, and the softmax axis.
  • Positional encoding, residuals, normalization placement, and train/eval mode all handled correctly.
  • Loss/optimizer wiring: raw-logits-vs-probabilities expectation, target shape/dtype, padding handling, update-step ordering, finite nonzero grads, LR sanity, gradient clipping.
  • Numerical-stability instincts for `NaN`/divergence: what to inspect and in what order.
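The overfit-a-tiny-batch test from the list above can be sketched as a small, model-agnostic harness. This is a sketch under stated assumptions: `overfit_tiny_batch` is a hypothetical helper, and `toy_model` is a deliberately trivial stand-in for the Transformer under test (any module mapping `(B, T)` token ids to `(B, T, V)` logits would slot in). The step count and learning rate are illustrative, not recommendations.

```python
import torch
import torch.nn as nn

def overfit_tiny_batch(model, inputs, targets, steps=300, lr=0.05):
    """Train repeatedly on one fixed batch; return (initial_loss, final_loss)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()            # expects raw logits and int64 targets
    model.train()
    losses = []
    for _ in range(steps):
        opt.zero_grad()                        # clear stale grads before backward
        logits = model(inputs)                 # (batch, seq, vocab)
        loss = loss_fn(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
        loss.backward()
        opt.step()
        losses.append(loss.item())
    return losses[0], losses[-1]

torch.manual_seed(0)
V, D = 11, 32                                  # hypothetical toy sizes
# Stand-in for the model under test (assumption, not the interview model).
toy_model = nn.Sequential(nn.Embedding(V, D), nn.Linear(D, V))
inputs = torch.randint(0, V, (4, 8))           # 4 sequences of 8 token ids
targets = (inputs + 1) % V                     # deterministic, learnable labels

first, last = overfit_tiny_batch(toy_model, inputs, targets)
print(f"loss: {first:.3f} -> {last:.4f}")
```

If the real model drives this loss toward zero, the architecture, loss, and optimizer wiring are at least capable of learning, and attention shifts to data and generalization. If it cannot, the fault is likely in the model, loss, or update loop, and larger runs are unlikely to help. For this test it is usually sensible to set dropout to zero so that memorization is not fighting regularization.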

Part 2 — Convert it to a classification model and verify

As a follow-up, the interviewer asks you to repurpose this Transformer as a sequence classification model over $C$ classes. Describe the changes you would make, then describe a minimal verification, a single forward and backward pass on one batch, that proves the modified model is correctly wired before any real training run.

Clarifying Questions for this Part

  • How many classes $C$, and is the task single-label (mutually exclusive) or multi-label?
  • Is there an existing `[CLS]`/`[BOS]` token to pool on, or should I add one / use masked mean-pooling?
  • Should the backbone be fine-tuned end to end, or frozen with only the new head trained?

What This Part Should Cover

  • A sensible pooling choice appropriate to the architecture, with awareness of the padding pitfall.
  • A correct classification head replacing or bypassing the token-level output, emitting per-sequence logits.
  • The matching loss for the label type (single-label vs. binary/multi-label), including correct target dtype.
  • A rigorous single-pass verification with explicit shape, finiteness, gradient-flow into the backbone (not just the head), and parameter-change assertions, designed to catch a head that is connected to nothing upstream.
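The points above can be sketched as one forward + backward + step with explicit assertions. Everything named here is an assumption for illustration: `TinyBackbone` stands in for the existing Transformer, `masked_mean` is one possible pooling choice (pooling on a `[CLS]`-style token is another), and padding id `0`, the sizes, and single-label cross-entropy are all hypothetical.

```python
import torch
import torch.nn as nn

class TinyBackbone(nn.Module):
    """Hypothetical stand-in for the existing model: ids -> (B, T, D) hidden states."""
    def __init__(self, vocab=50, d_model=16):
        super().__init__()
        self.emb = nn.Embedding(vocab, d_model)
        self.layer = nn.TransformerEncoderLayer(
            d_model, nhead=4, dim_feedforward=32, dropout=0.0, batch_first=True)

    def forward(self, ids, pad_mask):
        # pad_mask: bool (B, T), True at padding positions.
        return self.layer(self.emb(ids), src_key_padding_mask=pad_mask)

def masked_mean(h, pad_mask):
    """Average hidden states over real tokens only, ignoring padding."""
    keep = (~pad_mask).unsqueeze(-1).to(h.dtype)          # (B, T, 1)
    return (h * keep).sum(1) / keep.sum(1).clamp(min=1.0)

class Classifier(nn.Module):
    def __init__(self, backbone, d_model, num_classes):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Linear(d_model, num_classes)       # per-sequence logits

    def forward(self, ids, pad_mask):
        return self.head(masked_mean(self.backbone(ids, pad_mask), pad_mask))

torch.manual_seed(0)
B, T, C, PAD = 4, 6, 3, 0                                 # hypothetical sizes and pad id
ids = torch.randint(1, 50, (B, T))
ids[0, 4:] = PAD                                          # one padded sequence
pad_mask = ids.eq(PAD)
labels = torch.randint(0, C, (B,))                        # int64 class indices

model = Classifier(TinyBackbone(), d_model=16, num_classes=C)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
before = {n: p.detach().clone() for n, p in model.named_parameters()}

model.train()
logits = model(ids, pad_mask)
loss = nn.functional.cross_entropy(logits, labels)        # raw logits, single-label
opt.zero_grad()
loss.backward()
opt.step()

assert logits.shape == (B, C)                             # one logit row per sequence
assert torch.isfinite(loss)
for name, p in model.named_parameters():                  # every parameter is in the graph
    assert p.grad is not None and torch.isfinite(p.grad).all(), name
assert model.backbone.emb.weight.grad.abs().sum() > 0     # grads reach the embeddings
changed = {n for n, p in model.named_parameters() if not torch.equal(p, before[n])}
assert "backbone.emb.weight" in changed and "head.weight" in changed
```

A head that was accidentally fed a detached or constant tensor would still pass the shape and finiteness checks, which is why the embedding-gradient and parameter-change assertions matter. For a multi-label task the loss would instead be `nn.BCEWithLogitsLoss` with float targets of shape `(B, C)`; if the backbone is intentionally frozen, the backbone assertions should be inverted to confirm that it does not move.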

What a Strong Answer Covers

These dimensions span both parts and distinguish a strong candidate:

  • Engineering judgment over enumeration: cheap, decisive experiments first; expensive hyperparameter tuning last.
  • "Ran without crashing" is never treated as evidence of correctness — the candidate insists on positive checks (shapes, finite loss, nonzero grads through the whole graph, an actual parameter delta after one step) in both the debug and the conversion.
  • Reasoning about silent failure: for each suspect component, naming what a wrong choice would look like rather than only what the right one is.

Follow-up Questions

  • Your tiny-batch overfit test fails to reach near-zero loss. How do you localize whether the fault is in masking (Part 1), the loss, or gradient flow?
  • The loss sits exactly at $\log V$ for many steps and never moves. What does that number tell you, and where do you look first?
  • The loss is finite for ~50 steps then suddenly goes NaN — how does an intermittent failure change your approach versus an immediate one?
  • After the Part 2 conversion, training loss drops but validation accuracy is stuck at $1/C$. What classification-specific bugs would you suspect, and how would an automated regression test prevent this bug from silently reappearing after a refactor?
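For the last follow-up, one plausible suspect (among others, such as a label-mapping mismatch between splits) is the evaluation path running with the wrong train/eval mode. As a hedged sketch of what a regression test for that could look like, the hypothetical helper `is_deterministic` below checks that repeated forward passes agree in eval mode and, for a model containing dropout, disagree in train mode. The small `net` is an illustrative stand-in, not the interview model.

```python
import torch
import torch.nn as nn

def is_deterministic(model, x):
    """Return True if two forward passes on the same input give identical outputs."""
    with torch.no_grad():
        return torch.equal(model(x), model(x))

torch.manual_seed(0)
# Illustrative stand-in containing dropout (assumption, not the real model).
net = nn.Sequential(nn.Linear(8, 8), nn.Dropout(p=0.5), nn.Linear(8, 3))
x = torch.randn(4, 8)

net.train()
train_same = is_deterministic(net, x)   # dropout active: outputs should differ
net.eval()
eval_same = is_deterministic(net, x)    # dropout disabled: outputs should match

print(f"train mode deterministic: {train_same}, eval mode deterministic: {eval_same}")
```

Run inside the evaluation loop's own setup code, a check like this would flag a refactor that drops the `model.eval()` call before validation.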
