PracHub
QuestionsCoachesLearningGuidesInterview Prep
|Home/Machine Learning/OpenAI

Debug a transformer training pipeline

Last updated: Jun 15, 2026

Quick Overview

An OpenAI ML engineer technical-screen debugging question: given a PyTorch Transformer training pipeline whose loss diverges and validation accuracy stays near chance (plus mask IndexErrors, AMP crashes, and nondeterminism), identify the distinct bugs and fix them. It tests Transformer internals - attention/causal masking, teacher-forcing label shift, positional encodings, ignore_index loss, optimizer and GradScaler order, determinism, and DDP - along with writing unit tests and a structured debugging plan.

  • hard
  • OpenAI
  • Machine Learning
  • Machine Learning Engineer

Debug a transformer training pipeline

Company: OpenAI

Role: Machine Learning Engineer

Category: Machine Learning

Difficulty: hard

Interview Round: Technical Screen

##### Question You are given a PyTorch Transformer-based training pipeline (a multi-head attention encoder-decoder, with tokenization, padding/masking, AdamW with weight decay, gradient clipping, a cosine LR scheduler with warmup, mixed precision, and teacher forcing) that misbehaves. The training loss decreases or even diverges while validation accuracy stays near chance, and the run also exhibits intermittent crashes and irreproducible results. Concretely, the pipeline shows the following symptoms: 1. Occasional CUDA shape/`IndexError`s around attention masks. 2. Validation metrics plateau near chance even though training loss decreases (and in some runs the loss diverges entirely). 3. Intermittent crashes under mixed precision (AMP) combined with gradient accumulation. 4. Nondeterministic results across runs with the same config. Answer the following: 1. Identify at least four distinct, plausible root-cause bugs behind these symptoms. Draw from the standard failure surface, e.g.: incorrect attention/causal mask construction (shape, dtype, device, or polarity); off-by-one teacher-forcing label shift and PAD tokens not ignored in the loss; missing or mis-placed `optimizer.zero_grad` (especially with gradient accumulation); misplaced gradient scaling / clipping before unscale under AMP; wrong positional-encoding shape or broadcast (or position indices exceeding the embedding table); incorrectly excluding LayerNorm/bias from weight decay; and data leakage in batching. 2. For each bug, explain the failure mode (the signal you would look for), show a minimal code fix, and write a unit test or runtime assertion that would catch it and prevent regressions. 3. Lay out a systematic debugging plan that localizes each fault component-by-component. Give concrete checks and experiments for: data preprocessing (padding, truncation, label alignment); tokenization and attention/causal masks; positional encodings; loss computation (`ignore_index`, label smoothing, class weights); optimizer/scheduler/`zero_grad`/gradient clipping; AMP/`GradScaler` settings; seed control and deterministic kernels; and DDP/Sampler configuration. 4. Specify the sanity checks (e.g., overfit a tiny subset / copy task, gradient and activation statistics, NaN/Inf detection), the metrics you would monitor, and a small experiment to verify each fix.

Quick Answer: An OpenAI ML engineer technical-screen debugging question: given a PyTorch Transformer training pipeline whose loss diverges and validation accuracy stays near chance (plus mask IndexErrors, AMP crashes, and nondeterminism), identify the distinct bugs and fix them. It tests Transformer internals - attention/causal masking, teacher-forcing label shift, positional encodings, ignore_index loss, optimizer and GradScaler order, determinism, and DDP - along with writing unit tests and a structured debugging plan.

Related Interview Questions

  • Implement 1NN with NumPy - OpenAI (medium)
  • Compute entropy and implement 1-NN - OpenAI (medium)
  • Defend a Research Direction and Experiment Design - OpenAI (medium)
  • Implement Backprop for a Tiny Network - OpenAI (hard)
  • Debug MiniGPT and Backpropagate Matmul - OpenAI (medium)
|Home/Machine Learning/OpenAI

Debug a transformer training pipeline

OpenAI logo
OpenAI
Jul 31, 2025, 12:00 AM
hardMachine Learning EngineerTechnical ScreenMachine Learning
43
0

Debug a Transformer training pipeline

You are handed a PyTorch Transformer encoder–decoder training pipeline that misbehaves. The pipeline includes tokenization, padding/masking, multi-head attention, AdamW with weight decay, gradient clipping, a cosine LR scheduler with warmup, mixed precision (AMP), and teacher forcing.

Its behavior is inconsistent: training loss decreases (or sometimes diverges entirely) while validation accuracy stays near chance, and the run also crashes intermittently and produces irreproducible results. This is a debugging exercise — the interviewer is assessing your method (reproduce, isolate, fix, verify) as much as the specific bugs you name.

Observed symptoms

  1. Occasional CUDA shape / IndexError s around attention masks.
  2. Validation metrics plateau near chance even though training loss decreases (and in some runs the loss diverges entirely).
  3. Intermittent crashes under mixed precision (AMP) combined with gradient accumulation .
  4. Nondeterministic results across runs with the same config.

Clarifying Questions to Ask

Before diving in, scope the problem with the interviewer:

  • Are we using nn.Transformer (or a hand-rolled attention block), and is it batch_first=True ? What are the mask conventions (boolean vs. additive, and which polarity means "masked out")?
  • What are the known-good values for PAD_ID , BOS_ID , and EOS_ID , and which one is used as the loss ignore_index ?
  • Are positional encodings sinusoidal (fixed) or learned (with a max_position_embeddings cap)?
  • Is the run single-GPU or DDP / multi-node, and is the nondeterminism observed on a single device or only across ranks?
  • Are AMP precision (fp16 vs. bf16) and the gradient-accumulation step count fixed, or are we free to change them while debugging?
  • Do we have the freedom to add instrumentation and a tiny synthetic dataset, or must we reproduce on the real data only?

Constraints & Assumptions

State your conventions explicitly so your fixes are unambiguous (the interviewer will ask). A reasonable default set:

  • nn.Transformer with batch_first=True ; boolean masks where True = masked-out for both key-padding and causal masks.
  • PAD_ID = 0 , BOS_ID = 1 ; loss uses ignore_index = PAD_ID .
  • A tiny synthetic dataset (e.g. ~32 short examples, small vocab) and a tiny model (2 layers, d_model ≈ 64 ) are available so each debug iteration takes seconds.
  • The faults are independent and their symptoms overlap; assume more than one bug is present simultaneously.

Answer the following four parts.

1. Identify at least four root-cause bugs

Identify at least four distinct, plausible root-cause bugs behind these symptoms. Draw from the standard failure surface, for example:

  • Incorrect attention/causal mask construction — wrong shape, dtype, device, or polarity.
  • Off-by-one teacher-forcing label shift , and PAD tokens not ignored in the loss.
  • Missing or mis-placed optimizer.zero_grad (especially with gradient accumulation).
  • Misplaced gradient scaling / clipping before unscale under AMP.
  • Wrong positional-encoding shape or broadcast (or position indices exceeding the embedding table).
  • Incorrectly excluding LayerNorm/bias from weight decay .
  • Data leakage in batching.

2. For each bug: failure mode, fix, and test

For each bug you name, provide:

  • The failure mode — the signal you would look for to recognize it.
  • A minimal code fix .
  • A unit test or runtime assertion that would catch it and prevent regressions.

3. Systematic, component-by-component debugging plan

Lay out a systematic debugging plan that localizes each fault component by component. Give concrete checks and experiments for:

  • Data preprocessing — padding, truncation, label alignment.
  • Tokenization and attention/causal masks.
  • Positional encodings.
  • Loss computation — ignore_index , label smoothing, class weights.
  • Optimizer / scheduler / zero_grad / gradient clipping.
  • AMP / GradScaler settings.
  • Seed control and deterministic kernels.
  • DDP / Sampler configuration.

4. Sanity checks, metrics, and verification

Specify:

  • The sanity checks you would run — for example, overfitting a tiny subset or copy task, gradient and activation statistics, and NaN/Inf detection.
  • The metrics you would monitor .
  • A small experiment to verify each fix .

What a Strong Answer Covers

The interviewer is listening for these signals (these are the dimensions of a good answer, not the answers themselves):

  • Method over guessing — a stated reproduce → reduce → instrument → isolate → verify loop, not a hyperparameter lottery.
  • Explicit conventions — masks (bool/additive, polarity), batch_first , PAD/BOS/EOS, and ignore_index all stated up front.
  • Four genuinely distinct root causes , each mapped to the symptom(s) it explains, with a minimal fix and a regression test that pins an invariant .
  • Awareness of overlapping, simultaneous faults — recognizing that one symptom can have several causes.
  • AMP / accumulation correctness — treating the scaler, clipping, optimizer step, scheduler step, and zero_grad as an ordered, cadence-sensitive sequence rather than an unordered set of calls.
  • Determinism hygiene — seeds, deterministic kernels, DataLoader worker seeding, DDP sampler set_epoch .
  • A falsifiable verification protocol — overfit-tiny-subset / copy task as the wiring proof, plus a per-fix experiment and the metrics that would confirm it.

Follow-up Questions

Be ready for deeper probes after the main answer:

  • The pipeline now overfits a tiny subset but still won't reproduce bit-for-bit across runs. Where do you look next, and which "deterministic" ops have no deterministic CUDA implementation?
  • You switch from fp16 AMP to bf16 . Which of your bug fixes becomes unnecessary, and why?
  • Validation accuracy improves but is still a few points below train. How do you distinguish a residual data-leakage bug from genuine generalization gap?
  • How does the GradScaler order and zero_grad cadence change when you move from single-GPU to DDP with gradient accumulation (and where does model.no_sync() fit)?
Loading comments...

Browse More Questions

More Machine Learning•More OpenAI•More Machine Learning Engineer•OpenAI Machine Learning Engineer•OpenAI Machine Learning•Machine Learning Engineer Machine Learning

Write your answer

Your first approved answer each day earns 20 XP.

Sign in to write your answer.
PracHub

Master your tech interviews with 8,000+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • AI Coding Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.