PracHub

Debug a transformer training pipeline

Last updated: Mar 29, 2026

Quick Overview

This question evaluates debugging and troubleshooting skills for deep learning training pipelines: understanding of Transformer internals, masking and positional encodings, optimization and numerical stability in PyTorch, data/label alignment, and the ability to write unit tests and assertions that guard against regressions.

  • Company: OpenAI
  • Role: Machine Learning Engineer
  • Category: Machine Learning
  • Difficulty: Hard
  • Interview Round: Technical Screen


Related Interview Questions

  • Implement Backprop for a Tiny Network - OpenAI (hard)
  • Filter Bad Human Annotations - OpenAI (medium)
  • Compute Matrix Prefix Products And Gradients - OpenAI (hard)
  • Improve Training With Noisy Annotators - OpenAI (hard)
  • Debug a Broken Transformer - OpenAI (medium)
Posted: Sep 6, 2025

Diagnose a Diverging PyTorch Transformer Training Run

You are given a PyTorch Transformer training pipeline whose loss diverges and validation accuracy remains near random. The repository already includes:

  • Tokenization, padding, and masking
  • An encoder–decoder Transformer with multi-head attention
  • AdamW with weight decay, gradient clipping
  • Cosine LR scheduler with warmup
  • Mixed precision (torch.cuda.amp.GradScaler)
  • Teacher forcing in the decoder
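To make the masking component concrete, here is a minimal sketch (the helper name and pad-token convention are illustrative, not from the repo) of building the causal and padding masks with the boolean convention `torch.nn.Transformer` expects, where `True` means "do not attend":

```python
import torch

def build_masks(tgt_len: int, token_ids: torch.Tensor, pad_token_id: int = 0):
    """Illustrative helper: causal mask for the decoder plus a padding mask.

    Convention (matches torch.nn.Transformer): boolean masks where True
    means "do not attend". Common bugs: inverting this convention, passing
    a float mask where a bool one is expected, or building the mask on the
    CPU while the model lives on the GPU.
    """
    device = token_ids.device
    # Upper-triangular causal mask: position i may not attend to j > i.
    causal = torch.triu(
        torch.ones(tgt_len, tgt_len, dtype=torch.bool, device=device),
        diagonal=1,
    )
    # Padding mask: True wherever the token is padding.
    padding = token_ids.eq(pad_token_id)
    # Runtime assertions that catch the dtype/device variants of the bug.
    assert causal.dtype == torch.bool and padding.dtype == torch.bool
    assert padding.device == causal.device
    return causal, padding
```

The assertions catch the dtype/device variants at runtime; an inverted mask convention is best caught by a unit test on a known sequence, e.g. checking that the diagonal of the causal mask is unmasked (every token may attend to itself).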

Task

  1. Identify four distinct bugs that could plausibly cause divergence or random validation accuracy (e.g., incorrect attention mask construction or dtype/device, off-by-one label shift, missing gradient zeroing, misplaced gradient scaling, wrong positional encoding shape/broadcast, excluding LayerNorm/bias from weight decay incorrectly, data leakage in batching).
  2. For each bug:
    • Explain the failure mode (why it breaks learning).
    • Show a minimal code fix.
    • Provide a unit test or runtime assertion to catch it in the future.
  3. Propose a debugging plan including sanity checks (e.g., copy task), gradient/activation statistics, NaN detection; define which metrics to monitor; and outline a small experiment to verify each fix.
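As one concrete instance of item 2, the off-by-one label shift can be fixed and guarded as follows (a sketch with hypothetical names, assuming the usual teacher-forcing convention that the decoder input is the target sequence shifted right):

```python
import torch

def shift_for_teacher_forcing(tgt: torch.Tensor):
    """tgt: (batch, seq_len) token ids of the form BOS ... EOS.

    Decoder input drops the last token; labels drop the first, so that
    position t of the input predicts token t+1. The classic bug is feeding
    identical input and labels: the model learns to copy the current token,
    training loss goes to ~0, yet validation accuracy stays near random.
    """
    decoder_input = tgt[:, :-1]
    labels = tgt[:, 1:]
    assert decoder_input.shape == labels.shape
    # Guard against the identity bug: for any sequence with distinct
    # tokens, input and labels must differ somewhere.
    assert not torch.equal(decoder_input, labels)
    return decoder_input, labels
```

A unit test on a short fixed sequence pins the convention down permanently, so a later refactor cannot silently reintroduce the shift error.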

Solution
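As a partial sketch toward the solution (not the hidden reference answer; function and batch-key names are hypothetical), here is a correctly ordered mixed-precision training step, covering the "missing gradient zeroing" and "misplaced gradient scaling" failure modes from the prompt:

```python
import torch

def train_step(model, batch, optimizer, scaler, loss_fn):
    """One correctly ordered AMP step. Orderings that diverge in practice:
    clipping before scaler.unscale_ (clips scaled grads, so the threshold
    is effectively a no-op), or calling optimizer.step() directly instead
    of scaler.step(optimizer) (steps on inf/NaN gradients)."""
    optimizer.zero_grad(set_to_none=True)  # forgetting this accumulates grads
    device = "cuda" if torch.cuda.is_available() else "cpu"
    with torch.autocast(device_type=device, enabled=(device == "cuda")):
        loss = loss_fn(model(batch["x"]), batch["y"])
    scaler.scale(loss).backward()     # backward on the *scaled* loss
    scaler.unscale_(optimizer)        # unscale before clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(optimizer)            # skips the step if grads contain inf/NaN
    scaler.update()
    return loss.detach()
```

A cheap runtime guard here is to assert `torch.isfinite(loss)` each step (or every N steps) so NaN/Inf divergence is caught at its first occurrence rather than after the loss curve has already blown up.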

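For the sanity-check part of item 3, a copy task is a cheap first experiment (a sketch; names and sizes are illustrative): train the full pipeline on batches whose target is just the source sequence. A healthy Transformer reaches near-100% token accuracy on this within a few hundred steps; failure localizes the bug to the model/masking/optimization rather than the real dataset.

```python
import torch

def copy_task_batch(batch_size=32, seq_len=10, vocab=50, bos=1):
    """Source = random tokens; target = BOS followed by the same tokens.
    A pipeline that cannot learn to copy has a bug in the model, masks,
    or optimizer setup, not in the real data."""
    src = torch.randint(2, vocab, (batch_size, seq_len))
    tgt = torch.cat([torch.full((batch_size, 1), bos), src], dim=1)
    return src, tgt

def token_accuracy(logits, labels, pad_token_id=0):
    """Padding-aware token accuracy; monitor alongside the loss."""
    mask = labels.ne(pad_token_id)
    correct = logits.argmax(-1).eq(labels) & mask
    return correct.sum().item() / mask.sum().item()
```

Tracking token accuracy alongside loss on the copy task also helps distinguish the label-shift bug (training loss near zero while accuracy on held-out data stays random) from genuine divergence.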

