Diagnose Transformer training and inference bugs
Company: OpenAI
Role: Machine Learning Engineer
Category: Machine Learning
Difficulty: hard
Interview Round: Technical Screen
A Transformer-based sequence model intermittently throws shape/dtype mismatch errors and fails to converge after several thousand steps. Describe your end-to-end debugging approach, covering:
- validating tokenization, padding, and special tokens;
- verifying attention masks (causal vs. bidirectional) and positional embeddings;
- adding assertions and unit tests for tensor shapes and sequence lengths;
- pinpointing exploding/vanishing gradients (e.g., gradient norms, clipping, optimizer betas, LR schedule, mixed precision);
- isolating data bugs (truncation, BOS/EOS handling, label shifting for LM objectives);
- diagnosing loss NaNs and numerical instability;
- checking checkpoint/resume logic and randomness/seed control;
- debugging multi-GPU/DP/ZeRO issues (sync, grad scaling, AMP);
- profiling throughput and memory (operator-level hotspots, OOM sources);
- constructing a minimal reproducible example with targeted logging to localize the fault.
Provide concrete checks, metrics, and code-level instrumentation you would add.
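For illustration, a minimal sketch (PyTorch; all function names and the `-100` ignore-index convention are assumptions, not part of the question) of the kind of shape, mask, and label-preparation assertions a strong answer might describe adding to a causal-LM training loop:

```python
# Hypothetical batch-invariant checks for a causal-LM training loop.
import torch

def check_batch(input_ids: torch.Tensor, attention_mask: torch.Tensor,
                labels: torch.Tensor, pad_id: int) -> None:
    """Assert basic invariants on a (batch, seq) causal-LM batch."""
    assert input_ids.ndim == 2, f"expected (batch, seq), got {tuple(input_ids.shape)}"
    assert input_ids.shape == attention_mask.shape == labels.shape
    assert attention_mask.dtype in (torch.long, torch.bool)
    pad_positions = input_ids == pad_id
    # Padding must be excluded from both attention and the loss.
    assert not attention_mask[pad_positions].any(), "pad tokens are attended to"
    assert (labels[pad_positions] == -100).all(), "pad tokens contribute to loss"

def causal_mask(seq_len: int) -> torch.Tensor:
    """Lower-triangular causal mask: True where attention is allowed."""
    return torch.ones(seq_len, seq_len).tril().bool()

def make_lm_labels(input_ids: torch.Tensor, pad_id: int) -> torch.Tensor:
    """Labels for a next-token objective: copy inputs, mask pads with -100.
    Note: the one-position shift typically happens inside the model/loss
    (as in Hugging Face causal LMs) -- verify which convention applies."""
    labels = input_ids.clone()
    labels[labels == pad_id] = -100  # cross-entropy ignore index
    return labels
```

A unit test can then feed a tiny hand-written batch through `check_batch` and assert that `causal_mask` forbids attending to future positions, which catches mask-polarity and label-shift bugs before they show up as slow non-convergence.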
Quick Answer: This question evaluates competency in debugging Transformer-based models: tokenizer and padding correctness, attention-mask and positional-embedding validation, gradient and optimizer behavior, numerical stability and NaN sources, multi-GPU and checkpointing issues, data integrity, and performance profiling. It is commonly asked to assess an engineer's ability to diagnose complex, system-level training and inference failures, and it tests both conceptual understanding of model internals and practical skill in instrumentation, reproducibility, and large-scale debugging.
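As a concrete example of the instrumentation interviewers look for, here is a hedged sketch (PyTorch; function names are illustrative) of gradient-norm logging and NaN/Inf localization, the kind of per-step checks that pinpoint exploding gradients or the first parameter whose gradient goes non-finite:

```python
# Hypothetical gradient instrumentation for localizing training instability.
import math
import torch

def grad_global_norm(model: torch.nn.Module) -> float:
    """Global L2 norm over all parameter gradients.
    Logged per step, a sudden spike flags exploding gradients and tells you
    where to look relative to LR schedule changes or bad batches."""
    total = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total += p.grad.detach().float().pow(2).sum().item()
    return math.sqrt(total)

def first_nonfinite_param(model: torch.nn.Module):
    """Name of the first parameter with a NaN/Inf gradient, else None.
    Checked right after backward(), this localizes the layer where
    numerical instability (e.g., fp16 overflow) first appears."""
    for name, p in model.named_parameters():
        if p.grad is not None and not torch.isfinite(p.grad).all():
            return name
    return None
```

In practice one would log `grad_global_norm` alongside the loss every N steps and raise (or dump the offending batch) as soon as `first_nonfinite_param` returns a name, turning an intermittent NaN loss into a reproducible, attributable failure.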