Debug a transformer training pipeline
Company: OpenAI
Role: Machine Learning Engineer
Category: Machine Learning
Difficulty: Hard
Interview Round: Technical Screen
You are given a PyTorch Transformer training pipeline whose loss diverges and whose validation accuracy remains near random. The repo includes tokenization, padding/masking, a multi-head-attention encoder-decoder, AdamW with weight decay, gradient clipping, a cosine LR scheduler with warmup, mixed precision, and teacher forcing. Identify four distinct bugs that could plausibly cause these symptoms (e.g., incorrect attention mask construction or wrong mask dtype/device, an off-by-one label shift, missing gradient zeroing, misplaced gradient scaling under mixed precision, a wrong positional encoding shape/broadcast, applying weight decay to LayerNorm/bias parameters that should be excluded, or data leakage in batching). For each bug, explain the failure mode, show a minimal code fix, and write a unit test or runtime assertion that catches it. Then propose a debugging plan (sanity checks such as a copy task, gradient/activation statistics, and NaN detection), the metrics to monitor, and a small experiment to verify each fix.
Quick Answer: This question evaluates debugging and troubleshooting skills for deep learning training pipelines. It tests understanding of Transformer internals, masking and positional encodings, optimization and numerical stability in PyTorch, and data/label alignment, along with the ability to write unit tests and assertions that guard against regressions. Illustrative sketches for several of the candidate bugs follow.
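Example fix 1, attention mask construction and dtype/device: PyTorch's nn.MultiheadAttention treats True entries in a boolean mask as positions to exclude, and the mask must live on the input's device. A minimal sketch; the helper names and pad_id are hypothetical.

```python
import torch

def make_causal_mask(seq_len: int, device: torch.device) -> torch.Tensor:
    # Bool mask, True above the diagonal = positions a query must NOT attend to
    # (PyTorch masks out True entries of a boolean attn_mask).
    return torch.triu(
        torch.ones(seq_len, seq_len, dtype=torch.bool, device=device), diagonal=1
    )

def make_padding_mask(tokens: torch.Tensor, pad_id: int) -> torch.Tensor:
    # (batch, seq_len) bool, True where the token is padding.
    return tokens.eq(pad_id)

tokens = torch.tensor([[5, 7, 9, 0, 0]])  # hypothetical batch, pad_id = 0
causal = make_causal_mask(tokens.size(1), tokens.device)
padding = make_padding_mask(tokens, pad_id=0)

# Runtime assertions that catch the classic dtype/device/inversion bugs early:
assert causal.dtype == torch.bool and padding.dtype == torch.bool
assert causal.device == tokens.device
assert not causal.diagonal().any(), "a token must be able to attend to itself"
assert not padding.all(dim=1).any(), "a fully padded row yields NaN attention"
```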
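Example fix 2, off-by-one label shift: with teacher forcing, the decoder input and target must be shifted by exactly one position; if they are aligned (or shifted twice), the model is trained to predict the wrong position and generation stays near random. A sketch with a unit-test-style check; the token ids are illustrative.

```python
import torch

def shift_for_teacher_forcing(seq: torch.Tensor):
    # seq: (batch, T) = [BOS, w1, ..., wn, EOS]
    # Decoder input drops the last token, target drops the first,
    # so the model at position t is trained to predict token t + 1.
    return seq[:, :-1], seq[:, 1:]

def test_shift_alignment():
    seq = torch.tensor([[1, 10, 11, 12, 2]])  # 1 = BOS, 2 = EOS (hypothetical ids)
    dec_in, tgt = shift_for_teacher_forcing(seq)
    # At every position the target equals the *next* decoder input token.
    assert torch.equal(dec_in[:, 1:], tgt[:, :-1])
    assert tgt[0, -1].item() == 2  # the model must learn to emit EOS

test_shift_alignment()
```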
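Example fix 3, gradient zeroing and scaler ordering under mixed precision: gradients must be zeroed every step, and clipping only makes sense after scaler.unscale_, otherwise the clip threshold is applied to scaled gradients and is effectively meaningless. A sketch of a correct step, assuming a generic encoder-decoder forward signature.

```python
import torch
from torch.cuda.amp import GradScaler, autocast  # torch.amp.* in newer PyTorch

scaler = GradScaler()

def train_step(model, src, tgt_in, tgt_out, optimizer, criterion, max_norm=1.0):
    optimizer.zero_grad(set_to_none=True)  # omitting this accumulates gradients
    with autocast():
        logits = model(src, tgt_in)
        loss = criterion(logits.reshape(-1, logits.size(-1)), tgt_out.reshape(-1))
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)             # unscale BEFORE clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    scaler.step(optimizer)                 # step/update AFTER clipping
    scaler.update()
    assert torch.isfinite(loss), "loss went NaN/Inf"
    return loss.item()
```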
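Example fix 4, positional encoding shape/broadcast: a buffer built for seq-first (T, B, D) layout used with batch-first (B, T, D) input either errors out or, when the sizes happen to coincide, silently adds the wrong positions. A sketch of a batch-first sinusoidal module with a shape assertion, assuming an even d_model.

```python
import math
import torch

class SinusoidalPE(torch.nn.Module):
    def __init__(self, d_model: int, max_len: int = 5000):
        super().__init__()
        pos = torch.arange(max_len).unsqueeze(1)  # (max_len, 1)
        div = torch.exp(
            torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model)
        )
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        # (1, max_len, d_model) broadcasts cleanly over batch-first input.
        self.register_buffer("pe", pe.unsqueeze(0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, D)
        assert x.dim() == 3 and x.size(-1) == self.pe.size(-1), \
            f"expected (B, T, {self.pe.size(-1)}), got {tuple(x.shape)}"
        return x + self.pe[:, : x.size(1)]
```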
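Example fix 5, weight decay hygiene: with AdamW, biases and LayerNorm parameters should be excluded from weight decay; decaying them quietly degrades training. A sketch of parameter grouping, using a toy model only to keep the snippet self-contained.

```python
import torch

def build_param_groups(model: torch.nn.Module, weight_decay: float = 0.01):
    decay, no_decay = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        # 1-D parameters are biases and norm weights; they should not decay.
        (no_decay if p.dim() == 1 else decay).append(p)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]

model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.LayerNorm(16))
optimizer = torch.optim.AdamW(build_param_groups(model), lr=3e-4)

# Unit-test-style check: every parameter in the no-decay group is 1-D.
for group in optimizer.param_groups:
    if group["weight_decay"] == 0.0:
        assert all(p.dim() == 1 for p in group["params"])
```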
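For the debugging plan, a copy-task sanity check separates wiring bugs from data problems: a correctly wired encoder-decoder should drive loss near zero when trained to reproduce a handful of random sequences. The model's forward signature and the threshold below are assumptions.

```python
import torch

def copy_task_sanity_check(model, vocab_size: int, steps: int = 200) -> bool:
    # If loss stays flat here, the bug is in the wiring (masks, label shift,
    # optimizer step), not in the real dataset.
    data = torch.randint(3, vocab_size, (32, 10))  # toy batch; ids >= 3 skip specials
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
    crit = torch.nn.CrossEntropyLoss()
    for _ in range(steps):
        opt.zero_grad(set_to_none=True)
        logits = model(data, data[:, :-1])  # teacher-forced copy (assumed signature)
        loss = crit(logits.reshape(-1, vocab_size), data[:, 1:].reshape(-1))
        loss.backward()
        opt.step()
    return loss.item() < 0.1  # a healthy model overfits this almost exactly
```

Alongside this, monitor the loss curve, the gradient norm before and after clipping, the learning rate, and the AMP loss scale; a repeatedly collapsing loss scale is an early sign of numerical instability.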
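Finally, NaN detection: forward hooks flag the first module whose output goes non-finite, which localizes divergence far earlier than a NaN loss does; PyTorch's built-in anomaly mode does the same for the backward pass. A sketch:

```python
import torch

def attach_nan_hooks(model: torch.nn.Module):
    # Raise at the first module producing non-finite activations.
    def hook(module, inputs, output):
        outs = output if isinstance(output, tuple) else (output,)
        for o in outs:
            if torch.is_tensor(o) and not torch.isfinite(o).all():
                raise RuntimeError(
                    f"non-finite output in {module.__class__.__name__}"
                )
    for m in model.modules():
        m.register_forward_hook(hook)

# For the backward pass, torch.autograd.set_detect_anomaly(True) reports the
# op that produced a NaN gradient (slow; enable only while debugging).
```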