You are given a small Transformer model implementation (e.g., in PyTorch) plus a tiny training script. The code executes, but the model does not match a reference implementation: unit tests that check (1) the forward-pass output for a fixed input/seed and (2) the training loss for one step either fail or give results inconsistent with the reference.
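For concreteness, here is a minimal sketch of the kind of test harness described above; `model`, `compute_loss`, `reference_output`, and `expected_loss` are hypothetical fixtures standing in for the actual test assets:

```python
import torch

def test_forward_matches_reference(model, reference_output, atol=1e-5):
    # Deterministic setup: fixed seed, eval mode (disables dropout).
    torch.manual_seed(0)
    model.eval()
    x = torch.randint(0, 100, (2, 8))  # fixed (batch, seq_len) token ids
    with torch.no_grad():
        out = model(x)
    assert torch.allclose(out, reference_output, atol=atol)

def test_one_step_loss(model, compute_loss, expected_loss, atol=1e-4):
    # The loss on a single fixed batch should match the value recorded
    # from the reference implementation.
    torch.manual_seed(0)
    model.train()
    x = torch.randint(0, 100, (2, 8))
    loss = compute_loss(model, x)
    assert abs(loss.item() - expected_loss) < atol
```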
Task: Debug the model so that it runs end-to-end and matches the expected outputs/loss. The buggy code contains multiple independent issues, including:
- An error in the attention mask (shape/broadcasting or causal/padding masking is applied incorrectly); see the mask sketch after this list.
- Incorrect parameter initialization (some weights are initialized with the wrong distribution/scale or not initialized at all); see the initialization sketch after this list.
- A bug in the loss computation due to misaligned positions (e.g., logits/labels are shifted incorrectly for next-token prediction); see the loss sketch after this list.
- One additional hidden bug of similar difficulty (e.g., wrong softmax dimension, missing attention scaling by sqrt(d_k), wrong dtype/device handling, dropout/eval-mode misuse, or an off-by-one in sequence lengths); the attention sketch after this list covers the first two of these.
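For the mask bug, a correct construction of a combined causal/padding mask might look like this sketch, assuming a boolean convention where `True` means "may attend" (the buggy code may instead use an additive float mask):

```python
import torch

def build_attention_mask(seq_len, pad_mask=None, device=None):
    # Causal part: position i may attend to positions j <= i.
    # Shape (1, 1, S, S) so it broadcasts over (batch, heads, query, key).
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool, device=device))
    mask = causal[None, None, :, :]
    if pad_mask is not None:
        # pad_mask: (batch, S), True where the token is real.
        # Broadcast as (batch, 1, 1, S) so padded *keys* are masked out.
        mask = mask & pad_mask[:, None, None, :]
    return mask

# Applied to the raw attention scores before the softmax:
#   scores = scores.masked_fill(~mask, float("-inf"))
```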
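For the initialization bug, one plausible reference scheme is the GPT-2-style recipe below; this is an assumption, and the exact distribution and scale must be read off the reference implementation:

```python
import torch.nn as nn

def init_weights(module):
    # GPT-2-style defaults (an assumption; match the reference's actual scheme).
    if isinstance(module, nn.Linear):
        nn.init.normal_(module.weight, mean=0.0, std=0.02)
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
        nn.init.normal_(module.weight, mean=0.0, std=0.02)
    elif isinstance(module, nn.LayerNorm):
        nn.init.ones_(module.weight)
        nn.init.zeros_(module.bias)

# Usage: model.apply(init_weights)
```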
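For the loss-alignment bug, the standard next-token shift drops the last logit and the first label so that position t predicts token t+1; a minimal sketch:

```python
import torch.nn.functional as F

def next_token_loss(logits, tokens, ignore_index=-100):
    # logits: (batch, S, vocab); tokens: (batch, S)
    # Position t predicts token t+1: drop the last logit and the first token.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = tokens[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=ignore_index,  # skip padding positions, if labeled as such
    )
```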
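And for the first two hidden-bug candidates, correct scaled dot-product attention scales the scores by sqrt(d_k) and normalizes over the key dimension; a minimal sketch reusing the boolean-mask convention from above:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, heads, S, d_k)
    d_k = q.size(-1)
    scores = (q @ k.transpose(-2, -1)) / math.sqrt(d_k)  # scale by sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(~mask, float("-inf"))
    # Normalize over the *key* dimension (last dim), not the query dimension.
    attn = torch.softmax(scores, dim=-1)
    return attn @ v
```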
Explain how you would systematically find and fix these issues, and what the correct implementations should look like.