Fix three bugs in a minimal GPT to meet a training-loss target
You are given a Colab notebook with a minimal GPT-style language model implemented in PyTorch (token embedding → transformer blocks → LM head), along with training and inference code. On a small toy dataset, training currently fails to reach the target loss.
Your task:
- Identify and fix the following three issues so that training loss drops below a specified threshold on a small dataset:
  - Incorrect attention masking (causal mask and/or padding mask mishandled).
  - A bug in the training loop (e.g., missing optimizer.zero_grad(), not calling model.train(), a misaligned input/target token shift, wrong device placement, or incorrect loss reduction).
  - Missing positional encoding integration.
- Provide concrete code changes (edits or snippets) that implement each fix.
- Provide unit tests that would have caught each bug.
- Include a brief rationale for each fix.
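For orientation, the three fixes typically take the following shape. This is an illustrative sketch only: `TinyGPT`, `train_step`, and all hyperparameters here are hypothetical stand-ins, not the notebook's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyGPT(nn.Module):
    """Minimal GPT-style model showing the mask and position fixes."""
    def __init__(self, vocab_size, d_model=64, n_head=4, n_layer=2, max_len=64):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        # Fix 3: positional encodings (learned here), added to token embeddings.
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_head,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layer,
                                            enable_nested_tensor=False)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, idx, pad_id=0):
        T = idx.size(1)
        pos = torch.arange(T, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)          # Fix 3 applied
        # Fix 1: causal mask (True = blocked) plus a padding mask.
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool,
                                       device=idx.device), diagonal=1)
        x = self.blocks(x, mask=causal,
                        src_key_padding_mask=(idx == pad_id))
        return self.lm_head(x)

def train_step(model, batch, optimizer, pad_id=0):
    model.train()                                  # Fix 2: training mode on
    inputs, targets = batch[:, :-1], batch[:, 1:]  # Fix 2: next-token shift
    logits = model(inputs, pad_id=pad_id)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1),
                           ignore_index=pad_id)    # pad positions ignored
    optimizer.zero_grad()                          # Fix 2: clear stale grads
    loss.backward()
    optimizer.step()
    return loss.item()
```

Note the mask conventions: in PyTorch's `nn.TransformerEncoder`, a boolean `mask` entry of `True` means "do not attend", so the causal mask is the strict upper triangle, and `src_key_padding_mask` is `True` exactly at pad positions.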
Assume:
- PyTorch 2.x, CUDA if available.
- A tiny character-level dataset (or synthetic tokens) with a small vocab and fixed max sequence length (e.g., 64), with optional padding. Use ignore_index for pad.
- Success criterion: training loss on the toy training set drops below 1.0 within a few epochs on CPU, or below 0.5 on GPU.
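The unit tests requested above can be written as model-agnostic property checks. The sketch below is one way to do it; the `check_*` helpers and the `model(idx) -> logits` calling convention are assumptions about the notebook's interface, not its actual API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def check_causality(model, vocab_size=11, T=8):
    """Catches the mask bug: changing a future token must not
    change the logits at earlier positions."""
    model.eval()
    a = torch.randint(1, vocab_size, (1, T))
    b = a.clone()
    b[0, -1] = (b[0, -1] % (vocab_size - 1)) + 1  # perturb only the last token
    with torch.no_grad():
        la, lb = model(a), model(b)
    return torch.allclose(la[0, :-1], lb[0, :-1], atol=1e-5)

def check_position_sensitivity(model, token=5, T=8):
    """Catches the missing positional encoding: on a constant sequence,
    causal self-attention without positions yields identical logits at
    every position by symmetry."""
    model.eval()
    idx = torch.full((1, T), token, dtype=torch.long)
    with torch.no_grad():
        logits = model(idx)
    return not torch.allclose(logits[0, 0], logits[0, -1], atol=1e-5)

def check_loss_decreases(model, vocab_size=11, steps=30):
    """Catches training-loop bugs: a working loop should start
    overfitting a single tiny batch within a few dozen steps."""
    torch.manual_seed(0)
    batch = torch.randint(1, vocab_size, (4, 16))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    losses = []
    for _ in range(steps):
        model.train()
        inputs, targets = batch[:, :-1], batch[:, 1:]
        logits = model(inputs)
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               targets.reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()
        losses.append(loss.item())
    return losses[-1] < losses[0]
```

Each check fails on the corresponding bug (e.g., a missing causal mask breaks `check_causality`, and a loop that never calls `zero_grad()` tends to diverge and fail `check_loss_decreases`), so running all three pinpoints which fix is still missing.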