Implement Scaled Dot-Product Attention and a Transformer Block (No Autograd)
Context: Build multi-head self-attention and a Transformer encoder-style block (self-attention plus a position-wise feed-forward sublayer, made causal with a mask) from scratch for autoregressive next-token prediction. Implement both the forward and backward passes manually (no autograd). A numerically stable softmax and its backward are provided.
Assume:
- Inputs: X ∈ R^{B×T×d_model}
- Multi-head attention with h heads; head dimension d_h = d_model / h
- Learnable parameters: W_q, W_k, W_v, W_o ∈ R^{d_model×d_model}
- Causal mask M ∈ R^{T×T}, where M[i, j] = 0 if j ≤ i and M[i, j] = −∞ otherwise
- Provided: softmax(z, axis=-1) and softmax_backward(dY, Y), which maps dY = ∂L/∂softmax(z) to dZ = ∂L/∂z (an illustrative version of both appears below)
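A minimal sketch of the causal mask and the provided helpers, assuming NumPy. The function names match the assumptions above, but these bodies are illustrative rather than the graded interface:

```python
import numpy as np

def causal_mask(T):
    # M[i, j] = 0 if j <= i, -inf otherwise (strict upper triangle is masked).
    M = np.zeros((T, T))
    M[np.triu_indices(T, k=1)] = -np.inf
    return M

def softmax(z, axis=-1):
    # Numerically stable softmax: shift by the row-wise max before exponentiating.
    z_shift = z - np.max(z, axis=axis, keepdims=True)
    e = np.exp(z_shift)
    return e / np.sum(e, axis=axis, keepdims=True)

def softmax_backward(dY, Y, axis=-1):
    # Given Y = softmax(Z) and dY = dL/dY, return dZ = dL/dZ:
    # dZ = Y * (dY - sum(dY * Y)), the row-wise Jacobian-vector product.
    return Y * (dY - np.sum(dY * Y, axis=axis, keepdims=True))
```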
Tasks
- Forward
  - Compute Q, K, V, split into heads, and form scores = Q K^T / sqrt(d_h)
  - Apply the causal mask, softmax, and attention-weighted sum over V; merge heads and apply W_o (see the sketch after this item)
  - If building a full block, add residual connections, LayerNorm, and a position-wise feed-forward sublayer
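A forward-pass sketch under the shapes above (NumPy, batch-first layout). The cache returned here is one possible layout for reuse in the backward pass, not a required interface:

```python
def mha_forward(X, W_q, W_k, W_v, W_o, h):
    # X: (B, T, d_model); W_q, W_k, W_v, W_o: (d_model, d_model)
    B, T, d_model = X.shape
    d_h = d_model // h

    Q, K, V = X @ W_q, X @ W_k, X @ W_v          # each (B, T, d_model)

    # Split into heads: (B, T, d_model) -> (B, h, T, d_h)
    def split(Z):
        return Z.reshape(B, T, h, d_h).transpose(0, 2, 1, 3)
    Qh, Kh, Vh = split(Q), split(K), split(V)

    # Scaled dot-product scores, (B, h, T, T), plus the additive causal mask
    scores = Qh @ Kh.transpose(0, 1, 3, 2) / np.sqrt(d_h)
    scores = scores + causal_mask(T)             # broadcasts over (B, h)

    A = softmax(scores, axis=-1)                 # attention weights, (B, h, T, T)
    Oh = A @ Vh                                  # (B, h, T, d_h)

    # Merge heads back to (B, T, d_model), then apply the output projection
    O = Oh.transpose(0, 2, 1, 3).reshape(B, T, d_model)
    Y = O @ W_o

    cache = (X, Qh, Kh, Vh, A, O, W_q, W_k, W_v, W_o, h)
    return Y, cache
```

Caching Qh, Kh, Vh, A, and O avoids recomputation in the backward pass at the cost of O(B·h·T²) extra memory for the attention weights.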
- Backward
  - Derive and implement gradients for W_q, W_k, W_v, W_o and the input gradient dX
  - Include gradients through the scaling, the matmuls, the masking, the head reshaping/transposes, and the output projection (see the sketch after this item)
  - Use the provided softmax backward to map dScores (the gradient w.r.t. the attention weights) to dLogits (the gradient w.r.t. the masked, scaled scores)
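A matching backward-pass sketch, obtained by reversing each forward step; variable names follow the forward sketch above and are illustrative:

```python
def mha_backward(dY, cache):
    # dY: (B, T, d_model) = dL/dY from the layers above.
    X, Qh, Kh, Vh, A, O, W_q, W_k, W_v, W_o, h = cache
    B, T, d_model = X.shape
    d_h = d_model // h

    # Output projection: Y = O @ W_o
    dW_o = O.reshape(B * T, d_model).T @ dY.reshape(B * T, d_model)
    dO = dY @ W_o.T

    # Un-merge heads: (B, T, d_model) -> (B, h, T, d_h)
    dOh = dO.reshape(B, T, h, d_h).transpose(0, 2, 1, 3)

    # Oh = A @ Vh
    dA = dOh @ Vh.transpose(0, 1, 3, 2)
    dVh = A.transpose(0, 1, 3, 2) @ dOh

    # A = softmax(masked scores); the additive -inf mask is a constant,
    # so its gradient passes through unchanged (masked entries get zero anyway).
    dScores = softmax_backward(dA, A)

    # scores = Qh @ Kh^T / sqrt(d_h)
    dQh = dScores @ Kh / np.sqrt(d_h)
    dKh = dScores.transpose(0, 1, 3, 2) @ Qh / np.sqrt(d_h)

    # Merge head gradients back: (B, h, T, d_h) -> (B, T, d_model)
    def merge(Zh):
        return Zh.transpose(0, 2, 1, 3).reshape(B, T, d_model)
    dQ, dK, dV = merge(dQh), merge(dKh), merge(dVh)

    # Q = X @ W_q (and likewise for K, V); X feeds all three projections
    Xf = X.reshape(B * T, d_model)
    dW_q = Xf.T @ dQ.reshape(B * T, d_model)
    dW_k = Xf.T @ dK.reshape(B * T, d_model)
    dW_v = Xf.T @ dV.reshape(B * T, d_model)
    dX = dQ @ W_q.T + dK @ W_k.T + dV @ W_v.T

    return dX, dW_q, dW_k, dW_v, dW_o
```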
- Autoregressive Training
  - Define the next-token cross-entropy loss with teacher forcing; show where the causal mask is applied (see the sketch after this item)
  - Compute perplexity
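A sketch of the training loss and perplexity, assuming final logits of shape (B, T, V) and integer targets equal to the input tokens shifted left by one position (teacher forcing). Causality is enforced inside attention by the additive mask, not by the loss:

```python
def cross_entropy_loss(logits, targets):
    # logits: (B, T, V) prediction for the token at each position t+1.
    # targets: (B, T) integer token ids (inputs shifted left by one).
    B, T, V = logits.shape

    # Log-sum-exp stabilized log-softmax.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

    # Mean negative log-likelihood of the target tokens.
    b_idx, t_idx = np.arange(B)[:, None], np.arange(T)[None, :]
    nll = -log_probs[b_idx, t_idx, targets]
    loss = nll.mean()

    # Gradient w.r.t. logits: softmax minus one-hot, averaged over B*T tokens.
    dlogits = np.exp(log_probs)
    dlogits[b_idx, t_idx, targets] -= 1.0
    dlogits /= (B * T)

    # Perplexity = exp of the mean per-token negative log-likelihood.
    perplexity = np.exp(loss)
    return loss, dlogits, perplexity
```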
- Verification and Analysis
  - Implement finite-difference gradient checks on small tensors; report maximum relative errors (see the sketch after this list)
  - Discuss numerical stability (e.g., the max-shifted softmax and the log-sum-exp trick)
  - Provide clear function signatures and tensor shapes at each step
  - Include time and memory complexity analysis
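A finite-difference check sketch (central differences on tiny tensors), reusing the mha_forward/mha_backward sketches above with the scalar loss L = sum(Y). The helper name and the parameter dict are illustrative:

```python
def grad_check(name, P, X, params, eps=1e-5):
    # name: one of 'dX', 'dW_q', 'dW_k', 'dW_v', 'dW_o'.
    # P: the array to perturb in place (pass X itself for 'dX',
    #    or the corresponding entry of params for a weight gradient).
    def loss():
        Y, _ = mha_forward(X, params['W_q'], params['W_k'],
                           params['W_v'], params['W_o'], params['h'])
        return Y.sum()

    Y, cache = mha_forward(X, params['W_q'], params['W_k'],
                           params['W_v'], params['W_o'], params['h'])
    grads = mha_backward(np.ones_like(Y), cache)   # dL/dY = 1 for L = sum(Y)
    analytic = dict(zip(['dX', 'dW_q', 'dW_k', 'dW_v', 'dW_o'], grads))[name]

    numeric = np.zeros_like(P)
    for idx in np.ndindex(P.shape):
        orig = P[idx]
        P[idx] = orig + eps; f_plus = loss()
        P[idx] = orig - eps; f_minus = loss()
        P[idx] = orig
        numeric[idx] = (f_plus - f_minus) / (2 * eps)

    rel_err = np.max(np.abs(analytic - numeric)
                     / (np.abs(analytic) + np.abs(numeric) + 1e-12))
    print(f"{name}: max relative error = {rel_err:.2e}")
    return rel_err
```

Usage, e.g., grad_check('dW_q', params['W_q'], X, params) on a tiny problem such as B = 2, T = 4, d_model = 8, h = 2; with float64 tensors a correct backward pass typically yields maximum relative errors on the order of 1e-6 or smaller.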