Explain the Transformer Architecture
Goal
Provide a clear, step-by-step explanation of a Transformer layer, covering the components listed below (their composition is summarized right after this list):
- Multi-head self-attention (MHA)
- Residual connections
- Layer normalization
- Position-wise feed-forward network (FFN)
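For orientation, these four pieces compose into one layer as two residual sublayers. A compact way to write this (using the post-LN ordering of the original Transformer; pre-LN variants instead normalize before each sublayer) is:

Z = LayerNorm(X + MHA(X))
Y = LayerNorm(Z + FFN(Z))

Every sublayer maps inputs of shape (batch, sequence length, model dimension) to outputs of the same shape, which is what makes the residual additions well defined.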
Also explain why an FFN is needed after attention and what benefits it provides. Walk through the vector computations with matrix dimensions and shapes (see the shape-annotated sketch after this list), including:
- Q/K/V projections
- Attention score scaling and softmax
- Weighted sum to form context vectors
- Concatenation of heads and output projection
- Residual pathways and layer norms
- The FFN’s two linear layers with activation
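Below is a minimal NumPy sketch of these steps for a single encoder layer. The weight names (W_q, W_k, W_v, W_o, W_1, W_2), the random initialization, and the choice d_ff = 4 * d_model are illustrative assumptions; padding/causal masks, dropout, biases, and LayerNorm's learned scale and shift are omitted for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)     # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize over the feature dimension; learned gamma/beta omitted in this sketch.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def encoder_layer(X, h, rng):
    B, L, d_model = X.shape
    d_k = d_model // h                          # per-head dimension (d_k = d_v here)
    d_ff = 4 * d_model                          # common FFN width; an assumed choice
    init = lambda *shape: rng.standard_normal(shape) * 0.02   # toy initialization
    W_q, W_k, W_v, W_o = (init(d_model, d_model) for _ in range(4))
    W_1, W_2 = init(d_model, d_ff), init(d_ff, d_model)

    # 1) Q/K/V projections: (B, L, d_model) -> (B, L, d_model) each
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # Split into h heads: (B, L, d_model) -> (B, h, L, d_k)
    split = lambda t: t.reshape(B, L, h, d_k).transpose(0, 2, 1, 3)
    Q, K, V = split(Q), split(K), split(V)

    # 2) Scaled scores and softmax: (B, h, L, d_k) x (B, h, d_k, L) -> (B, h, L, L)
    scores = Q @ K.transpose(0, 1, 3, 2) / np.sqrt(d_k)
    A = softmax(scores, axis=-1)                # each row sums to 1 over the keys

    # 3) Weighted sum of values -> per-head context vectors: (B, h, L, d_k)
    context = A @ V

    # 4) Concatenate heads and apply the output projection: back to (B, L, d_model)
    context = context.transpose(0, 2, 1, 3).reshape(B, L, d_model)
    attn_out = context @ W_o

    # 5) Residual connection + layer norm around attention
    Z = layer_norm(X + attn_out)

    # 6) Position-wise FFN: two linear layers with a ReLU in between
    ffn_out = np.maximum(0.0, Z @ W_1) @ W_2

    # 7) Residual connection + layer norm around the FFN
    return layer_norm(Z + ffn_out)

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 7, 16))             # B=2, L=7, d_model=16
Y = encoder_layer(X, h=4, rng=rng)
print(Y.shape)                                  # (2, 7, 16): the layer preserves shape
```

A decoder layer runs the same computation but applies a causal mask to the scores before the softmax (position i may only attend to positions ≤ i) and inserts an additional cross-attention sublayer whose K and V come from the encoder output.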
Optionally compare encoder vs. decoder layers and discuss how representations evolve across stacked layers.
Assumptions and Notation
- Batch size: B
- Sequence length: L
- Model dimension: d_model
- Number of heads: h
- Per-head dimensions: d_k = d_v = d_model / h (typical)
- Input token embeddings plus positional encodings: X ∈ R^{B × L × d_model}
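A concrete instance of this notation, taking the common base configuration d_model = 512 and h = 8 as an assumed example:
- d_k = d_v = 512 / 8 = 64
- Per head: Q, K, V ∈ R^{B × L × 64}, and the score matrix QKᵀ / √d_k ∈ R^{B × L × L}
- Concatenating the h heads gives R^{B × L × 512}, which the output projection (W_o in the sketch above, shape 512 × 512) maps back to R^{B × L × d_model}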