Transformer Architecture: Components, Shapes, Complexity, and Design Choices
Task
Explain the Transformer architecture in detail for a standard encoder–decoder model used in sequence-to-sequence modeling (e.g., machine translation).
Layer composition
- What sublayers are in each encoder and decoder layer?
- Why include a position-wise feed-forward network (FFN)?
- How do residual connections and layer normalization interact (pre-LN vs post-LN)?
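To anchor the pre-LN vs post-LN question, here is a minimal sketch (PyTorch assumed; the class name and default sizes are illustrative, not taken from any particular codebase) of one encoder layer's two sublayers, self-attention and the position-wise FFN, each wrapped in a residual connection, with layer normalization placed either before the sublayer (pre-LN) or after the residual add (post-LN). A decoder layer follows the same pattern with three sublayers: masked self-attention, encoder–decoder cross-attention, and the FFN.

```python
# Minimal sketch of sublayer ordering in one encoder layer (illustrative only).
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):  # hypothetical name, for illustration
    def __init__(self, d_model=512, h=8, d_ff=2048, pre_ln=True):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, h, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.pre_ln = pre_ln

    def forward(self, x):  # x: (batch, n, d_model)
        if self.pre_ln:
            # Pre-LN: normalize the sublayer input; the residual path is untouched.
            a = self.ln1(x)
            x = x + self.attn(a, a, a, need_weights=False)[0]
            x = x + self.ffn(self.ln2(x))
        else:
            # Post-LN (original formulation): normalize after the residual add.
            x = self.ln1(x + self.attn(x, x, x, need_weights=False)[0])
            x = self.ln2(x + self.ffn(x))
        return x
```

Post-LN matches the original formulation but tends to need learning-rate warmup and careful initialization to train deep stacks; pre-LN keeps the residual path identity-like and generally trains more stably, which is why most recent models use it.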
Multi-head attention shapes
- For d_model = 512, h = 8 heads, and sequence length n, give the tensor shapes for Q, K, V, the attention scores, the attention weights, the per-head outputs, the concatenated output, and the final projection output. State shapes both per-head and stacked across heads; you may omit the batch dimension for clarity.
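The shapes are easiest to verify by running them. A small sketch under the question's assumptions (PyTorch assumed; weights are random because only shapes matter here), with d_model = 512 and h = 8 so that d_k = d_v = 64, batch dimension omitted:

```python
# Shape walk-through for multi-head attention with d_model=512, h=8 (d_k=d_v=64).
import torch

n, d_model, h = 10, 512, 8          # n is an arbitrary sequence length
d_k = d_v = d_model // h            # 64

x = torch.randn(n, d_model)         # input: (n, 512)
W_q, W_k, W_v, W_o = (torch.randn(d_model, d_model) for _ in range(4))

# Stacked across heads the projections are (n, 512); split per head: (h, n, 64).
Q = (x @ W_q).view(n, h, d_k).transpose(0, 1)           # (8, n, 64)
K = (x @ W_k).view(n, h, d_k).transpose(0, 1)           # (8, n, 64)
V = (x @ W_v).view(n, h, d_v).transpose(0, 1)           # (8, n, 64)

scores = Q @ K.transpose(-2, -1) / d_k ** 0.5           # (8, n, n) attention scores
weights = scores.softmax(dim=-1)                        # (8, n, n) attention weights
head_out = weights @ V                                  # (8, n, 64) per-head outputs
concat = head_out.transpose(0, 1).reshape(n, h * d_v)   # (n, 512) concatenated heads
out = concat @ W_o                                      # (n, 512) final projection

for name, t in [("Q", Q), ("scores", scores), ("weights", weights),
                ("head_out", head_out), ("concat", concat), ("out", out)]:
    print(name, tuple(t.shape))
```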
Computational complexity
- Derive the time and memory complexity of self-attention with respect to n and d (assume d = d_model and typical d_k = d_v = d/h). Also note cross-attention complexity.
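As a crib for the derivation, a compact version under the stated assumptions (d = d_model, d_k = d_v = d/h; n is the self-attention length, m the source length for cross-attention):

```latex
% Projections: Q, K, V and the output projection are (n x d)(d x d) matmuls.
T_{\text{proj}} = O(n d^2)
% Scores and weighted sum, per head: (n x d_k)(d_k x n) and (n x n)(n x d_k), over h heads.
T_{\text{attn}} = O(h \cdot n^2 d_k) = O(n^2 d)
% Total time; the dominant extra memory is the h score/weight matrices plus activations.
T_{\text{self}} = O(n^2 d + n d^2), \qquad M_{\text{self}} = O(h n^2 + n d)
% Cross-attention (n decoder queries attending to m encoder states):
T_{\text{cross}} = O(n m d + (n + m) d^2), \qquad M_{\text{cross}} = O(h n m + (n + m) d)
```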
Implementation choices and trade-offs
- Q/K/V projections (separate vs fused, multi-query/grouped-query attention); see the sketch after this list.
- Number of heads and head dimension.
- Positional encodings (sinusoidal, learned, relative/rotary/ALiBi).
- Include practical trade-offs (speed, memory, quality, length generalization).
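For the Q/K/V projection item, the practical difference between standard multi-head attention (MHA), grouped-query attention (GQA), and multi-query attention (MQA) is how many K/V heads are kept; the sketch below (PyTorch assumed; the function name and sizes are illustrative) just compares projection sizes and per-token KV-cache cost.

```python
# K/V projection sizes for MHA vs. grouped-query (GQA) vs. multi-query (MQA)
# attention. Illustrative only: compares parameter counts and KV-cache size.
import torch.nn as nn

def qkv_projections(d_model=512, n_heads=8, n_kv_heads=8):
    """n_kv_heads == n_heads -> MHA, 1 -> MQA, anything in between -> GQA."""
    d_head = d_model // n_heads
    w_q = nn.Linear(d_model, n_heads * d_head, bias=False)     # queries keep all heads
    w_k = nn.Linear(d_model, n_kv_heads * d_head, bias=False)  # fewer K/V heads shrink
    w_v = nn.Linear(d_model, n_kv_heads * d_head, bias=False)  # weights and the KV cache
    return w_q, w_k, w_v

d_model, n_heads = 512, 8
for n_kv in (8, 2, 1):  # MHA, GQA, MQA
    w_q, w_k, w_v = qkv_projections(d_model, n_heads, n_kv)
    kv_params = sum(p.numel() for p in (*w_k.parameters(), *w_v.parameters()))
    kv_cache_per_token = 2 * n_kv * (d_model // n_heads)   # K and V floats cached
    print(f"n_kv_heads={n_kv}: K/V projection params={kv_params}, "
          f"KV-cache floats per token per layer={kv_cache_per_token}")
```

The trade-off: MQA/GQA cut KV-cache memory and memory bandwidth during autoregressive decoding, where attention is usually memory-bound, typically at a small quality cost relative to full MHA. Fusing the separate Q/K/V projections into a single `nn.Linear(d_model, 3 * d_model)` computes the same result in one larger matmul, which is usually a modest speed win on GPUs.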