Layer Normalization in Transformers: Placement, Gradients, and Practical Trade-offs
Task
Explain Layer Normalization (LayerNorm) as used in Transformer blocks. Address:
- Where LayerNorm is applied: pre-norm vs post-norm, with precise formulas and where it sits relative to the residual connection (reference formulas follow this list).
- Why this placement affects gradient flow and training stability.
- The LayerNorm equation and the roles of gamma, beta, and epsilon.
- A contrast with BatchNorm and RMSNorm: what is normalized, dependence on the batch, and pros/cons (a code sketch follows the note below).
- Practical guidance: initialization, placement choice (before vs after the residual), and implications for inference latency and memory.
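For reference, a sketch of the formulas in question, in one common notation (the symbol names here are illustrative; any consistent convention is acceptable in the answer):

```latex
% LayerNorm over the feature dimension of a single token x \in \mathbb{R}^d
\[
\mu = \frac{1}{d}\sum_{i=1}^{d} x_i,\qquad
\sigma^2 = \frac{1}{d}\sum_{i=1}^{d}(x_i-\mu)^2,\qquad
\mathrm{LN}(x)_i = \gamma_i\,\frac{x_i-\mu}{\sqrt{\sigma^2+\epsilon}} + \beta_i
\]

% Post-norm (original Transformer): LayerNorm after the residual add
\[ y = \mathrm{LN}\bigl(x + \mathrm{Sublayer}(x)\bigr) \]

% Pre-norm: LayerNorm inside the residual branch, before the sublayer
\[ y = x + \mathrm{Sublayer}\bigl(\mathrm{LN}(x)\bigr) \]
```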
Assume a standard Transformer block with two sublayers: Multi-Head Attention and a Feed-Forward Network (MLP), each wrapped in a residual connection.
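As a concrete reference point, here is a minimal NumPy sketch of the operations the answer should describe; the function names and signatures are ours for illustration and are not taken from any particular library:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each token over its feature dimension (last axis).

    x: (..., d) activations; gamma, beta: (d,) learned scale and shift.
    Statistics are computed per token, so there is no batch dependence.
    """
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def rms_norm(x, gamma, eps=1e-5):
    """RMSNorm: rescale by the root-mean-square only; no mean subtraction
    and no beta, which makes it slightly cheaper than LayerNorm."""
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return gamma * x / rms

def post_norm_block(x, attn, mlp, ln1, ln2):
    """Original (post-norm) block: LayerNorm sits after each residual add,
    i.e. on the main residual path."""
    x = ln1(x + attn(x))
    x = ln2(x + mlp(x))
    return x

def pre_norm_block(x, attn, mlp, ln1, ln2):
    """Pre-norm block: LayerNorm sits inside each residual branch,
    leaving an identity path from block input to block output."""
    x = x + attn(ln1(x))
    x = x + mlp(ln2(x))
    return x

# Example: a (batch, seq, d) tensor normalized per token over its features.
d = 8
x = np.random.randn(2, 4, d)
out = layer_norm(x, gamma=np.ones(d), beta=np.zeros(d))
```

For contrast, BatchNorm computes its statistics across the batch (and typically the sequence axis) per feature, which introduces a batch dependence and a train/inference discrepancy that LayerNorm and RMSNorm avoid.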