Explain Layer Normalization in Transformers

Q: Explain Layer Normalization in Transformers

This question evaluates a candidate's understanding of normalization techniques in deep learning, specifically Layer Normalization in Transformer blocks. It covers the LayerNorm equation, the roles of gamma, beta, and epsilon, and the contrasts with BatchNorm and RMSNorm. It falls under the Machine Learning domain.

Question

Layer Normalization in Transformers: Placement, Gradients, and Practical Trade-offs

Task

Explain Layer Normalization (LayerNorm) as used in Transformer blocks. Address:

Where LayerNorm is applied: pre-norm vs post-norm, with precise formulas and where it sits relative to the residual connection.
Why this placement affects gradient flow and training stability.
The LayerNorm equation and the roles of gamma, beta, and epsilon.
A contrast with BatchNorm and RMSNorm (what is normalized, dependence on batch, pros/cons).
Practical guidance: initialization, placement choice (before vs after residual), and implications for inference latency and memory.
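To make the equation and the RMSNorm contrast in the list above concrete, here is a minimal pure-Python sketch for a single token's feature vector. It assumes the commonly used forms y = gamma * (x - mean) / sqrt(var + eps) + beta for LayerNorm and y = gamma * x / sqrt(mean(x^2) + eps) for RMSNorm. The function names, the default eps of 1e-5, and the sample input are illustrative choices, not taken from the question.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one token's features to zero mean and unit variance,
    then apply the learned per-feature scale (gamma) and shift (beta)."""
    d = len(x)
    mean = sum(x) / d
    var = sum((v - mean) ** 2 for v in x) / d   # biased variance over features
    inv_std = 1.0 / math.sqrt(var + eps)        # eps guards against var == 0
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]

def rms_norm(x, gamma, eps=1e-5):
    """RMSNorm: rescale by the root mean square only.
    There is no mean subtraction and no beta."""
    d = len(x)
    mean_sq = sum(v * v for v in x) / d
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [g * v * inv_rms for v, g in zip(x, gamma)]

# Illustrative input: gamma = 1 and beta = 0 are the usual initial values.
x = [1.0, 2.0, 3.0, 4.0]
ones, zeros = [1.0] * 4, [0.0] * 4
y_ln = layer_norm(x, ones, zeros)
y_rms = rms_norm(x, ones)
```

Both functions use statistics of a single token only, so neither depends on the batch, unlike BatchNorm. In this sketch the LayerNorm output is centered at zero, while the RMSNorm output keeps the sign and relative offset of the input and only its scale changes.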

Assume a standard Transformer block contains two sublayers per block: Multi-Head Attention and a Feed-Forward Network (MLP), each wrapped by a residual connection.
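The two placements can be sketched for the block described above. This is a toy illustration under stated assumptions: `attn` and `mlp` are hypothetical stand-in callables that map a vector to a vector of the same length (real attention mixes information across tokens), and gamma and beta are fixed at 1 and 0 for brevity. Post-norm computes x = LN(x + Sublayer(x)); pre-norm computes x = x + Sublayer(LN(x)).

```python
import math

def layer_norm(x, eps=1e-5):
    # LayerNorm with gamma = 1 and beta = 0, applied to one token's features.
    d = len(x)
    mean = sum(x) / d
    var = sum((v - mean) ** 2 for v in x) / d
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std for v in x]

def add(a, b):
    # Element-wise residual addition.
    return [u + v for u, v in zip(a, b)]

def post_norm_block(x, attn, mlp):
    # Post-norm: normalize AFTER the residual add, so the skip path
    # itself passes through LayerNorm at every sublayer.
    x = layer_norm(add(x, attn(x)))
    x = layer_norm(add(x, mlp(x)))
    return x

def pre_norm_block(x, attn, mlp):
    # Pre-norm: normalize only the sublayer input; the skip path
    # carries x forward unchanged.
    x = add(x, attn(layer_norm(x)))
    x = add(x, mlp(layer_norm(x)))
    return x

# Hypothetical stand-in sublayer that contributes nothing.
zero = lambda v: [0.0] * len(v)

x = [1.0, 2.0, 3.0, 4.0]
pre_out = pre_norm_block(x, zero, zero)
post_out = post_norm_block(x, zero, zero)
```

With sublayers that output zeros, the pre-norm block reduces to the identity, while the post-norm block still rescales its input. That unnormalized skip path is the usual argument for why gradients reach early layers more directly in pre-norm stacks.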
