This question evaluates a candidate's understanding of normalization techniques in deep learning, specifically Layer Normalization in Transformer blocks. It covers the LayerNorm equation, the roles of gamma, beta, and epsilon, and how LayerNorm contrasts with BatchNorm and RMSNorm.
Explain Layer Normalization (LayerNorm) as used in Transformer blocks. Address:
- the LayerNorm equation;
- the roles of gamma, beta, and epsilon;
- how LayerNorm contrasts with BatchNorm and RMSNorm.
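For reference, one common way to write the per-feature LayerNorm computation (a sketch under the usual conventions, with $d$ the feature dimension; the symbols correspond to the gamma, beta, and epsilon named above):

$$\mu = \frac{1}{d}\sum_{i=1}^{d} x_i,\qquad \sigma^2 = \frac{1}{d}\sum_{i=1}^{d}(x_i - \mu)^2,\qquad \mathrm{LN}(x)_i = \gamma_i\,\frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta_i$$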
Assume a standard Transformer block contains two sublayers: Multi-Head Attention and a Feed-Forward Network (MLP), each wrapped by a residual connection.
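A minimal PyTorch sketch of this block structure, assuming a pre-norm (Pre-LN) placement of LayerNorm; the class and parameter names here are illustrative, not taken from the question itself:

```python
import torch
import torch.nn as nn


class LayerNorm(nn.Module):
    """Per-token LayerNorm: normalize over the feature dimension, then scale and shift."""

    def __init__(self, d_model: int, eps: float = 1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(d_model))   # learnable scale, initialized to 1
        self.beta = nn.Parameter(torch.zeros(d_model))   # learnable shift, initialized to 0
        self.eps = eps                                    # small constant for numerical stability

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Statistics are computed per token over the last (feature) dimension,
        # independently of batch size and sequence length.
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        return self.gamma * (x - mean) / torch.sqrt(var + self.eps) + self.beta


class PreLNTransformerBlock(nn.Module):
    """One block: two sublayers (attention, MLP), each wrapped as x + Sublayer(LN(x))."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.ln1 = LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        normed = self.ln1(x)
        attn_out, _ = self.attn(normed, normed, normed, need_weights=False)
        x = x + attn_out               # residual connection around attention
        x = x + self.mlp(self.ln2(x))  # residual connection around the MLP
        return x


# Example: a batch of 2 sequences, 16 tokens each, model width 64.
block = PreLNTransformerBlock(d_model=64, n_heads=4, d_ff=256)
out = block(torch.randn(2, 16, 64))
print(out.shape)  # torch.Size([2, 16, 64])
```

A post-norm (Post-LN) variant would instead apply LayerNorm after each residual addition, i.e. LN(x + Sublayer(x)); either placement is consistent with the block structure assumed above, and the choice is a reasonable point for the candidate to discuss.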