PracHub

Explain Layer Normalization in Transformers

Last updated: Mar 29, 2026

Quick Overview

This Machine Learning question tests a candidate's understanding of normalization techniques in deep learning, specifically Layer Normalization in Transformer blocks. It covers the LayerNorm equation, the roles of gamma, beta, and epsilon, and how LayerNorm contrasts with BatchNorm and RMSNorm.

Company: Amazon

Role: Machine Learning Engineer

Category: Machine Learning

Difficulty: medium

Interview Round: Onsite

Explain Layer Normalization in Transformers. Where is it applied (pre-norm vs. post-norm), and why does that choice affect gradient flow and training stability? Write the LayerNorm equation, discuss the roles of gamma, beta, and epsilon, and contrast LayerNorm with BatchNorm and RMSNorm. Include guidance on initialization, on placing the normalization before vs. after the residual connection, and on the implications for inference latency and memory.

Sep 6, 2025, 12:00 AM

Layer Normalization in Transformers: Placement, Gradients, and Practical Trade-offs

Task

Explain Layer Normalization (LayerNorm) as used in Transformer blocks. Address:

  1. Where LayerNorm is applied: pre-norm vs post-norm, with precise formulas and where it sits relative to the residual connection.
  2. Why this placement affects gradient flow and training stability.
  3. The LayerNorm equation and the roles of gamma, beta, and epsilon.
  4. A contrast with BatchNorm and RMSNorm (what is normalized, dependence on batch, pros/cons).
  5. Practical guidance: initialization, placement choice (before vs after residual), and implications for inference latency and memory.
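
As a starting point for items 3 and 4, here is a minimal NumPy sketch of the usual LayerNorm definition, y = gamma * (x - mean) / sqrt(var + eps) + beta, with the mean and variance taken over the feature axis of each token. The function names, tensor shapes, and eps value are illustrative assumptions rather than any particular library's API. A common RMSNorm variant is included for contrast.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """LayerNorm over the last (feature) axis of each token vector."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)       # biased variance (divide by d_model)
    x_hat = (x - mean) / np.sqrt(var + eps)   # eps guards against division by ~0
    return gamma * x_hat + beta               # learnable per-feature scale and shift

def rms_norm(x, gamma, eps=1e-5):
    """RMSNorm sketch: rescale by the root mean square, no centering, no beta."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return gamma * x / rms

# Hypothetical shapes: (batch, seq_len, d_model)
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(2, 4, 8))
gamma = np.ones(8)    # common initialization: scale starts at 1
beta = np.zeros(8)    # common initialization: shift starts at 0

y_ln = layer_norm(x, gamma, beta)
y_rms = rms_norm(x, gamma)
```

Both functions use only per-token statistics, so the output for one example does not depend on the rest of the batch. BatchNorm, by contrast, normalizes each feature across the batch and relies on running statistics at inference time. RMSNorm skips the mean subtraction and the shift, which saves a reduction and a parameter vector per layer.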

Assume a standard Transformer block with two sublayers, Multi-Head Attention and a Feed-Forward Network (MLP), each wrapped by a residual connection.
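
Under that assumption, the two placements can be sketched as follows. Post-norm computes x = LN(x + Sublayer(x)), while pre-norm computes x = x + Sublayer(LN(x)). The sublayers here are toy stand-ins (hypothetical `toy_attn` and `toy_ffn`, not real attention or a real MLP), and gamma and beta are fixed at 1 and 0 for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # gamma = 1 and beta = 0 are assumed here to keep the sketch short
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_norm_block(x, attn, ffn):
    """Post-norm: normalize AFTER adding the residual."""
    x = layer_norm(x + attn(x))
    x = layer_norm(x + ffn(x))
    return x

def pre_norm_block(x, attn, ffn):
    """Pre-norm: normalize the sublayer INPUT; the residual path stays untouched."""
    x = x + attn(layer_norm(x))
    x = x + ffn(layer_norm(x))
    return x

rng = np.random.default_rng(0)
d_model = 8
w_attn = rng.normal(scale=0.1, size=(d_model, d_model))
w_ffn = rng.normal(scale=0.1, size=(d_model, d_model))

def toy_attn(h):
    return h @ w_attn            # stand-in for multi-head attention

def toy_ffn(h):
    return np.tanh(h @ w_ffn)    # stand-in for the feed-forward network

x = rng.normal(size=(2, 4, d_model))   # (batch, seq_len, d_model)
y_post = post_norm_block(x, toy_attn, toy_ffn)
y_pre = pre_norm_block(x, toy_attn, toy_ffn)
```

In the pre-norm form the input reaches the output through an unmodified identity path, so gradients can flow back through the additions without passing through a normalization at every layer. This is the usual explanation for why pre-norm tends to train more stably in deep stacks. In the post-norm form every residual sum is renormalized, so the output of each block has normalized statistics but the identity path is no longer clean.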

