Explain Transformer Layers and FFN Rationale
Company: UiPath
Role: Machine Learning Engineer
Category: Machine Learning
Difficulty: medium
Interview Round: Technical Screen
##### Question
Explain the Transformer architecture in detail, then walk through the math step by step.
1. Describe the components of each **encoder** and **decoder** layer: multi-head (self-)attention, residual ("Add") connections, layer normalization, and the position-wise feed-forward network (FFN). How do residual connections and layer normalization interact (Pre-LN vs. Post-LN)?
2. Why is a position-wise FFN needed *after* attention? What does it add that attention alone cannot provide?
3. Walk through the vector/matrix computations with shapes: the Q/K/V projections, attention-score scaling and softmax, the weighted sum that forms the context vectors, concatenation of heads, output projection, residual pathways, layer norms, and the FFN's two linear layers with activation. Use a concrete config (e.g. `d_model = 512`, `h = 8`, sequence length `n`) and give shapes for Q, K, V, the attention scores, and the block output.
4. Derive the **computational complexity** of self-attention with respect to sequence length `n` and model dimension `d` (i.e. `d_model`), and note where memory, rather than compute, becomes the bottleneck.
5. Discuss common **implementation choices and trade-offs**: Q/K/V projection layout (separate vs. fused, MQA/GQA), number of heads vs. per-head dimension, positional-encoding schemes (sinusoidal, learned, relative, RoPE, ALiBi), and FFN/normalization variants (GELU/SwiGLU, RMSNorm).
6. (Optional) Compare encoder vs. decoder layers, and describe how representations evolve across the stack of layers.
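
The shape walk-through requested in item 3 can be sketched in NumPy as below. This is a minimal illustration, not a reference implementation: it assumes a Pre-LN encoder block, a ReLU activation, `d_ff = 2048`, no learned LayerNorm gain/bias, no masking, and no dropout. The names `encoder_block`, `init_params`, and the parameter keys are hypothetical, chosen only for this example.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector over the feature axis (learned gain/bias omitted).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def init_params(d_model, d_ff, seed=0):
    # Small random weights; the 0.02 scale is an arbitrary choice for the demo.
    rng = np.random.default_rng(seed)
    w = lambda *shape: rng.normal(scale=0.02, size=shape)
    return {
        "w_q": w(d_model, d_model), "w_k": w(d_model, d_model),
        "w_v": w(d_model, d_model), "w_o": w(d_model, d_model),
        "w_1": w(d_model, d_ff), "b_1": np.zeros(d_ff),
        "w_2": w(d_ff, d_model), "b_2": np.zeros(d_model),
    }

def encoder_block(x, params, h):
    n, d_model = x.shape
    d_k = d_model // h  # per-head dimension
    shapes = {}

    # Self-attention sub-layer (Pre-LN: normalize, transform, then add the residual).
    y = layer_norm(x)
    q, k, v = y @ params["w_q"], y @ params["w_k"], y @ params["w_v"]  # each (n, d_model)

    def split_heads(t):
        # (n, d_model) -> (h, n, d_k)
        return t.reshape(n, h, d_k).transpose(1, 0, 2)

    q, k, v = split_heads(q), split_heads(k), split_heads(v)
    shapes["Q"] = q.shape

    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)  # (h, n, n), scaled dot products
    weights = softmax(scores, axis=-1)                # each row sums to 1
    context = weights @ v                             # (h, n, d_k), weighted sum of values
    shapes["scores"] = scores.shape
    shapes["context"] = context.shape

    concat = context.transpose(1, 0, 2).reshape(n, d_model)  # concatenate heads: (n, d_model)
    x = x + concat @ params["w_o"]                           # output projection + residual

    # Position-wise FFN sub-layer: the same two linear maps applied to every token.
    y = layer_norm(x)
    hidden = np.maximum(0.0, y @ params["w_1"] + params["b_1"])  # (n, d_ff), ReLU
    x = x + hidden @ params["w_2"] + params["b_2"]               # back to (n, d_model) + residual
    shapes["ffn_hidden"] = hidden.shape
    shapes["output"] = x.shape
    return x, shapes

# Example with d_model = 512, h = 8, and a sequence of n = 10 tokens.
n, d_model, h, d_ff = 10, 512, 8, 2048
x = np.random.default_rng(42).normal(size=(n, d_model))
out, shapes = encoder_block(x, init_params(d_model, d_ff), h)
for name, shape in shapes.items():
    print(name, shape)
```

With this config, Q (and likewise K and V) has shape `(8, 10, 64)` after the head split, the attention scores are `(8, 10, 10)`, and the block output returns to `(10, 512)`, so blocks can be stacked. A Post-LN variant would instead apply `layer_norm` after each residual addition.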
Quick Answer: This question evaluates a candidate's understanding of the Transformer architecture: multi-head self-attention, residual connections and layer normalization (Pre-LN vs. Post-LN), and the position-wise feed-forward network and why it is needed. It also tests tensor-shape reasoning through Q/K/V projections and attention, derivation of the O(n²d + nd²) complexity, and implementation trade-offs such as MQA/GQA, positional-encoding schemes, and normalization choices.
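
The complexity figure quoted above can be reconstructed by counting multiply-adds per layer. The sketch below assumes standard multi-head attention with `h` heads of size `d_k = d / h`, square `d × d` projection matrices, and no fused or sparse kernels; constant factors are approximate.

```latex
\begin{aligned}
\text{Q, K, V projections:} \quad & 3 \cdot n d^2 \\
\text{Scores } QK^\top \text{ (all heads):} \quad & h \cdot n^2 d_k = n^2 d \\
\text{Weighted sum } \mathrm{softmax}(\cdot)\,V \text{ (all heads):} \quad & h \cdot n^2 d_k = n^2 d \\
\text{Output projection:} \quad & n d^2 \\[4pt]
\text{Total:} \quad & 4 n d^2 + 2 n^2 d = O(n^2 d + n d^2)
\end{aligned}
```

Under these counts the quadratic term overtakes the projection term once `n > 2d`. On the memory side, the attention weights take `h · n²` entries per layer, compared with `O(n d)` for the token activations, so for long sequences the `n × n` score matrices are typically what dominates memory.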