Explain Transformer Layers and FFN Rationale

This question evaluates a candidate's understanding of Transformer architecture components such as multi-head self-attention, residual connections, layer normalization, and the position-wise feed-forward network, including competence with tensor shapes and matrix projections.

Question

Explain the Transformer Architecture

Goal

Provide a clear, step-by-step explanation of a Transformer layer, covering:

Multi-head self-attention (MHA)
Residual connections
Layer normalization
Position-wise feed-forward network (FFN)

Also explain why an FFN is needed after attention and what benefits it provides. Walk through the vector computations with matrix dimensions and shapes, including:

Q/K/V projections
Attention score scaling and softmax
Weighted sum to form context vectors
Concatenation of heads and output projection
Residual pathways and layer norms
The FFN’s two linear layers with activation
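The steps listed above can be written out as a worked set of equations. This is a sketch under stated assumptions rather than the only valid formulation: it uses the post-norm ordering (LayerNorm applied after each residual addition), a ReLU activation in the FFN, and a hidden width d_ff that is not defined in the question (commonly a multiple of d_model). B, L, d_model, h, d_k, and d_v are as defined under Assumptions and Notation; shapes are shown for one batch element.

```latex
% Per-head projections, for head i = 1..h (X has shape L x d_model)
Q_i = X W_i^{Q}, \quad K_i = X W_i^{K}, \quad V_i = X W_i^{V},
\qquad W_i^{Q}, W_i^{K} \in \mathbb{R}^{d_{model} \times d_k},\;
       W_i^{V} \in \mathbb{R}^{d_{model} \times d_v}

% Scaled scores and row-wise softmax give attention weights of shape L x L
A_i = \mathrm{softmax}\!\left( \frac{Q_i K_i^{\top}}{\sqrt{d_k}} \right)
\in \mathbb{R}^{L \times L}

% Weighted sum of values gives the context vectors for head i
\mathrm{head}_i = A_i V_i \in \mathbb{R}^{L \times d_v}

% Concatenate heads and apply the output projection
\mathrm{MHA}(X) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^{O},
\qquad W^{O} \in \mathbb{R}^{h d_v \times d_{model}}

% First residual connection and layer norm (post-norm ordering assumed)
H = \mathrm{LayerNorm}\big( X + \mathrm{MHA}(X) \big)

% Position-wise FFN: two linear layers with an activation (ReLU assumed)
\mathrm{FFN}(H) = \max(0,\; H W_1 + b_1)\, W_2 + b_2,
\qquad W_1 \in \mathbb{R}^{d_{model} \times d_{ff}},\;
       W_2 \in \mathbb{R}^{d_{ff} \times d_{model}}

% Second residual connection and layer norm
Y = \mathrm{LayerNorm}\big( H + \mathrm{FFN}(H) \big) \in \mathbb{R}^{L \times d_{model}}
```

With d_k = d_v = d_model / h, the concatenated heads have width h · d_v = d_model, so the layer's output has the same shape as its input and layers can be stacked.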

Optionally compare encoder vs. decoder layers and discuss how representations evolve across stacked layers.
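One concrete difference worth showing in the encoder vs. decoder comparison is the causal mask in decoder self-attention, which stops position i from attending to any later position. (A decoder layer in the original encoder-decoder design also has a cross-attention sub-layer, where queries come from the decoder and keys/values come from the encoder output; that part is not shown.) The following is a minimal NumPy sketch; the helper names `causal_mask` and `masked_softmax` are illustrative, not taken from any library.

```python
import numpy as np

def causal_mask(L):
    """Boolean (L, L) mask: True where query position i may attend to key j <= i."""
    return np.tril(np.ones((L, L), dtype=bool))

def masked_softmax(scores, mask):
    """Row-wise softmax over the last axis, with disallowed entries set to -inf."""
    scores = np.where(mask, scores, -np.inf)
    # Every row keeps its diagonal entry, so the row max is always finite.
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)  # exp(-inf) == 0, so masked positions get zero weight
    return e / e.sum(axis=-1, keepdims=True)

# With all-zero scores, each row is uniform over the positions it is allowed to see.
weights = masked_softmax(np.zeros((4, 4)), causal_mask(4))
print(weights[1])  # [0.5 0.5 0.  0. ]
```

An encoder layer uses the same attention computation without this mask, so every position can attend to every other position.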

Assumptions and Notation

Batch size: B
Sequence length: L
Model dimension: d_model
Number of heads: h
Per-head dimensions: d_k = d_v = d_model / h (typical)
Input token embeddings plus positional encodings: X ∈ R^{B × L × d_model}
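To make the shapes concrete, here is a minimal NumPy sketch of a single encoder-style layer using the notation above. Several choices are assumptions made for brevity, not part of the question: post-norm ordering, ReLU in the FFN, a hidden width `d_ff = 4 * d_model`, layer norm without learnable gain and bias, no dropout or masking, and random weights. All function and parameter names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalizes each token vector over d_model (learnable gain/bias omitted).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, h):
    B, L, d_model = X.shape
    d_k = d_model // h

    def split_heads(T):
        # (B, L, d_model) -> (B, h, L, d_k)
        return T.reshape(B, L, h, d_k).transpose(0, 2, 1, 3)

    Q = split_heads(X @ W_q)
    K = split_heads(X @ W_k)
    V = split_heads(X @ W_v)
    scores = Q @ K.transpose(0, 1, 3, 2) / np.sqrt(d_k)  # (B, h, L, L)
    weights = softmax(scores, axis=-1)                   # each row sums to 1
    context = weights @ V                                # (B, h, L, d_k)
    # Concatenate heads: (B, h, L, d_k) -> (B, L, h * d_k) = (B, L, d_model)
    concat = context.transpose(0, 2, 1, 3).reshape(B, L, d_model)
    return concat @ W_o, weights

def feed_forward(H, W1, b1, W2, b2):
    hidden = np.maximum(0.0, H @ W1 + b1)  # (B, L, d_ff), ReLU assumed
    return hidden @ W2 + b2                # (B, L, d_model)

def encoder_layer(X, p, h):
    attn_out, weights = multi_head_self_attention(
        X, p["W_q"], p["W_k"], p["W_v"], p["W_o"], h)
    H = layer_norm(X + attn_out)  # residual + layer norm
    Y = layer_norm(H + feed_forward(H, p["W1"], p["b1"], p["W2"], p["b2"]))
    return Y, weights

def init_params(rng, d_model, d_ff, scale=0.1):
    # Random weights purely for shape checking; not a trained model.
    def mat(m, n):
        return rng.normal(scale=scale, size=(m, n))
    return {"W_q": mat(d_model, d_model), "W_k": mat(d_model, d_model),
            "W_v": mat(d_model, d_model), "W_o": mat(d_model, d_model),
            "W1": mat(d_model, d_ff), "b1": np.zeros(d_ff),
            "W2": mat(d_ff, d_model), "b2": np.zeros(d_model)}

rng = np.random.default_rng(0)
B, L, d_model, h = 2, 5, 16, 4
d_ff = 4 * d_model
params = init_params(rng, d_model, d_ff)
X = rng.normal(size=(B, L, d_model))
Y, attn = encoder_layer(X, params, h)
print(Y.shape)     # (2, 5, 16)
print(attn.shape)  # (2, 4, 5, 5)
```

The sketch also hints at the FFN rationale. Because the softmax weights are non-negative and sum to 1, each context vector is a weighted average of value vectors, so attention mainly mixes information across positions. The FFN then applies the same nonlinear transformation to every position independently, which adds per-token nonlinearity and capacity that the averaging step alone does not provide.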
