This question evaluates a candidate's understanding of Transformer architecture components such as multi-head self-attention, residual connections, layer normalization, and the position-wise feed-forward network, as well as the candidate's fluency with tensor shapes and matrix projections.
Provide a clear, step-by-step explanation of a Transformer layer, covering:

- multi-head self-attention
- residual connections
- layer normalization
- the position-wise feed-forward network (FFN)
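As a concrete reference point for the explanation, here is a minimal sketch of one pre-norm encoder layer in PyTorch. The hyperparameter values (d_model=512, n_heads=8, d_ff=2048) and all names are illustrative assumptions, not taken from any particular codebase:

```python
import torch
import torch.nn as nn

class TransformerEncoderLayer(nn.Module):
    """One pre-norm encoder layer: LN -> self-attention -> residual, LN -> FFN -> residual.
    A sketch under assumed hyperparameters, not a definitive implementation."""
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),   # expand: (B, T, d_model) -> (B, T, d_ff)
            nn.ReLU(),
            nn.Linear(d_ff, d_model),   # project back: (B, T, d_ff) -> (B, T, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Multi-head self-attention with a residual connection.
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h)   # Q = K = V = h for self-attention
        x = x + attn_out
        # Position-wise FFN with a residual connection.
        return x + self.ffn(self.ln2(x))

x = torch.randn(2, 10, 512)                # (batch, seq_len, d_model)
print(TransformerEncoderLayer()(x).shape)  # torch.Size([2, 10, 512])
```

Note that the residual additions require the layer's input and output widths to match, which is why the FFN projects back down to d_model.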
Also explain why an FFN is needed after attention and what benefits it provides. Walk through the vector computations with explicit matrix dimensions and shapes, including the query/key/value projections and the resulting per-head dimensions, the scaled dot-product attention scores and softmax, the concatenation of heads with the output projection, and the FFN's expansion and projection matrices.
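A minimal shape walkthrough of those computations, using the commonly cited base-model sizes (d_model = 512, h = 8 heads, d_k = 64, d_ff = 2048); the variable names are illustrative:

```python
import torch

B, T = 2, 10            # batch size, sequence length
d_model, h = 512, 8     # model width, number of heads
d_k = d_model // h      # per-head dimension: 64
d_ff = 4 * d_model      # FFN inner dimension: 2048

x = torch.randn(B, T, d_model)

# Q/K/V projections: (B, T, d_model) @ (d_model, d_model) -> (B, T, d_model),
# then reshape to (B, h, T, d_k) so each head attends independently.
W_q, W_k, W_v = (torch.randn(d_model, d_model) for _ in range(3))
def split_heads(t: torch.Tensor) -> torch.Tensor:
    return t.view(B, T, h, d_k).transpose(1, 2)   # (B, h, T, d_k)
Q, K, V = (split_heads(x @ W) for W in (W_q, W_k, W_v))

# Scaled dot-product attention: scores have shape (B, h, T, T).
scores = (Q @ K.transpose(-2, -1)) / d_k ** 0.5
weights = scores.softmax(dim=-1)                  # each row sums to 1
heads = weights @ V                               # (B, h, T, d_k)

# Concatenate the heads and apply the output projection W_O.
concat = heads.transpose(1, 2).reshape(B, T, d_model)
W_o = torch.randn(d_model, d_model)
attn_out = concat @ W_o                           # (B, T, d_model)

# Position-wise FFN: expand to d_ff, apply a nonlinearity, project back.
W_1, W_2 = torch.randn(d_model, d_ff), torch.randn(d_ff, d_model)
ffn_out = torch.relu(attn_out @ W_1) @ W_2        # (B, T, d_model)
print(attn_out.shape, ffn_out.shape)
```

The walkthrough also suggests one answer to the "why an FFN" part: the attention output is a linear mixture of value vectors across positions, while the FFN applies a nonlinear transformation to each position independently.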
Optionally compare encoder vs. decoder layers and discuss how representations evolve across stacked layers.
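To make the encoder vs. decoder comparison concrete, here is a sketch of a decoder layer under the same assumed hyperparameters; the key differences are the causal mask on self-attention and the extra cross-attention sublayer over the encoder output:

```python
import torch
import torch.nn as nn

class TransformerDecoderLayer(nn.Module):
    """Decoder layer: masked self-attention, cross-attention over the
    encoder output ("memory"), then the FFN. A hedged sketch, not a
    definitive implementation."""
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln1, self.ln2, self.ln3 = (nn.LayerNorm(d_model) for _ in range(3))
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        T = x.size(1)
        # Causal mask: True entries are blocked, so position t only sees <= t.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.ln1(x)
        x = x + self.self_attn(h, h, h, attn_mask=mask)[0]
        # Cross-attention: queries come from the decoder, keys/values from the encoder.
        h = self.ln2(x)
        x = x + self.cross_attn(h, memory, memory)[0]
        return x + self.ffn(self.ln3(x))

dec = TransformerDecoderLayer()
x, memory = torch.randn(2, 7, 512), torch.randn(2, 10, 512)
print(dec(x, memory).shape)   # torch.Size([2, 7, 512])
```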