Transformer Architectures And Attention

What's being tested

Interviewers are probing whether you understand Transformer internals well enough to implement, debug, train, and serve them, not just describe them at a high level. For a Machine Learning Engineer, the bar is shape reasoning, masking correctness, numerical stability, training efficiency, and knowing how architectural choices affect latency, memory, and model quality. Amazon cares because production ML systems often use Transformer-based encoders, decoders, rankers, retrieval models, and LLM-backed services where small implementation errors can silently degrade relevance, safety, or cost. Expect follow-ups that connect math to `PyTorch` tensors, GPU memory, distributed training, and online serving constraints.

Core knowledge

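Scaled dot-product attention computes $\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V$, where $M$ is an additive mask, typically $0$ for allowed positions and $-\infty$ for blocked positions. If the components of a query and key are roughly independent with unit variance, their dot product has variance of about $d_k$, so dividing by $\sqrt{d_k}$ keeps logits near unit scale and prevents softmax from saturating into near-one-hot outputs with vanishing gradients.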
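The formula above can be sketched directly in `PyTorch`. This is a minimal illustration, not a production kernel: the function name, the toy shapes, and the choice to block the last key position are assumptions made for the example.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k, v: (..., T, d_k). mask: additive, broadcastable to (..., T_q, T_k)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (..., T_q, T_k)
    if mask is not None:
        scores = scores + mask                         # 0 = allowed, -inf = blocked
    weights = torch.softmax(scores, dim=-1)            # each row sums to 1
    return weights @ v, weights

torch.manual_seed(0)
q = torch.randn(2, 5, 8)
k = torch.randn(2, 5, 8)
v = torch.randn(2, 5, 8)

# Illustrative mask: no query may attend to the last key position.
mask = torch.zeros(5, 5)
mask[:, -1] = float("-inf")

out, weights = scaled_dot_product_attention(q, k, v, mask)
```

Blocked positions receive exactly zero weight because $\exp(-\infty)=0$. A row that is fully masked would produce NaNs, which is worth checking when combining causal and padding masks.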
Multi-head attention projects an input $X \in \mathbb{R}^{B \times T \times d_{\text{model}}}$ into $Q,K,V$ tensors of shape $B \times h \times T \times d_{\text{head}}$, where typically $d_{\text{head}} = d_{\text{model}}/h$. Heads let the model learn different relation patterns; implementation bugs usually come from an incorrect `view`/`transpose` order or from calling `.view()` on a non-contiguous tensor.
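A short sketch of the head split and merge, including the contiguity pitfall. The fused `qkv_proj` layer, the helper `split_heads`, and the random `attn_out` stand-in are hypothetical names chosen for this example.

```python
import torch
import torch.nn as nn

B, T, d_model, n_heads = 2, 6, 32, 4
d_head = d_model // n_heads

x = torch.randn(B, T, d_model)
qkv_proj = nn.Linear(d_model, 3 * d_model)  # hypothetical fused Q/K/V projection

q, k, v = qkv_proj(x).chunk(3, dim=-1)      # each (B, T, d_model)

def split_heads(t):
    # (B, T, d_model) -> (B, T, h, d_head) -> (B, h, T, d_head)
    return t.view(B, T, n_heads, d_head).transpose(1, 2)

q, k, v = split_heads(q), split_heads(k), split_heads(v)

# Stand-in for the attention output, which has shape (B, h, T, d_head).
attn_out = torch.randn(B, n_heads, T, d_head)
transposed = attn_out.transpose(1, 2)       # (B, T, h, d_head), non-contiguous

try:
    transposed.view(B, T, d_model)          # raises: strides are incompatible
    view_failed = False
except RuntimeError:
    view_failed = True

# Either of these merges heads safely.
merged = transposed.reshape(B, T, d_model)
merged_alt = transposed.contiguous().view(B, T, d_model)
```

Swapping the `view` and `transpose` in `split_heads` would still produce a tensor of the right shape but would mix features across positions, which is why shape assertions alone do not catch this class of bug.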
Causal masking is mandatory for decoder-only language models. Token $t$ can attend only to positions $\leq t$, usually via a lower-triangular mask of shape $T \times T$. Forgetting this causes label leakage: training loss looks excellent because each position can see the token it is supposed to predict, but generation quality collapses because future tokens are unavailable at inference time.
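One way to test for leakage is to perturb late tokens and confirm that earlier outputs do not move. The sketch below uses a deliberately simplified single-head attention with $Q=K=V=x$; the function name and the perturbation size are assumptions for illustration.

```python
import math
import torch

def causal_self_attention(x):
    """Single-head self-attention with Q = K = V = x, for illustration only."""
    T, d = x.size(-2), x.size(-1)
    scores = x @ x.transpose(-2, -1) / math.sqrt(d)       # (B, T, T)
    allowed = torch.tril(torch.ones(T, T)).bool()         # lower triangle = visible
    scores = scores.masked_fill(~allowed, float("-inf"))
    return torch.softmax(scores, dim=-1) @ x

torch.manual_seed(0)
x = torch.randn(1, 6, 8)
out = causal_self_attention(x)

# Leakage test: change only positions 4 and 5, then compare positions 0..3.
x_perturbed = x.clone()
x_perturbed[:, 4:] += 10.0
out_perturbed = causal_self_attention(x_perturbed)

prefix_unchanged = torch.allclose(out[:, :4], out_perturbed[:, :4])
suffix_changed = not torch.allclose(out[:, 4:], out_perturbed[:, 4:])
```

The same perturbation test works against a full model and is a cheap unit test to mention in an interview.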
Decoder-only GPT-style blocks typically contain masked self-attention, a position-wise feed-forward network, residual connections, and normalization. A common pre-norm block is `x = x + attention(LN(x))`, then `x = x + MLP(LN(x))`. Post-norm instead applies `LN(x + sublayer(x))`; keeping the residual path free of normalization, as pre-norm does, improves training stability in deep networks.
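A compact sketch of the pre-norm block using `nn.MultiheadAttention` with a boolean causal mask. The class name `PreNormBlock` and the default sizes are hypothetical; dropout is omitted to keep the example short.

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    def __init__(self, d_model=32, n_heads=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        T = x.size(1)
        # For nn.MultiheadAttention, True in a boolean mask means "may NOT attend".
        blocked = torch.ones(T, T, device=x.device).triu(diagonal=1).bool()
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=blocked, need_weights=False)
        x = x + attn_out                   # residual around attention
        x = x + self.mlp(self.ln2(x))      # residual around the MLP
        return x

torch.manual_seed(0)
block = PreNormBlock().eval()
x = torch.randn(2, 5, 32)
y = block(x)
```

Note the mask polarity: here `True` blocks a position, which is the opposite of a lower-triangular "allowed" mask. Mixing the two conventions is a common source of silent leakage.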
Layer normalization normalizes across the feature dimension for each token independently: $\mathrm{LN}(x)=\gamma \frac{x-\mu}{\sqrt{\sigma^2+\epsilon}}+\beta$ where $\mu,\sigma^2$ are computed over hidden features, and $\gamma,\beta$ are learned per-feature parameters. Unlike batch norm, it does not depend on batch statistics.
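The equation can be checked against `torch.nn.functional.layer_norm`. This is a sketch for verification purposes; `layer_norm_manual` is a hypothetical helper and the random $\gamma,\beta$ stand in for learned parameters.

```python
import torch
import torch.nn.functional as F

def layer_norm_manual(x, gamma, beta, eps=1e-5):
    # Statistics are taken over the last (feature) dimension, per token.
    mu = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)  # biased variance
    return gamma * (x - mu) / torch.sqrt(var + eps) + beta

torch.manual_seed(0)
x = torch.randn(2, 5, 16)      # (B, T, hidden)
gamma = torch.randn(16)
beta = torch.randn(16)

manual = layer_norm_manual(x, gamma, beta)
reference = F.layer_norm(x, (16,), weight=gamma, bias=beta, eps=1e-5)

# Batch independence: normalizing one sequence alone gives the same values.
single = layer_norm_manual(x[:1], gamma, beta)
```

The variance uses the biased estimator (dividing by the number of features), which is the detail most often gotten wrong when reimplementing this by hand.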
Position information is required because self-attention alone is permutation-equivariant. Common choices include learned absolute embeddings, sinusoidal embeddings, rotary position embeddings (`RoPE`), and relative position bias. For long-context serving, extrapolation behavior and `KV cache` compatibility matter.
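As one concrete option from the list above, sinusoidal embeddings can be built with $\mathrm{PE}(p, 2i)=\sin(p/10000^{2i/d})$ and $\mathrm{PE}(p, 2i+1)=\cos(p/10000^{2i/d})$. The sketch assumes an even `d_model`; the function name is hypothetical.

```python
import math
import torch

def sinusoidal_positions(T, d_model):
    """Return a (T, d_model) table of fixed position encodings (d_model even)."""
    pos = torch.arange(T, dtype=torch.float32).unsqueeze(1)       # (T, 1)
    even = torch.arange(0, d_model, 2, dtype=torch.float32)       # 0, 2, 4, ...
    inv_freq = torch.exp(-math.log(10000.0) * even / d_model)     # 1 / 10000^(2i/d)
    pe = torch.zeros(T, d_model)
    pe[:, 0::2] = torch.sin(pos * inv_freq)   # even feature indices
    pe[:, 1::2] = torch.cos(pos * inv_freq)   # odd feature indices
    return pe

pe = sinusoidal_positions(T=8, d_model=16)

# Typical use: add to token embeddings of shape (B, T, d_model) via broadcasting.
tokens = torch.zeros(2, 8, 16)
with_positions = tokens + pe
```

Because the table is a fixed function of position, it can be computed for any length at inference time, although quality beyond the training length is not guaranteed.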
Feed-forward networks in Transformers are usually two linear layers with an activation such as `GELU`, often expanding hidden size by about 4x: $d_{\text{model}} \rightarrow 4d_{\text{model}} \rightarrow d_{\text{model}}$. In LLMs, gated variants like `SwiGLU` often improve quality but use three weight matrices instead of two, so the hidden width is commonly reduced (for example to about two-thirds of $4d_{\text{model}}$) to keep parameter count and memory comparable.
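A quick parameter count makes the dense-versus-gated tradeoff concrete. The helper below ignores biases, and `d_model = 4096` is an illustrative size rather than a reference to any specific model.

```python
def ffn_params(d_model, d_ff, gated=False):
    """Weight count of one feed-forward block, ignoring biases.

    Dense: up (d_model x d_ff) + down (d_ff x d_model).
    Gated (SwiGLU-style): one extra gate projection the same size as `up`.
    """
    n_matrices = 3 if gated else 2
    return n_matrices * d_model * d_ff

d_model = 4096
dense = ffn_params(d_model, 4 * d_model)                      # 134_217_728
gated_same_width = ffn_params(d_model, 4 * d_model, gated=True)  # 201_326_592
gated_reduced = ffn_params(d_model, (8 * d_model) // 3, gated=True)  # 134_209_536
```

At the same hidden width the gated block has 1.5x the weights; shrinking the width to roughly $\tfrac{8}{3} d_{\text{model}}$ brings it back to about the dense count.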
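Training-time memory is dominated by parameters, gradients, optimizer states, and saved activations, including attention matrices. Full attention costs $O(T^2 d)$ compute per layer in sequence length $T$, and naively materializing the attention probabilities takes $O(B h T^2)$ memory per layer; for long sequences, consider FlashAttention (which avoids materializing the full $T \times T$ matrix), gradient checkpointing, mixed precision, sequence packing, or sparse/local attention.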
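A back-of-the-envelope calculation shows how quickly the quadratic term grows. The batch size, head count, and 2-byte elements below are assumed values for illustration.

```python
def attention_probs_bytes(batch, heads, seq_len, bytes_per_elem=2):
    """Bytes to materialize one layer's (B, h, T, T) attention probabilities."""
    return batch * heads * seq_len * seq_len * bytes_per_elem

GiB = 1024 ** 3

# Assumed setup: batch 8, 32 heads, fp16/bf16 elements.
sizes_gib = {
    T: attention_probs_bytes(batch=8, heads=32, seq_len=T) / GiB
    for T in (2048, 8192, 32768)
}
# sizes_gib == {2048: 2.0, 8192: 32.0, 32768: 512.0}
```

Quadrupling the sequence length multiplies this term by 16, which is why kernels that never store the full matrix matter far more at long context than at short context.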
Inference-time bottlenecks differ from training. Autoregressive decoding is sequential over generated tokens, but `KV cache` avoids recomputing keys and values for previous tokens: each step projects only the new token and attends against cached state instead of recomputing projections for the whole prefix. Cache memory grows linearly with the number of layers, key/value heads, head dimension, context length, and batch size.
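The cache size can be estimated with simple arithmetic. The model dimensions below (32 layers, 32 key/value heads of dimension 128, 2-byte elements) are hypothetical and chosen only to give round numbers.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, batch, bytes_per_elem=2):
    # Factor of 2: one tensor for keys and one for values in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * batch * bytes_per_elem

MiB = 1024 ** 2
GiB = 1024 ** 3

per_token = kv_cache_bytes(32, 32, 128, context_len=1, batch=1)      # 0.5 MiB
full_batch = kv_cache_bytes(32, 32, 128, context_len=4096, batch=8)  # 16 GiB

# Sharing key/value heads across query heads shrinks the cache proportionally.
shared_kv = kv_cache_bytes(32, 8, 128, context_len=4096, batch=8)    # 4 GiB
```

Numbers like these explain why serving capacity is often limited by cache memory rather than by weights, and why reducing the number of key/value heads or quantizing the cache is attractive.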
Mixture-of-Experts replaces dense MLPs with multiple expert networks and a learned router, often top-1 or top-2 routing. It increases parameter count without activating all parameters per token, but introduces load-balancing losses, capacity factors, token dropping risk, and distributed communication such as `all-to-all`.
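A single-device sketch of top-k routing, assuming softmax router probabilities renormalized over the chosen experts. It deliberately omits capacity limits, load-balancing losses, and `all-to-all` communication; `TinyMoE` and its sizes are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, d_model=16, d_ff=32, n_experts=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):
        tokens = x.reshape(-1, x.size(-1))                  # (N, d_model)
        probs = F.softmax(self.router(tokens), dim=-1)      # (N, n_experts)
        gate, expert_idx = probs.topk(self.top_k, dim=-1)   # (N, k) each
        gate = gate / gate.sum(dim=-1, keepdim=True)        # renormalize over chosen experts
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            # Rows routed to expert e, and which of their k slots chose it.
            token_ids, slot = (expert_idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            weighted = gate[token_ids, slot].unsqueeze(-1) * expert(tokens[token_ids])
            out[token_ids] += weighted
        return out.reshape_as(x), expert_idx

torch.manual_seed(0)
moe = TinyMoE()
x = torch.randn(2, 5, 16)
y, expert_idx = moe(x)

# Tokens per expert: a skewed histogram is what load-balancing losses penalize.
load = torch.bincount(expert_idx.flatten(), minlength=4)
```

Only `top_k` of the experts run for each token, so compute per token stays roughly constant as the number of experts grows, while total parameters scale with it.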
Transformers versus CNNs is about inductive bias and scaling tradeoffs. CNNs encode locality and translation equivariance through shared kernels, making them sample-efficient for images; Transformers learn global interactions with weaker priors but scale well with data, compute, multimodal inputs, and variable-length sequences.
Evaluation and deployment require more than perplexity. For production MLE work, track offline loss, task metrics like `AUC` or `NDCG`, calibration, latency `p50/p99`, GPU memory, throughput, drift, and online/offline parity. A lower-loss model may be unacceptable if it doubles serving cost or violates latency SLOs.

Worked example

For “Implement decoder-only GPT-style transformer,” a strong candidate first clarifies the expected scope: “Should I implement a minimal `PyTorch` module with token embeddings, positional embeddings, masked multi-head self-attention, MLP blocks, and logits, or also include training and generation?” They state assumptions early: input token IDs have shape `(B, T)`, vocabulary size is `V`, embedding dimension is `C`, the number of heads divides `C`, and the model returns logits of shape `(B, T, V)`. The answer should be organized around four pillars: embeddings and positions, Transformer block structure, attention tensor shapes and causal mask, and output projection/loss.

The implementation skeleton would include `nn.Embedding` for tokens, either learned positional embeddings or `RoPE`, a stack of blocks using pre-norm residuals, and a final `LayerNorm` plus linear language-model head. The candidate should explicitly explain that `q`, `k`, and `v` are projected from `x` and reshaped from `(B, T, C)` to `(B, n_heads, T, head_dim)`, so the attention scores have shape `(B, n_heads, T, T)`. A specific tradeoff to flag is whether to use a simple registered triangular mask, which is clear and interview-friendly, or optimized kernels like `torch.nn.functional.scaled_dot_product_attention` / FlashAttention for production efficiency. They should call out common edge cases: `T` exceeding the configured context length, incorrect mask broadcasting, dropout that stays active at evaluation because the module was never switched to eval mode, and calling `.view()` after `transpose` without first calling `.contiguous()` (or using `.reshape()` instead). A good close is: “If I had more time, I’d add generation with `KV cache`, unit tests for shape and causal leakage, and a small overfit test to verify the implementation learns.”
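A minimal sketch of the skeleton described above, using learned positional embeddings and a registered triangular mask. Class names (`TinyGPT`, `Block`, `CausalSelfAttention`), default sizes, and the omission of dropout and weight initialization are simplifying assumptions for an interview-sized answer.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, C, n_heads, block_size):
        super().__init__()
        assert C % n_heads == 0, "number of heads must divide C"
        self.n_heads = n_heads
        self.qkv = nn.Linear(C, 3 * C)
        self.proj = nn.Linear(C, C)
        allowed = torch.tril(torch.ones(block_size, block_size)).bool()
        self.register_buffer("allowed", allowed, persistent=False)

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (B, T, C) -> (B, n_heads, T, head_dim)
        q, k, v = (t.view(B, T, self.n_heads, -1).transpose(1, 2) for t in (q, k, v))
        att = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))   # (B, n_heads, T, T)
        att = att.masked_fill(~self.allowed[:T, :T], float("-inf"))
        y = F.softmax(att, dim=-1) @ v                          # (B, n_heads, T, head_dim)
        return self.proj(y.transpose(1, 2).reshape(B, T, C))    # reshape handles non-contiguity

class Block(nn.Module):
    def __init__(self, C, n_heads, block_size):
        super().__init__()
        self.ln1 = nn.LayerNorm(C)
        self.ln2 = nn.LayerNorm(C)
        self.attn = CausalSelfAttention(C, n_heads, block_size)
        self.mlp = nn.Sequential(nn.Linear(C, 4 * C), nn.GELU(), nn.Linear(4 * C, C))

    def forward(self, x):
        x = x + self.attn(self.ln1(x))       # pre-norm residual
        return x + self.mlp(self.ln2(x))

class TinyGPT(nn.Module):
    def __init__(self, V, C=32, n_heads=4, n_layers=2, block_size=16):
        super().__init__()
        self.block_size = block_size
        self.tok_emb = nn.Embedding(V, C)
        self.pos_emb = nn.Embedding(block_size, C)
        self.blocks = nn.ModuleList([Block(C, n_heads, block_size) for _ in range(n_layers)])
        self.ln_f = nn.LayerNorm(C)
        self.lm_head = nn.Linear(C, V, bias=False)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        if T > self.block_size:
            raise ValueError(f"sequence length {T} exceeds block_size {self.block_size}")
        pos = torch.arange(T, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)               # (B, T, C)
        for block in self.blocks:
            x = block(x)
        logits = self.lm_head(self.ln_f(x))                     # (B, T, V)
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
        return logits, loss

torch.manual_seed(0)
model = TinyGPT(V=50)
tokens = torch.randint(0, 50, (2, 9))
logits, loss = model(tokens[:, :-1], tokens[:, 1:])   # inputs and next-token targets
```

The targets are the inputs shifted by one position, so `targets` is a non-contiguous slice; using `.reshape(-1)` rather than `.view(-1)` in the loss avoids the same contiguity error discussed above.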

A second angle

For “Explain Layer Normalization in Transformers,” the same concept shifts from architecture construction to training stability and numerical behavior. Instead of writing modules end to end, the candidate should derive the equation, identify the normalized dimension, and explain the roles of $\gamma$, $\beta$, and $\epsilon$. The key MLE insight is that LayerNorm behaves consistently across small or variable batch sizes, which is important for sequence models and inference workloads where batch composition changes. A strong answer also distinguishes pre-norm from post-norm: pre-norm improves gradient flow in deeper networks, while post-norm matches the original Transformer but can be harder to optimize at scale. The interviewer may then pivot to why incorrect normalization placement can cause divergence, slow convergence, or degraded perplexity even when tensor shapes are correct.

Common pitfalls

Pitfall: Treating attention as “the model looks at important tokens” without discussing masks, shapes, or scaling.

That answer is too abstract for an MLE interview. A better response names $Q,K,V$, gives the attention equation, describes the `(B, heads, T, T)` score tensor, and explains how masking prevents future-token leakage in decoder-only models.

Pitfall: Confusing LayerNorm with BatchNorm.

BatchNorm normalizes using batch-level statistics and behaves differently between training and inference, while LayerNorm normalizes per token across hidden features and uses the same computation at inference. This distinction matters for variable-length sequences, small batches, autoregressive decoding, and distributed training.
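The train/inference difference is easy to demonstrate. In this sketch the input is deliberately shifted and scaled so that batch statistics differ from BatchNorm's initial running statistics; the sizes are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, 16) * 3.0 + 5.0   # (batch, features), not standardized

bn = nn.BatchNorm1d(16)
ln = nn.LayerNorm(16)

# Training mode: BatchNorm uses the statistics of this batch.
bn.train()
ln.train()
bn_train, ln_train = bn(x), ln(x)

# Eval mode: BatchNorm switches to its running estimates; LayerNorm is unchanged.
bn.eval()
ln.eval()
bn_eval, ln_eval = bn(x), ln(x)

bn_differs = not torch.allclose(bn_train, bn_eval, atol=1e-3)
ln_same = torch.allclose(ln_train, ln_eval)

# LayerNorm on a single example matches its row in the batched result.
ln_single_matches = torch.allclose(ln(x[:1]), ln_eval[:1])
```

This is the behavior that makes LayerNorm safe for autoregressive decoding with batch size 1, where batch statistics would be meaningless.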

Pitfall: Over-indexing on model architecture while ignoring production constraints.

Saying “use a larger Transformer” is incomplete if you do not mention latency, memory, throughput, quantization, `KV cache`, or online metrics. For an Amazon MLE, a strong answer connects architecture choices to deployability: model size, sequence length, batching strategy, GPU utilization, and monitoring for quality regressions.

Connections

Interviewers may pivot from this topic into distributed training, including data parallelism, tensor parallelism, pipeline parallelism, and MoE `all-to-all` communication. They may also ask about model compression techniques such as quantization, distillation, pruning, and low-rank adaptation, or about evaluation for LLM-style systems, including perplexity, task accuracy, hallucination checks, and latency/cost tradeoffs.
