Transformer Architectures And Attention
Asked of: Machine Learning Engineer

What's being tested
Interviewers are probing whether you understand Transformer internals well enough to implement, debug, train, and serve them—not just describe them at a high level. For a Machine Learning Engineer, the bar is shape reasoning, masking correctness, numerical stability, training efficiency, and knowing how architectural choices affect latency, memory, and model quality. Amazon cares because production ML systems often use Transformer-based encoders, decoders, rankers, retrieval models, and LLM-backed services where small implementation errors can silently degrade relevance, safety, or cost. Expect follow-ups that connect math to `PyTorch` tensors, GPU memory, distributed training, and online serving constraints.
Core knowledge
- Scaled dot-product attention computes $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V$, where $M$ is an additive mask, often `0` for allowed positions and `-inf` for blocked positions. The $1/\sqrt{d_k}$ scaling prevents large logits from saturating `softmax`.
- Multi-head attention projects an input $X \in \mathbb{R}^{B \times T \times C}$ into $Q$, $K$, and $V$ tensors shaped roughly $B \times h \times T \times d_{\text{head}}$, with $d_{\text{head}} = C / h$. Heads let the model learn different relation patterns; implementation bugs usually come from incorrect `view`, `transpose`, or non-contiguous tensors.
- Causal masking is mandatory for decoder-only language models. Token $t$ can attend only to positions $\le t$, usually via a lower-triangular mask of shape $T \times T$. Forgetting this causes label leakage: training loss looks excellent, but generation quality fails.
- Decoder-only GPT-style blocks typically contain masked self-attention, a position-wise feed-forward network, residual connections, and normalization. A common pre-norm block is `x = x + attention(LN(x))`, then `x = x + MLP(LN(x))`, which improves deep-network training stability versus post-norm.
- Layer normalization normalizes across the feature dimension for each token independently: $\mathrm{LN}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$, where $\mu$ and $\sigma^2$ are computed over hidden features, and $\gamma$ and $\beta$ are learned per-feature parameters. Unlike batch norm, it does not depend on batch statistics.
- Position information is required because self-attention alone is permutation-equivariant. Common choices include learned absolute embeddings, sinusoidal embeddings, rotary position embeddings (`RoPE`), and relative position bias. For long-context serving, extrapolation behavior and `KV cache` compatibility matter.
- Feed-forward networks in Transformers are usually two linear layers with an activation such as `GELU`, often expanding hidden size by about `4x`: $\mathrm{FFN}(x) = W_2\,\mathrm{GELU}(W_1 x + b_1) + b_2$. In LLMs, gated variants like `SwiGLU` often improve quality but add a third weight matrix, which changes the parameter and memory budget.
- Training-time memory is dominated by activations, attention matrices, and optimizer states. Full attention is $O(T^2)$ in sequence length and $O(B \cdot h \cdot T^2)$ in memory for attention probabilities; for long sequences, consider flash attention, gradient checkpointing, mixed precision, sequence packing, or sparse/local attention.
- Inference-time bottlenecks differ from training. Autoregressive decoding is sequential over generated tokens, but `KV cache` avoids recomputing keys and values for previous tokens, changing per-step attention from recomputing full prefix projections to attending against cached state. Cache memory grows with layers, heads, context length, and batch size.
- Mixture-of-Experts replaces dense MLPs with multiple expert networks and a learned router, often top-1 or top-2 routing. It increases parameter count without activating all parameters per token, but introduces load-balancing losses, capacity factors, token dropping risk, and distributed communication such as `all-to-all`.
- Transformers versus CNNs is about inductive bias and scaling tradeoffs. CNNs encode locality and translation equivariance through shared kernels, making them sample-efficient for images; Transformers learn global interactions with weaker priors but scale well with data, compute, multimodal inputs, and variable-length sequences.
- Evaluation and deployment require more than perplexity. For production MLE work, track offline loss, task metrics like `AUC` or `NDCG`, calibration, latency `p50/p99`, GPU memory, throughput, drift, and online/offline parity. A lower-loss model may be unacceptable if it doubles serving cost or violates latency SLOs.
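The attention and causal-masking points above can be checked with a few lines of array code. The following is a minimal sketch, not a production kernel: it uses NumPy rather than `PyTorch` to stay dependency-light, the function name `causal_attention` is hypothetical, and inputs are assumed to already be split into heads with shape B x H x T x D.

```python
import numpy as np

def causal_attention(q, k, v):
    """Scaled dot-product attention with an additive causal mask.

    q, k, v: arrays shaped (B, H, T, D). Returns (output, probs).
    """
    T, d_k = q.shape[-2], q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)   # (B, H, T, T)
    # Additive mask: 0 on and below the diagonal, -inf above it.
    mask = np.triu(np.full((T, T), -np.inf), k=1)        # broadcasts over B and H
    scores = scores + mask
    # Subtract the row max before exponentiating for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    probs = np.exp(scores)
    probs = probs / probs.sum(axis=-1, keepdims=True)
    return probs @ v, probs

rng = np.random.default_rng(0)
B, H, T, D = 2, 4, 5, 8
q, k, v = (rng.standard_normal((B, H, T, D)) for _ in range(3))
out, probs = causal_attention(q, k, v)

# Leakage check: perturbing the last token's key and value
# must leave the outputs at all earlier positions unchanged.
k2, v2 = k.copy(), v.copy()
k2[:, :, -1] += 10.0
v2[:, :, -1] += 10.0
out2, _ = causal_attention(q, k2, v2)
```

The leakage check at the end is the kind of unit test that catches a missing or mis-broadcast mask: if the mask were dropped, `out` and `out2` would differ at every position, not just the last one.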
Worked example
For “Implement decoder-only GPT-style transformer,” a strong candidate first clarifies the expected scope: “Should I implement a minimal `PyTorch` module with token embeddings, positional embeddings, masked multi-head self-attention, MLP blocks, and logits, or also include training and generation?” They state assumptions early: input token IDs have shape B x T, vocabulary size is V, embedding dimension is C, number of heads divides C, and the model returns logits shaped B x T x V. The answer should be organized around four pillars: embeddings and positions, Transformer block structure, attention tensor shapes and causal mask, and output projection/loss.
The implementation skeleton would include `nn.Embedding` for tokens, either learned positional embeddings or `RoPE`, a stack of blocks using pre-norm residuals, and a final `LayerNorm` plus linear language-model head. The candidate should explicitly explain that q, k, and v are projected from x, reshaped from B x T x C to B x n_heads x T x head_dim, and that attention scores are then B x n_heads x T x T. A specific tradeoff to flag is whether to use a simple registered triangular mask, which is clear and interview-friendly, or an optimized kernel such as `torch.nn.functional.scaled_dot_product_attention` or FlashAttention for production efficiency. They should call out common edge cases: T exceeding the configured context length, incorrect mask broadcasting, dropout behaving differently between train and eval modes, and calling `.view()` on a transposed tensor without first calling `.contiguous()` (or using `.reshape()` instead). A good close is: “If I had more time, I’d add generation with `KV cache`, unit tests for shape and causal leakage, and a small overfit test to verify the implementation learns.”
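The head reshape described above is where shape-correct but wrong code tends to hide. The sketch below, written in NumPy as a stand-in for `PyTorch` tensors, shows one way to split and merge heads and contrasts it with a direct reshape that produces the right shape but the wrong contents. The helper names `split_heads` and `merge_heads` are hypothetical.

```python
import numpy as np

def split_heads(x, n_heads):
    """(B, T, C) -> (B, n_heads, T, head_dim)."""
    B, T, C = x.shape
    assert C % n_heads == 0, "n_heads must divide C"
    head_dim = C // n_heads
    # Split the feature axis first, then move the head axis in front of T.
    return x.reshape(B, T, n_heads, head_dim).transpose(0, 2, 1, 3)

def merge_heads(x):
    """(B, n_heads, T, head_dim) -> (B, T, C)."""
    B, H, T, D = x.shape
    return x.transpose(0, 2, 1, 3).reshape(B, T, H * D)

B, T, C, n_heads = 2, 3, 8, 4
x = np.arange(B * T * C, dtype=np.float64).reshape(B, T, C)

heads = split_heads(x, n_heads)
roundtrip = merge_heads(heads)

# Buggy variant: reshaping straight to the target shape gives the right
# shape but mixes features from different tokens into the same "head".
buggy = x.reshape(B, n_heads, T, C // n_heads)
```

Here `heads[0, 0, 1]` holds the first two features of token 1, while `buggy[0, 0, 1]` holds features 2 and 3 of token 0. NumPy silently copies when reshaping a transposed array; in `PyTorch`, the merge step is where `.view()` raises on the non-contiguous tensor, and `.reshape()` or `.contiguous().view()` handles it.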
A second angle
For “Explain Layer Normalization in Transformers,” the same concept shifts from architecture construction to training stability and numerical behavior. Instead of writing modules end to end, the candidate should derive the equation, identify the normalized dimension, and explain the roles of $\gamma$, $\beta$, and $\epsilon$. The key MLE insight is that LayerNorm behaves consistently across small or variable batch sizes, which is important for sequence models and inference workloads where batch composition changes. A strong answer also distinguishes pre-norm from post-norm: pre-norm improves gradient flow in deeper networks, while post-norm matches the original Transformer but can be harder to optimize at scale. The interviewer may then pivot to why incorrect normalization placement can cause divergence, slow convergence, or degraded perplexity even when tensor shapes are correct.
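To make the equation and the placement question concrete, here is a small NumPy sketch of layer normalization together with the two residual wirings. It assumes the biased variance over the feature axis (the convention `torch.nn.LayerNorm` uses); `toy_sublayer` is a hypothetical stand-in for attention or the MLP, and each wiring would normally have its own learned $\gamma$ and $\beta$.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each token over its last (feature) axis."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)          # biased variance
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def pre_norm_step(x, sublayer, gamma, beta):
    # Normalize the sublayer input; the residual path stays untouched.
    return x + sublayer(layer_norm(x, gamma, beta))

def post_norm_step(x, sublayer, gamma, beta):
    # Normalize after adding the residual, as in the original Transformer.
    return layer_norm(x + sublayer(x), gamma, beta)

rng = np.random.default_rng(0)
B, T, C = 2, 4, 16
x = 3.0 + 5.0 * rng.standard_normal((B, T, C))   # deliberately off-center input
gamma, beta = np.ones(C), np.zeros(C)
y = layer_norm(x, gamma, beta)

W = 0.1 * rng.standard_normal((C, C))
toy_sublayer = lambda h: h @ W                   # hypothetical stand-in sublayer
pre = pre_norm_step(x, toy_sublayer, gamma, beta)
post = post_norm_step(x, toy_sublayer, gamma, beta)
```

With $\gamma = 1$ and $\beta = 0$, each token in `y` has approximately zero mean and unit variance over its features. The pre-norm output keeps the unnormalized residual stream, which is one way to see why gradients pass through the identity path more directly than in the post-norm wiring.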
Common pitfalls
Pitfall: Treating attention as “the model looks at important tokens” without discussing masks, shapes, or scaling.
That answer is too abstract for an MLE interview. A better response names $Q$, $K$, and $V$, gives the attention equation, describes the B x heads x T x T score tensor, and explains how masking prevents future-token leakage in decoder-only models.
Pitfall: Confusing LayerNorm with BatchNorm.
BatchNorm normalizes using batch-level statistics and behaves differently between training and inference, while LayerNorm normalizes per token across hidden features and uses the same computation at inference. This distinction matters for variable-length sequences, small batches, autoregressive decoding, and distributed training.
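A quick experiment makes the distinction tangible. The sketch below, in NumPy and with the learned affine parameters omitted for brevity, changes every sequence in a batch except the first and checks whether the first sequence's normalized output moves. The function `batch_norm_train` is a hypothetical simplification that models only training-mode batch statistics, not running averages.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Statistics per token, over the feature axis.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def batch_norm_train(x, eps=1e-5):
    # Training-mode statistics per feature, over every token in the batch.
    reduce_axes = tuple(range(x.ndim - 1))
    mu = x.mean(axis=reduce_axes, keepdims=True)
    var = x.var(axis=reduce_axes, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3, 8))                     # (B, T, C)
x_other = x.copy()
x_other[1:] = 10.0 * rng.standard_normal((3, 3, 8))    # change all but the first sequence

# Does the first sequence's output depend on its batch neighbors?
ln_same = np.allclose(layer_norm(x)[0], layer_norm(x_other)[0])
bn_same = np.allclose(batch_norm_train(x)[0], batch_norm_train(x_other)[0])
print(ln_same, bn_same)   # True False
```

The layer-norm output for the first sequence is unchanged because its statistics never look outside a single token, while the batch-statistics version shifts as soon as the neighbors change.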
Pitfall: Over-indexing on model architecture while ignoring production constraints.
Saying “use a larger Transformer” is incomplete if you do not mention latency, memory, throughput, quantization, `KV cache`, or online metrics. For an Amazon MLE, a strong answer connects architecture choices to deployability: model size, sequence length, batching strategy, GPU utilization, and monitoring for quality regressions.
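One way to tie architecture to deployability is a back-of-the-envelope `KV cache` estimate. The sketch below assumes standard multi-head attention with one cached key and one cached value vector per head, per layer, per token; the function name and the configuration are hypothetical and chosen for round numbers, not taken from a specific model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len,
                   batch_size, bytes_per_elem=2):
    """Bytes needed to cache keys and values across all layers.

    The leading 2 accounts for storing both K and V.
    bytes_per_elem=2 assumes fp16 or bf16 storage.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * context_len * batch_size * bytes_per_elem)

# Hypothetical configuration: 32 layers, 32 heads of size 128, 4096-token context.
per_seq = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128,
                         context_len=4096, batch_size=1)
per_batch = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128,
                           context_len=4096, batch_size=16)
print(per_seq / 2**30, per_batch / 2**30)   # 2.0 32.0
```

Under these assumptions a single full-length sequence needs 2 GiB of cache and a batch of 16 needs 32 GiB, which is why batching strategy, context length, and the number of key/value heads belong in the same conversation as model quality.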
Connections
Interviewers may pivot from this topic into distributed training, including data parallelism, tensor parallelism, pipeline parallelism, and MoE `all-to-all` communication. They may also ask about model compression techniques such as quantization, distillation, pruning, and low-rank adaptation, or about evaluation for LLM-style systems, including perplexity, task accuracy, hallucination checks, and latency/cost tradeoffs.
Further reading
- Attention Is All You Need: the original Transformer paper defining scaled dot-product attention, multi-head attention, positional encoding, and encoder-decoder blocks.
- Language Models are Unsupervised Multitask Learners: the GPT-2 paper; useful for understanding decoder-only language modeling at scale.
- Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity: a practical introduction to sparse MoE routing, capacity, and scaling tradeoffs.
Practice questions
- Explain ML evaluation, sequence models, and optimizers (Amazon · Machine Learning Engineer · Onsite · medium)
- Contrast CNNs and fully connected networks (Amazon · Machine Learning Engineer · Technical Screen · easy)
- Explain Transformers and MoE in LLMs (Amazon · Machine Learning Engineer · Onsite · medium)
- Implement decoder-only GPT-style transformer (Amazon · Machine Learning Engineer · Onsite · medium)
- Explain LLM architecture, tuning, evaluation (Amazon · Machine Learning Engineer · Technical Screen · medium)
- Explain Layer Normalization in Transformers (Amazon · Machine Learning Engineer · Onsite · medium)
- Explain imbalance, metrics, bias-variance, Transformers vs. CNNs (Amazon · Machine Learning Engineer · Technical Screen · hard)
- Explain LLM fundamentals and trade-offs (Amazon · Machine Learning Engineer · Onsite · hard)
Related concepts
- Transformer Attention And Masking (Machine Learning)
- Generative AI Training, Attention, And Post-Training (ML System Design)
- Transformer Architecture and Attention Internals
- Transformer Training Pipeline Debugging (Machine Learning)
- ML Fundamentals: Backprop, Attention, And RL (Machine Learning)
- LLM Foundations, Embeddings, Prompts, And Fine-Tuning