Transformer Attention and Variants
Asked of: Machine Learning Engineer
What's being tested
Candidates must demonstrate practical mastery of modern Transformer building blocks (self-attention mechanics, positional handling, and computational cost), knowledge of training and fine-tuning strategies (optimizer effects, parameter-efficient adapters such as LoRA), and familiarity with vision-specific adaptations like ViT. Interviewers probe your ability to explain formulas, quantify tradeoffs (compute, memory, generalization), and choose architectures/optimizers for production constraints relevant to content ranking, personalization, and multimodal features at Netflix.
Core knowledge
- Scaled dot-product attention: attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. For sequence length N and head dimension d, compute is O(N^2·d) and the attention score matrix takes O(N^2) memory per head, making long contexts expensive.
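- Queries/Keys/Values: Q and K produce similarity logits and V supplies the content that the softmax weights average; dividing by sqrt(d_k) keeps the logit variance roughly constant as d_k grows, so the softmax does not saturate and gradients stay usable; separate projection matrices for Q, K, and V allow flexible information routing across heads.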
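- Multi-head attention: parallel heads allow subspace specialization; concatenation + linear projection mixes heads. With d_model fixed, the standard design sets d_head = d_model / h, so adding heads leaves the parameter count unchanged and narrows each head; more heads can improve representational richness, but each head has less capacity and there may be extra communication overhead.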
- Positional encodings: fixed sinusoidal vs learned absolute vs relative/rotary (RoPE); relative and rotary schemes encode token offsets rather than absolute indices, which tends to help extrapolation to longer contexts and fits sparse or local attention patterns.
- Tokenization families: Byte-Pair Encoding (BPE), WordPiece, and Unigram (as implemented in SentencePiece) trade off vocabulary size, unknown-token behavior, and encoded length; byte-level BPE avoids OOVs but can increase token counts, and therefore latency.
- Vision Transformer (ViT): images split into fixed-size patches, linear patch embedding + positional encoding + standard Transformer encoder; requires large-scale pretraining or hybrid conv backbones for strong performance on small datasets.
- LoRA (Low-Rank Adaptation): fine-tune by freezing the pretrained weight W0 (d × k) and adding a trainable low-rank product, so W' = W0 + BA with B of shape d × r and A of shape r × k; trains far fewer params, reduces checkpoint size, and enables many-task adapters without copying base weights. Typical rank r ≪ min(d, k).
- Optimizers: Adam vs SGD: Adam uses adaptive moment estimates (fast initial loss drop); SGD with momentum can yield better final generalization in some vision/ranking settings; use AdamW (decoupled weight decay) to avoid the coupling between L2 regularization and Adam's adaptive scaling. Learning-rate schedules and warmup are critical for both.
- Scaling & efficiency tricks: gradient checkpointing reduces memory at the cost of extra compute; mixed precision (float16/bfloat16) accelerates throughput; distributed DDP with gradient accumulation compensates for small per-GPU batch sizes.
- Sparse/linear attention variants: Longformer, BigBird, Reformer, and Performer trade exact full attention for sparse patterns, hashing, or kernel approximations, reducing O(N^2) to O(N·log N) (Reformer) or O(N) (Longformer, BigBird, Performer). Each has different guarantees for expressivity and collision/coverage patterns.
- Evaluation & deployment considerations: measure perplexity and downstream ranking metrics (e.g., NDCG), and monitor inference p99 latency, memory footprint, and A/B test effect on DAU-level engagement when selecting a model or fine-tuning approach.
- Layer selection for fine-tuning: adapters/LoRA are applied to attention projection matrices (e.g., W_q, W_k, W_v, W_o) or MLP layers; earlier layers capture low-level features and later layers task-specific signals, so freeze-vs-adapt decisions affect stability and storage.
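The scaled dot-product formula from the first bullet can be sketched in a few lines. This is a minimal single-head illustration in plain Python, not a production kernel; the function names and the toy Q, K, V values are assumptions made for the example.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention on nested lists.

    Q: N x d_k, K: N x d_k, V: N x d_v.
    Returns (output, weights); weights is the N x N attention matrix.
    """
    d_k = len(Q[0])
    scale = 1.0 / math.sqrt(d_k)
    # N x N logits: this matrix is where the quadratic cost comes from.
    logits = [[scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K] for q in Q]
    weights = [softmax(row) for row in logits]
    # Each output row is a weighted average of the rows of V.
    d_v = len(V[0])
    output = [[sum(w * v[j] for w, v in zip(w_row, V)) for j in range(d_v)]
              for w_row in weights]
    return output, weights

# Toy input (hypothetical values): 3 tokens, d_k = d_v = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, attn = scaled_dot_product_attention(Q, K, V)
```

Each row of attn sums to 1, and each row of out is a convex combination of the rows of V. Multi-head attention would run this once per head on separately projected Q, K, V, then concatenate the outputs and apply a final linear projection.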
Worked example — Explain self-attention, LoRA, Adam vs SGD, ViT
First 30 seconds: clarify scope (training from scratch vs fine-tuning; target modality: vision/text; dataset size and latency constraints). State assumptions: pre-trained Transformer backbone available, target is fine-tuning for personalization with latency <200ms.
Skeleton of response:
- Explain self-attention math (QK^T / sqrt(d_k) → softmax → V) and O(N^2) cost; mention multi-head utility and positional encodings choice (absolute vs relative).
- Describe LoRA: parameterize weight update as low-rank BA, show rank r tradeoff and which matrices to target (usually attention projections or MLP).
- Compare Adam vs SGD: convergence speed, hyperparameter sensitivity, and generalization; recommend AdamW with warmup for large LMs and as the default for Transformers including ViT (the ViT paper pre-trains with Adam and fine-tunes with SGD with momentum); SGD+momentum remains a common choice for convolutional vision models.
- Sketch ViT specifics: patching, positional embedding, need for large pretraining or hybrid conv stem, and common augmentations.
Flag one tradeoff: using LoRA reduces stored checkpoint size and enables many task adapters, but may underperform full fine-tuning if r is too small or the task requires large representational shifts. Close by proposing an empirical plan: run a small hyperparameter sweep over rank r, learning rate, and which layers to adapt; if there is more time, add ablations comparing adapter types and measure inference latency and A/B metrics.
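To make the rank tradeoff concrete before running a sweep, it can help to count trainable parameters per adapted matrix. The sketch below is illustrative only: the 4096 x 4096 projection size, the rank grid, and the `scale` argument (standing in for the scaling factor applied to BA) are assumptions, not values from any particular model.

```python
def lora_param_counts(d_out, d_in, r):
    """Return (full, lora) trainable-parameter counts for one d_out x d_in weight."""
    full = d_out * d_in
    lora = r * (d_out + d_in)  # B is d_out x r, A is r x d_in
    return full, lora

def matmul(X, Y):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def merge_lora(W0, B, A, scale=1.0):
    """Fold the adapter into the base weight: W' = W0 + scale * (B @ A)."""
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W0, BA)]

# Hypothetical rank sweep for a single 4096 x 4096 projection.
for r in (4, 8, 16, 64):
    full, lora = lora_param_counts(4096, 4096, r)
    print(f"r={r:>2}: {lora:,} trainable params ({100 * lora / full:.2f}% of {full:,})")

# Tiny merge example with rank 1.
W0 = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]    # 2 x 1
A = [[0.5, -0.5]]     # 1 x 2
W_merged = merge_lora(W0, B, A)
```

Because the update can be merged into W0 ahead of serving, a merged adapter adds no extra matrix multiply at inference time, which matters under a latency budget like the one assumed above.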
A second angle — Explain tokenization and Transformer variants
Frame clarifying questions: is the input multilingual or byte-level, and are long contexts required? Emphasize that tokenization affects embedding matrix size and sequence length: byte-level tokenization avoids OOVs but increases N, which is costly under O(N^2) attention. For long-context needs, present sparse/linear attention families (e.g., BigBird for block-sparse attention with global tokens, Performer for kernelized linear attention) and discuss their theoretical tradeoffs: BigBird combines local-window, random, and global attention, and its authors show this pattern preserves universal-approximation properties of full attention; Performer approximates the softmax kernel with random feature maps, reducing memory but introducing approximation error. For vision, patch size in ViT trades spatial resolution against sequence length. Tie tokenization and architecture choices to downstream metrics (latency, memory, perplexity, or NDCG).
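The link between patch size (or tokenization) and attention cost is easy to quantify with back-of-the-envelope arithmetic. The helper names below are hypothetical, and the 224-pixel image, head counts, and 2-byte elements are assumed values chosen for illustration.

```python
def vit_sequence_length(image_size, patch_size, class_token=True):
    """Token count for a square image cut into non-overlapping square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side + (1 if class_token else 0)

def attention_score_bytes(seq_len, num_heads, bytes_per_element=2):
    """Bytes needed to materialize the N x N score matrix for every head
    of one layer, for a single sequence (naive implementation)."""
    return seq_len * seq_len * num_heads * bytes_per_element

# Smaller patches mean a longer sequence and quadratically larger score matrices.
for patch in (32, 16, 8):
    n = vit_sequence_length(224, patch)
    mib = attention_score_bytes(n, num_heads=12) / 2**20
    print(f"patch={patch:>2}: N={n:>4}, score matrices ~{mib:.2f} MiB per layer")

# Long text context: 32,768 tokens, 16 heads, 2-byte elements.
long_context_gib = attention_score_bytes(32_768, num_heads=16) / 2**30
```

Under these assumptions the 32,768-token case needs 32 GiB for one layer's score matrices alone. Memory-efficient attention kernels avoid materializing the full matrix, so this is best read as the cost of a naive implementation, but the quadratic growth in compute remains.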
Common pitfalls
Pitfall: Conflating Adam with AdamW. Saying "Adam includes weight decay" without noting decoupled decay (AdamW) leads to incorrect recommendations; specify optimizer variant and decay implementation.
Pitfall: Ignoring sequence-length scaling. Recommending vanilla Transformer for very long contexts without acknowledging O(N^2) memory will get called out; quantify N where quadratic cost becomes infeasible for your infra.
Pitfall: High-level prose over empirical plan. Saying "use LoRA" without proposing rank choices, which layers to adapt, or how to validate (holdout metric and latency) makes answers shallow; always propose a small experiment and metrics.
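For the Adam vs AdamW pitfall above, a single scalar update step shows why the distinction matters. This is a simplified sketch of the two update rules, not any library's implementation; the learning rate, decay values, and the zero data gradient are assumptions picked to isolate the effect of weight decay.

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0, decoupled=False):
    """One scalar Adam update.

    decoupled=False: decay is added to the gradient (L2 regularization),
                     so it passes through Adam's adaptive normalization.
    decoupled=True:  decay is applied directly to the weight (AdamW-style).
    """
    if not decoupled:
        grad = grad + weight_decay * theta
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)   # bias correction
    v_hat = v / (1 - beta2 ** t)
    decay = lr * weight_decay * theta if decoupled else 0.0
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps) - decay
    return theta, m, v

# Toy case: weight 1.0, zero data gradient, first step, two decay strengths.
results = {}
for wd in (0.001, 0.1):
    l2_theta, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, lr=0.01,
                               weight_decay=wd, decoupled=False)
    w_theta, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, lr=0.01,
                              weight_decay=wd, decoupled=True)
    results[wd] = (l2_theta, w_theta)
    print(f"wd={wd}: L2-in-gradient -> {l2_theta:.6f}, decoupled -> {w_theta:.6f}")
```

In this toy case the L2 variant moves the weight by roughly lr for both decay strengths, because the adaptive denominator divides the decay term out, while the decoupled variant shrinks the weight in proportion to the decay coefficient. That is the behavior an interviewer expects you to name when you recommend AdamW.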
Connections
Interviewers might pivot to retrieval-augmented generation (RAG) for long-context problems or to model serving topics like quantization and TensorRT to meet latency/p99 SLAs. They may also ask about pretraining objectives (masked vs autoregressive) because optimizer and tokenization choices interact with objective design.
Further reading
- Attention Is All You Need (Vaswani et al., 2017): seminal Transformer self-attention formulation.
- Vision Transformer (Dosovitskiy et al., 2020): ViT architecture, patching, and data-scaling behavior.
- LoRA: Low-Rank Adaptation (Hu et al., 2021): practical parameter-efficient fine-tuning method and empirical guidance.