Transformer Attention and Variants

What's being tested

Candidates must demonstrate practical mastery of modern Transformer building blocks (self-attention mechanics, positional handling, and computational cost), of training and fine-tuning strategies (optimizer effects, parameter-efficient adapters such as LoRA), and of vision-specific adaptations like ViT. Interviewers probe your ability to explain formulas, quantify tradeoffs (compute, memory, generalization), and choose architectures/optimizers for production constraints relevant to content ranking, personalization, and multimodal features at Netflix.

Core knowledge

Scaled dot-product attention: attention(Q,K,V)=softmax(QK^T / sqrt(d_k)) V; for sequence length N and head dim d, compute is O(N^2·d) and the attention-score matrix takes O(N^2) memory per head (plus O(N·d) for Q, K, and V), making long contexts expensive.
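Queries/Keys/Values: Q and K produce similarity logits, and the softmax over those logits weights the rows of V; dividing by sqrt(d_k) keeps the logit variance roughly independent of d_k, so the softmax does not saturate and gradients stay usable; separate projection matrices for Q, K, and V allow flexible information routing across heads.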
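Multi-head attention: parallel heads allow subspace specialization; concatenation + linear projection mixes heads; in the standard design d_model is fixed and d_head = d_model / h, so adding heads leaves the parameter count unchanged but shrinks each head's dimension, which can improve representational richness only until the per-head dimension becomes too small.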
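Positional encodings: fixed sinusoidal vs learned absolute vs relative schemes, including rotary embeddings (RoPE); relative and rotary schemes depend on token offsets rather than absolute positions, so they generally extrapolate to longer contexts better than learned absolute embeddings and pair naturally with local or sparse attention patterns, though extrapolation far beyond the training length often still needs extra scaling or fine-tuning.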
Tokenization families: Byte-Pair Encoding (BPE), WordPiece, and Unigram (commonly used through the SentencePiece library, which also implements BPE) trade vocabulary size, unknown-token behavior, and encoding length; byte-level BPE avoids OOVs but can increase token counts, and therefore latency.
Vision Transformer (ViT): images split into fixed-size patches, linear patch embedding + positional encoding + standard Transformer encoder; requires large-scale pretraining or hybrid conv backbones for strong performance on small datasets.
LoRA (Low-Rank Adaptation): fine-tune by adding low-rank matrices A,B so W' = W0 + BA; trains far fewer params, reduces checkpoint size, and enables many-task adapters without copying base weights. Typical rank r ≪ d.
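Optimizers: Adam vs SGD: Adam uses adaptive first- and second-moment estimates (fast initial loss drop); SGD with momentum is often reported to reach better final generalization for some vision/ranking models, notably convolutional networks, while Transformers (including ViT) are usually trained with Adam-family optimizers; use AdamW (decoupled weight decay) rather than Adam with an L2 penalty, because Adam's adaptive scaling distorts the L2 term. Learning-rate schedules and warmup are critical for both.
Scaling & efficiency tricks: gradient checkpointing reduces activation memory at the cost of recomputing activations in the backward pass; mixed precision (float16 with loss scaling, or bfloat16) increases throughput; data-parallel training (DDP) combined with gradient accumulation compensates for small per-GPU batch sizes.
Sparse/linear attention variants: Longformer and BigBird use sparse patterns (local windows plus global tokens, with added random links in BigBird) for O(N) cost; Reformer uses locality-sensitive hashing for O(N·log N); Performer approximates softmax attention with random features for O(N). Each trades exact global attention for an approximation with different expressivity guarantees and failure modes (e.g., hash-bucket collisions in Reformer, limited coverage of distant tokens in sparse patterns).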
Evaluation & deployment considerations: measure perplexity and downstream ranking metrics (e.g., NDCG) and monitor inference p99 latency, memory footprint, and A/B test effect on DAU-level engagement when selecting model/fine-tune approach.
Layer selection for fine-tuning: adapters/LoRA are applied to attention projection matrices (e.g., W_q, W_k, W_v, W_o) or MLP layers; earlier layers tend to capture low-level features and later layers task-specific signals, so freeze-vs-adapt decisions affect stability and storage.
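
To make the attention formula at the top of this list concrete, here is a minimal single-head sketch in plain Python (no batching, masking, or learned projections; the toy Q, K, V values are made up for illustration). It is meant for reasoning about shapes and cost, not as a production implementation.

```python
import math

def softmax(row):
    # Subtract the max before exponentiating for numerical stability.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for one head.

    Q, K are N x d_k and V is N x d_v, given as lists of lists.
    """
    d_k = len(K[0])
    scale = 1.0 / math.sqrt(d_k)
    # N x N logits: this matrix is the quadratic-memory term.
    logits = [[scale * sum(a * b for a, b in zip(q, k)) for k in K] for q in Q]
    weights = [softmax(row) for row in logits]
    # Each output row is a weighted average of the rows of V.
    out = [[sum(w * v[j] for w, v in zip(w_row, V)) for j in range(len(V[0]))]
           for w_row in weights]
    return out, weights

# Toy inputs (illustrative values only).
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
out, weights = attention(Q, K, V)
print(weights[0], out[0])
```

Each query attends most strongly to the key it matches, so the first output row leans toward the first row of V. The nested comprehension that builds `logits` is where the N^2 term comes from.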

Worked example — Explain self-attention, LoRA, Adam vs SGD, ViT

First 30 seconds: clarify scope (training from scratch vs fine-tuning; target modality: vision/text; dataset size and latency constraints). State assumptions: pre-trained Transformer backbone available, target is fine-tuning for personalization with latency <200ms.

Skeleton of response:

Explain self-attention math (QK^T / sqrt(d_k) → softmax → V) and O(N^2) cost; mention multi-head utility and the positional-encoding choice (absolute vs relative).
Describe LoRA: parameterize weight update as low-rank BA, show rank r tradeoff and which matrices to target (usually attention projections or MLP).
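Compare Adam vs SGD: convergence speed, hyperparameter sensitivity, and generalization; recommend AdamW with warmup for large LMs and for training ViT from scratch, and SGD+momentum mainly for convolutional vision models or for fine-tuning a pretrained ViT (the original ViT paper pre-trained with Adam and fine-tuned with SGD+momentum).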
Sketch ViT specifics: patching, positional embedding, need for large pretraining or hybrid conv stem, and common augmentations.
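
The ViT patching point can be backed with quick arithmetic. The sketch below assumes square images, square non-overlapping patches, and one extra class token; the helper name `vit_sequence_length` is made up for this example.

```python
def vit_sequence_length(image_size, patch_size, add_cls_token=True):
    """Number of tokens a ViT encoder sees for a square image and square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2
    return num_patches + (1 if add_cls_token else 0)

for patch in (32, 16, 8):
    n = vit_sequence_length(224, patch)
    # Attention scores per head per layer grow with n squared.
    print(f"patch={patch:2d}  tokens={n:4d}  score entries per head={n * n}")
```

Halving the patch size roughly quadruples the token count and multiplies the attention-score entries by about sixteen, which is the resolution-versus-cost tradeoff to mention.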

Flag one tradeoff: LoRA reduces stored checkpoint size and enables many task adapters, but may underperform full fine-tuning if r is too small or the task requires large representational shifts. Close by proposing an empirical plan: run a small hyperparameter sweep over rank r, learning rate, and which layers to adapt; with more time, add ablations comparing adapter types and measure inference latency and A/B metrics.
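
A small sketch can anchor the rank discussion. It assumes a single weight matrix W0 of shape d_out x d_in, with B of shape d_out x r initialized to zero and A of shape r x d_in initialized with small Gaussian noise (the initialization described in the LoRA paper). The helper names and the 4096 x 4096 layer size are illustrative assumptions, and the `scale` argument stands in for the alpha / r factor.

```python
import random

def matmul(A, B):
    """Plain-Python matrix product of A (m x k) and B (k x n)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def lora_param_counts(d_out, d_in, r):
    """Trainable parameters: full fine-tuning vs a rank-r LoRA update BA."""
    full = d_out * d_in
    lora = r * (d_in + d_out)  # B is d_out x r, A is r x d_in
    return full, lora

def merged_weight(W0, B, A, scale=1.0):
    """W' = W0 + scale * (B @ A); merging adds no extra matmul at inference."""
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W0, BA)]

random.seed(0)
d_out, d_in, r = 4, 4, 2
W0 = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]
B = [[0.0] * r for _ in range(d_out)]  # zero init: adapted model starts equal to the base

assert merged_weight(W0, B, A) == W0

for rank in (4, 8, 16):
    full, lora = lora_param_counts(4096, 4096, rank)
    print(f"r={rank:2d}  LoRA params={lora}  fraction of full={lora / full:.4%}")
```

The trainable-parameter count grows linearly in r, so a sweep over a few ranks is cheap in storage terms; whether quality keeps improving with r is the empirical question.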

A second angle — Explain tokenization and Transformer variants

Frame clarifying questions: is the input multilingual, is tokenization subword or byte-level, and are long contexts required? Emphasize that tokenization drives embedding-matrix size and sequence length: byte-level tokenization removes OOVs but increases N, which is costly under O(N^2) attention. For long-context needs, present sparse/linear attention families (e.g., BigBird for block-sparse attention with global tokens, Performer for kernelized linear attention) and discuss their theoretical tradeoffs: BigBird combines local-window, global, and random attention, and its authors show that this sparse pattern retains the universal-approximation property of full attention for sequence functions; Performer approximates the softmax kernel with random feature maps, reducing memory but introducing approximation error. For vision, patch size in ViT trades spatial resolution against sequence length. Tie every tokenization and architecture choice to downstream metrics (latency, memory, perplexity, or NDCG).
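
The sequence-length effect of tokenization granularity can be shown with a rough comparison. The sketch uses raw UTF-8 bytes (the worst case for a byte-level scheme, before any BPE merges) against whitespace-split words (a crude proxy for a coarse vocabulary); real tokenizers fall somewhere in between. The input string and the helper name are made up for this example.

```python
def attention_cost_ratio(n_long, n_short):
    """Relative N x N attention work of the longer sequence versus the shorter one."""
    return (n_long / n_short) ** 2

text = "Recommandé pour vous: séries coréennes"  # hypothetical input string
byte_tokens = list(text.encode("utf-8"))  # one token per UTF-8 byte, no merges
word_tokens = text.split()                # whitespace words, a crude proxy

ratio = attention_cost_ratio(len(byte_tokens), len(word_tokens))
print(len(byte_tokens), len(word_tokens), round(ratio, 1))
```

Because accented characters take more than one byte in UTF-8, the byte sequence is longer than the character count, and the quadratic attention term amplifies that gap.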

Common pitfalls

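Pitfall: Conflating Adam with AdamW. Saying "Adam includes weight decay" without distinguishing an L2 penalty added to the gradient (which Adam's adaptive scaling then rescales per parameter) from decoupled decay applied directly to the weights (AdamW) leads to incorrect recommendations; specify the optimizer variant and the decay implementation.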
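
The difference is easiest to see on a single scalar weight. The following sketch implements one Adam step with either an L2 penalty folded into the gradient or AdamW-style decoupled decay; it is a simplified illustration with assumed hyperparameters, not a drop-in replacement for a framework optimizer.

```python
import math

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
              weight_decay=0.0, decoupled=False):
    """One Adam update on a scalar weight.

    decoupled=False: L2 penalty folded into the gradient (Adam + L2).
    decoupled=True:  AdamW-style decay applied directly to the weight.
    """
    if not decoupled:
        grad = grad + weight_decay * w
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)  # bias correction
    v_hat = v / (1 - beta2 ** t)
    step = m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        step = step + weight_decay * w  # decay bypasses the adaptive scaling
    return w - lr * step, m, v

# Zero task gradient, so only the decay term moves the weight.
lr, wd = 1e-3, 0.01
for w0 in (1.0, 10.0):
    w_l2, _, _ = adam_step(w0, 0.0, 0.0, 0.0, t=1, lr=lr, weight_decay=wd)
    w_dec, _, _ = adam_step(w0, 0.0, 0.0, 0.0, t=1, lr=lr, weight_decay=wd,
                            decoupled=True)
    print(f"w0={w0:5.1f}  L2 shrink={w0 - w_l2:.2e}  decoupled shrink={w0 - w_dec:.2e}")
```

With the L2 form, the adaptive normalization makes both weights shrink by about lr regardless of their size; with decoupled decay, the shrink is proportional to the weight, which is the behavior people usually mean by "weight decay".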

Pitfall: Ignoring sequence-length scaling. Recommending vanilla Transformer for very long contexts without acknowledging O(N^2) memory will get called out; quantify N where quadratic cost becomes infeasible for your infra.
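
A back-of-the-envelope estimate is usually enough to quantify this. The sketch below assumes the N x N score matrix is fully materialized for every head of one layer in 2-byte precision; the head count of 16 is an illustrative assumption, and fused attention kernels that avoid materializing the full matrix would change the picture.

```python
def attention_score_bytes(seq_len, num_heads, batch_size=1, bytes_per_element=2):
    """Memory for the N x N attention scores of ONE layer if fully materialized."""
    return batch_size * num_heads * seq_len * seq_len * bytes_per_element

for n in (1024, 8192, 65536):
    gib = attention_score_bytes(n, num_heads=16) / 2**30
    print(f"N={n:6d}  scores per layer ~ {gib:.2f} GiB")
```

Stating the N at which this number exceeds your accelerator's memory is a stronger answer than saying "quadratic is expensive".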

Pitfall: High-level prose over empirical plan. Saying "use LoRA" without proposing rank choices, which layers to adapt, or how to validate (holdout metric and latency) makes answers shallow; always propose a small experiment and metrics.
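
For the validation side of such a plan, a holdout ranking metric like NDCG@k is simple to compute. This sketch uses the linear-gain form with a log2 discount (some teams use 2^rel - 1 as the gain instead); the two rankings compared at the end are hypothetical outputs of two adapter configurations, not real results.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain with the common log2(rank + 1) discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """relevances: graded relevance of items in the order the model ranked them."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical holdout rankings from two LoRA configurations.
ranking_config_a = [3, 2, 0, 1]
ranking_config_b = [0, 1, 3, 2]
print(ndcg_at_k(ranking_config_a, k=4), ndcg_at_k(ranking_config_b, k=4))
```

Reporting this metric next to p99 latency for each configuration in the sweep turns "use LoRA" into a testable proposal.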

Connections

Interviewers might pivot to retrieval-augmented generation (RAG) for long-context problems or to model serving topics like quantization and TensorRT to meet latency/p99 SLAs. They may also ask about pretraining objectives (masked vs autoregressive) because optimizer and tokenization choices interact with objective design.
