Explain Transformer Attention Fundamentals
Company: Adobe
Role: Machine Learning Engineer
Category: Machine Learning
Difficulty: hard
Interview Round: Onsite
In a machine learning fundamentals interview, explain the core mechanics of Transformer models and modern large language model training. Address the following topics:
1. Compare encoder-only, decoder-only, and encoder-decoder Transformers.
2. Explain bidirectional self-attention versus causal masked self-attention, and relate them to semantic understanding and autoregressive generation.
3. Explain how encoder-decoder cross-attention works.
4. Explain why encoder models generally do not need a key-value cache.
5. Walk through the full operation order inside multi-head attention, including Q/K/V projections, tensor reshaping, attention score computation, masking, softmax, value aggregation, head concatenation, and output projection.
6. Explain why multi-head attention usually has four linear projections.
7. Analyze the computational complexity of self-attention and explain why it becomes quadratic in sequence length.
8. Explain how a key-value cache works during autoregressive inference, why K and V are cached but Q is not, why it helps decoder-only models, and what memory tradeoffs it introduces.
9. Compare FP16, BF16, and FP32 for mixed precision training, including tensor core acceleration, loss scaling, FP32 master weights, and why BF16 is often preferred for large language models.
10. Explain common large language model training, alignment, and evaluation workflows, including direct preference optimization, teacher-student distillation, offline teacher generation with online student serving, LLM-as-a-judge evaluation, structured rubrics, offline versus online metrics, reward hacking, evaluator bias, alignment drift, and hallucination regression.
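The operation order asked for in topic 5 can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `multi_head_attention`, the weight names `w_q`, `w_k`, `w_v`, `w_o`, and the toy sizes are all assumptions chosen for the example, and biases, dropout, and padding masks are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads, causal=False):
    """Hypothetical helper: x is (batch, seq_len, d_model); weights are (d_model, d_model)."""
    batch, seq_len, d_model = x.shape
    d_head = d_model // num_heads

    # 1. Q/K/V projections (three of the four linear projections).
    q, k, v = x @ w_q, x @ w_k, x @ w_v

    # 2. Reshape to (batch, num_heads, seq_len, d_head).
    def split_heads(t):
        return t.reshape(batch, seq_len, num_heads, d_head).transpose(0, 2, 1, 3)
    q, k, v = split_heads(q), split_heads(k), split_heads(v)

    # 3. Scaled dot-product scores: (batch, num_heads, seq_len, seq_len).
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(d_head)

    # 4. Causal mask: position i may not attend to positions j > i.
    if causal:
        future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)

    # 5. Softmax over the key axis.
    weights = softmax(scores, axis=-1)

    # 6. Value aggregation: weighted sum of value vectors per head.
    context = weights @ v

    # 7. Concatenate heads back to (batch, seq_len, d_model).
    context = context.transpose(0, 2, 1, 3).reshape(batch, seq_len, d_model)

    # 8. Output projection (the fourth linear projection).
    return context @ w_o, weights

# Toy usage with random weights (sizes are arbitrary).
rng = np.random.default_rng(0)
batch, seq_len, d_model, num_heads = 2, 5, 16, 4
x = rng.standard_normal((batch, seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4))
out, weights = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads, causal=True)
```

The `scores` tensor has shape (batch, num_heads, seq_len, seq_len), which is where the quadratic cost in sequence length from topic 7 comes from. Setting `causal=False` gives the bidirectional variant used by encoder models.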
Quick Answer: This question evaluates understanding of Transformer attention mechanics: encoder/decoder distinctions, bidirectional versus causal self-attention, cross-attention, key-value caching, multi-head attention internals, and computational complexity. It also covers mixed-precision training trade-offs (FP16/BF16/FP32) and large language model training, alignment, and evaluation workflows. Together these assess competence in model architecture, numerical precision, memory/compute optimization, and alignment methodology. It is commonly asked in Machine Learning interviews to probe reasoning about architectural and implementation trade-offs, performance and memory complexity, and evaluation/alignment risks, spanning both conceptual understanding and practical application.
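For the key-value caching part, a small single-head sketch may help make the idea concrete. It assumes one attention layer with no batching; `KVCache`, `decode_step`, `causal_attention_full`, and `kv_cache_bytes` are hypothetical names invented for this example, and the model sizes passed to `kv_cache_bytes` are made up rather than taken from any particular model.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention_full(x, w_q, w_k, w_v):
    """Baseline: recompute single-head causal attention over the whole sequence."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v

class KVCache:
    """Hypothetical cache that grows by one key row and one value row per token."""
    def __init__(self, d_head):
        self.keys = np.empty((0, d_head))
        self.values = np.empty((0, d_head))

    def append(self, k_new, v_new):
        self.keys = np.vstack([self.keys, k_new])
        self.values = np.vstack([self.values, v_new])

def decode_step(x_new, cache, w_q, w_k, w_v):
    """Process one new token of shape (1, d_model) using cached K and V."""
    # Only the newest token's query is needed, so Q is never cached.
    q_new = x_new @ w_q
    # Past keys and values do not change under causal masking, so they are reused.
    cache.append(x_new @ w_k, x_new @ w_v)
    scores = q_new @ cache.keys.T / np.sqrt(q_new.shape[-1])  # shape (1, t)
    return softmax(scores) @ cache.values

def kv_cache_bytes(num_layers, num_kv_heads, d_head, seq_len, batch, bytes_per_elem):
    # Factor of 2 accounts for storing both K and V.
    return 2 * num_layers * num_kv_heads * d_head * seq_len * batch * bytes_per_elem

# Toy usage: token-by-token decoding matches the full causal recomputation.
rng = np.random.default_rng(0)
seq_len, d_model = 6, 8
x = rng.standard_normal((seq_len, d_model))
w_q, w_k, w_v = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(3))

cache = KVCache(d_model)
cached_out = np.vstack(
    [decode_step(x[t:t + 1], cache, w_q, w_k, w_v) for t in range(seq_len)]
)
full_out = causal_attention_full(x, w_q, w_k, w_v)

# Memory estimate for a made-up configuration with 2-byte (FP16/BF16) elements.
example_bytes = kv_cache_bytes(32, 32, 128, 4096, 1, 2)
```

Each decode step does work proportional to the current prefix length instead of recomputing the full score matrix, at the cost of memory that grows linearly with sequence length, batch size, and layer count. For the made-up configuration above, the estimate comes to 2 GiB per sequence. An encoder processes the whole input in one pass with bidirectional attention, so there is no incremental step for a cache to speed up.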