Explain attention variants and their tradeoffs
Company: Startups.Com
Role: Machine Learning Engineer
Category: Machine Learning
Difficulty: medium
Interview Round: Onsite
You are asked to explain and reason about modern Transformer attention mechanisms.
1) **Scaled dot-product attention**
- Define the operation mathematically (including the scaling term) and explain why scaling is used.
- Provide the typical tensor shapes for `Q`, `K`, `V` in a batched setting (a minimal sketch follows this item).
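A minimal NumPy sketch, assuming a batch-first layout where `Q` is `(batch, L_q, d)` and `K`, `V` are `(batch, L_k, d)` / `(batch, L_k, d_v)`. The `1/sqrt(d)` term keeps the dot products from growing with the head dimension, which would otherwise push the softmax into saturated, low-gradient regions:

```python
import numpy as np
from scipy.special import softmax

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d)  # (batch, L_q, L_k)
    weights = softmax(scores, axis=-1)                # each query row sums to 1
    return weights @ V                                # (batch, L_q, d_v)
```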
2) **Multi-head attention (MHA)**
- Explain how MHA differs from single-head attention, including the projection matrices, per-head computation, concatenation, and output projection.
- Discuss compute and memory complexity with respect to sequence length `L`, number of heads `H`, and head dimension `d` (see the sketch after this item).
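A sketch of MHA under the common assumption `d_model = H * d`, with illustrative square projection matrices `Wq`, `Wk`, `Wv`, `Wo` of shape `(d_model, d_model)`. The per-head score tensor is `(B, H, L, L)`, so self-attention costs on the order of `H * L^2 * d` compute and `H * L^2` memory for the attention matrices per layer:

```python
import numpy as np
from scipy.special import softmax

def multi_head_attention(x, Wq, Wk, Wv, Wo, H):
    """x: (B, L, d_model); Wq/Wk/Wv/Wo: (d_model, d_model); H heads of size d = d_model // H."""
    B, L, d_model = x.shape
    d = d_model // H

    def split_heads(t):                                # (B, L, d_model) -> (B, H, L, d)
        return t.reshape(B, L, H, d).transpose(0, 2, 1, 3)

    Q, K, V = split_heads(x @ Wq), split_heads(x @ Wk), split_heads(x @ Wv)
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d)   # (B, H, L, L)
    heads = softmax(scores, axis=-1) @ V               # (B, H, L, d), computed independently per head
    concat = heads.transpose(0, 2, 1, 3).reshape(B, L, d_model)  # concatenate heads
    return concat @ Wo                                 # output projection
```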
3) **Grouped-query attention (GQA)**
- Define GQA and explain how it differs from:
- MHA (each head has its own K/V)
- Multi-query attention (MQA: all heads share one K/V)
- Explain why GQA is commonly used for LLM inference (see the sketch after this item).
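A sketch showing that the three variants differ only in the number of K/V heads `G` (same layout and illustrative weight names as above): `G == H` recovers MHA, `G == 1` is MQA, and `1 < G < H` is GQA. Each K/V head is shared by `H // G` consecutive query heads, so the tensors that the KV cache must store shrink by a factor of `H / G`:

```python
import numpy as np
from scipy.special import softmax

def grouped_query_attention(x, Wq, Wk, Wv, Wo, H, G):
    """x: (B, L, d_model); Wq/Wo: (d_model, d_model); Wk/Wv: (d_model, G * d); G must divide H."""
    B, L, d_model = x.shape
    d = d_model // H
    Q = (x @ Wq).reshape(B, L, H, d).transpose(0, 2, 1, 3)  # (B, H, L, d)
    K = (x @ Wk).reshape(B, L, G, d).transpose(0, 2, 1, 3)  # (B, G, L, d) -- what the KV cache holds
    V = (x @ Wv).reshape(B, L, G, d).transpose(0, 2, 1, 3)  # (B, G, L, d)
    # Broadcast each K/V head to the H // G query heads in its group.
    K = np.repeat(K, H // G, axis=1)                        # (B, H, L, d)
    V = np.repeat(V, H // G, axis=1)
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d)        # (B, H, L, L)
    heads = softmax(scores, axis=-1) @ V                    # (B, H, L, d)
    return heads.transpose(0, 2, 1, 3).reshape(B, L, d_model) @ Wo
```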
4) **When would you choose MHA vs GQA vs MQA?**
- Discuss quality/expressiveness tradeoffs, KV-cache size, memory bandwidth, and practical deployment constraints (a worked cache-size example follows).
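A back-of-the-envelope comparison of KV-cache size, using assumed, illustrative numbers roughly matching a 7B-class decoder: 32 layers, 32 query heads, head dimension 128, fp16 cache, batch 8, context length 8192. Only the number of K/V heads changes across the variants:

```python
def kv_cache_gib(n_kv_heads, n_layers=32, head_dim=128, seq_len=8192,
                 batch=8, bytes_per_value=2):
    """Bytes per token = 2 (K and V) * n_layers * n_kv_heads * head_dim * bytes_per_value."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return batch * seq_len * per_token / 2**30

for name, kv_heads in [("MHA, 32 KV heads", 32), ("GQA, 8 KV heads", 8), ("MQA, 1 KV head", 1)]:
    print(f"{name}: {kv_cache_gib(kv_heads):.1f} GiB")
# MHA ~32 GiB, GQA ~8 GiB, MQA ~1 GiB for this assumed configuration. Smaller caches mean
# less memory traffic per decoded token, which is typically the binding constraint in
# autoregressive serving -- hence GQA as the common middle ground.
```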
Quick Answer: This question evaluates understanding of Transformer attention mechanisms (scaled dot-product, multi-head, grouped-query, and multi-query attention) and measures competency in model internals: tensor shapes, compute and memory complexity, and inference-time deployment trade-offs.