You are interviewing for an LLM-focused role.
- FlashAttention; a reference sketch follows these questions
  - Explain what problem it solves in transformer attention.
  - Describe the high-level idea (how it reduces memory traffic) and its complexity implications.
  - When would you expect the biggest speedups, and what are practical limitations?
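For reference when probing answers, a minimal NumPy sketch of the tiled, online-softmax idea behind FlashAttention. This is not the fused GPU kernel: the block size, single head, absence of masking/dropout, and float64 arithmetic are all simplifications for illustration.

```python
import numpy as np

def tiled_attention(q, k, v, block_size=64):
    """softmax(q @ k.T / sqrt(d)) @ v, computed one key/value block at a time
    with a running max and running denominator per query row, so the full
    N x N score matrix is never materialized."""
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(q)
    row_max = np.full(n, -np.inf)   # running max of scores seen so far
    row_sum = np.zeros(n)           # running softmax denominator
    for start in range(0, n, block_size):
        kb = k[start:start + block_size]        # (B, d) key block
        vb = v[start:start + block_size]        # (B, d) value block
        scores = (q @ kb.T) * scale             # (N, B): scores for this block only
        new_max = np.maximum(row_max, scores.max(axis=1))
        correction = np.exp(row_max - new_max)  # rescale old accumulators to the new max
        p = np.exp(scores - new_max[:, None])   # unnormalized probabilities
        row_sum = row_sum * correction + p.sum(axis=1)
        out = out * correction[:, None] + p @ vb
        row_max = new_max
    return out / row_sum[:, None]

# Sanity check against naive attention.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((256, 32)) for _ in range(3))
s = q @ k.T / np.sqrt(32)
w = np.exp(s - s.max(axis=1, keepdims=True))
ref = (w / w.sum(axis=1, keepdims=True)) @ v
assert np.allclose(tiled_attention(q, k, v), ref)
```

The actual kernel performs the same accumulation in on-chip SRAM and never writes the N x N score matrix to HBM, which is where the memory-traffic savings come from; its backward pass recomputes scores from saved softmax statistics rather than storing them.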
- KV cache (key/value cache) in decoding; a reference sketch follows these questions
  - Explain why KV caching is needed for autoregressive generation.
  - What is stored, how it changes per generated token, and how it affects time/memory complexity.
  - What are common optimizations (e.g., quantization, paging, chunking), and what trade-offs do they introduce?
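For reference, a minimal single-head sketch of what a KV cache stores and how it grows. NumPy stands in for framework tensors, and the class, shapes, and toy decoding loop are illustrative only (no batching, layers, or real projections).

```python
import numpy as np

class KVCache:
    """Per-head cache of past keys/values (hypothetical minimal interface)."""

    def __init__(self, d_head):
        self.keys = np.empty((0, d_head))    # (t, d_head) after t generated tokens
        self.values = np.empty((0, d_head))

    def append(self, k_new, v_new):
        # One extra row per generated token: memory grows linearly with length.
        self.keys = np.vstack([self.keys, k_new[None, :]])
        self.values = np.vstack([self.values, v_new[None, :]])

    def attend(self, q_new):
        # The single new query attends over every cached key/value.
        scores = self.keys @ q_new / np.sqrt(self.keys.shape[1])   # (t,)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ self.values                               # (d_head,)

# Toy decoding loop: random vectors stand in for the per-token projections.
rng = np.random.default_rng(0)
cache = KVCache(d_head=16)
for step in range(5):
    k_new, v_new, q_new = (rng.standard_normal(16) for _ in range(3))
    cache.append(k_new, v_new)     # only the new token's K/V are computed
    context = cache.attend(q_new)  # past K/V are reused, never recomputed
    print(step, cache.keys.shape, context.shape)
```

Because only the new token's query attends over the cache, each step costs roughly O(t*d) rather than recomputing attention over the whole prefix, while memory grows linearly with sequence length across every layer and head; that growth is what quantized, paged, or chunked caches trade accuracy, indirection, or latency to control.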
- RoPE (Rotary Positional Embeddings); a reference sketch follows these questions
  - Explain how RoPE encodes position information compared to absolute embeddings.
  - Why does it help with extrapolation to longer contexts (relative-position behavior)?
  - How does it interact with attention computation (rotation of queries and keys), and what are common variants/edge cases?
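For reference, a minimal NumPy sketch of applying RoPE to a single query or key vector, using the interleaved-pair convention from the original formulation (some implementations instead pair the first and second halves of the head dimension, one of the edge cases worth asking about). The base of 10000 is the common default; sizes and positions are illustrative.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate a query/key vector x (even length d) to absolute position `pos`:
    each adjacent pair of dimensions is rotated by the angle pos * base**(-2i/d)."""
    d = x.shape[0]
    inv_freq = base ** (-np.arange(0, d, 2) / d)   # (d/2,) per-pair frequencies
    angles = pos * inv_freq
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]                      # interleaved rotation pairs
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin                # 2-D rotation of each pair
    out[1::2] = x1 * sin + x2 * cos
    return out

# Relative-position check: shifting both positions by the same offset
# leaves the rotated dot product (and hence the attention score) unchanged.
rng = np.random.default_rng(0)
q, k = rng.standard_normal(64), rng.standard_normal(64)
assert np.isclose(rope(q, 7) @ rope(k, 3), rope(q, 107) @ rope(k, 103))
```

Because the score depends only on the position offset, uniformly shifting or rescaling positions is well defined, which is the property behind long-context adaptations such as position interpolation or changing the base.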