Question
In Transformer-based language model inference, what is a key-value (KV) cache?
Explain:
- What gets cached (tensors, shapes at a high level) and at which layers.
- Why KV caching improves autoregressive decoding latency.
- The difference between prefill (processing the prompt) and decode (generating tokens) phases.
- Tradeoffs and pitfalls: memory growth, batch/sequence management, multi-head attention, and long-context handling.
- At least two practical optimizations used in production (e.g., paged attention, quantized KV cache, sliding window).
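For context, here is a minimal, framework-agnostic sketch of the mechanism the question is probing: a per-layer K/V cache that is filled once during prefill and then appended to at each decode step. All names, shapes, and the random stand-in tensors are illustrative assumptions, not any particular library's API.

```python
import numpy as np

# Illustrative sketch: a K/V cache for a single attention layer during greedy decoding.
# Shapes and names are assumptions; random arrays stand in for projected activations.
n_heads, head_dim = 8, 64

def attend(q, k_cache, v_cache):
    # q: (n_heads, head_dim); caches: (n_heads, t, head_dim), t = tokens seen so far.
    scores = np.einsum("hd,htd->ht", q, k_cache) / np.sqrt(head_dim)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return np.einsum("ht,htd->hd", weights, v_cache)

# Prefill: compute K/V for all prompt tokens in one pass and store them.
prompt_len = 16
k_cache = np.random.randn(n_heads, prompt_len, head_dim)  # stand-in for K of the prompt
v_cache = np.random.randn(n_heads, prompt_len, head_dim)  # stand-in for V of the prompt

# Decode: each new token computes only its own q/k/v and appends k/v to the cache,
# so per-step attention cost grows with cache length instead of recomputing the
# full sequence every step.
for step in range(4):
    q_new = np.random.randn(n_heads, head_dim)       # stand-in for the new token's query
    k_new = np.random.randn(n_heads, 1, head_dim)
    v_new = np.random.randn(n_heads, 1, head_dim)
    k_cache = np.concatenate([k_cache, k_new], axis=1)
    v_cache = np.concatenate([v_cache, v_new], axis=1)
    out = attend(q_new, k_cache, v_cache)             # (n_heads, head_dim)
    print("cache length:", k_cache.shape[1], "output shape:", out.shape)
```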