# Explain KV cache in Transformer inference
Company: OpenAI
Role: Machine Learning Engineer
Category: Software Engineering Fundamentals
Difficulty: medium
Interview Round: Onsite
## Question
In Transformer-based language model inference, what is a **key-value (KV) cache**?
Explain:
- What gets cached (tensors, shapes at a high level) and at which layers.
- Why KV caching improves autoregressive decoding latency.
- The difference between **prefill** (processing the prompt) and **decode** (generating tokens) phases.
- Tradeoffs and pitfalls: memory growth, batch/sequence management, multi-head attention, and long-context handling.
- At least two practical optimizations used in production (e.g., paged attention, quantized KV cache, sliding window).
**Quick Answer:** This question evaluates understanding of KV cache mechanisms in Transformer inference, including attention-state caching, memory and latency trade-offs, and engineering optimizations for autoregressive decoding.
## Solution
### What KV cache is
In a decoder-only Transformer doing autoregressive generation, each layer computes self-attention over all previously seen tokens.
- For each token position *t*, each attention layer produces:
- **K** (keys) and **V** (values) vectors (per head) from the hidden states.
- During generation, when you append a new token, you *do not want to recompute* K and V for all past tokens at every step.
A **KV cache** stores the past tokens’ **K and V tensors** for each Transformer layer (and for each attention head) so that at the next decoding step you only compute K/V for the **new token** and reuse the cached ones.
High-level shape intuition (one layer; a minimal sketch follows after this list):
- Without cache, at step *t* you build K,V for all positions 1..t.
- With cache, you maintain:
- `K_cache: [batch, heads, seq_len_so_far, head_dim]`
- `V_cache: [batch, heads, seq_len_so_far, head_dim]`
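A minimal PyTorch-style sketch of this cache bookkeeping, assuming illustrative sizes and a hypothetical `append_kv` helper (real implementations usually preallocate a buffer up to a maximum sequence length instead of concatenating):

```python
import torch

# Illustrative sizes (assumptions, not taken from the question).
batch, heads, head_dim = 2, 8, 64

# Empty per-layer cache; the sequence axis grows as tokens are generated.
k_cache = torch.empty(batch, heads, 0, head_dim)
v_cache = torch.empty(batch, heads, 0, head_dim)

def append_kv(k_cache, v_cache, k_new, v_new):
    """Append K/V for the newly computed position(s) along the sequence axis.

    k_new, v_new: [batch, heads, new_tokens, head_dim]
    """
    k_cache = torch.cat([k_cache, k_new], dim=2)
    v_cache = torch.cat([v_cache, v_new], dim=2)
    return k_cache, v_cache

# One decode step adds exactly one position per sequence.
k_new = torch.randn(batch, heads, 1, head_dim)
v_new = torch.randn(batch, heads, 1, head_dim)
k_cache, v_cache = append_kv(k_cache, v_cache, k_new, v_new)
print(k_cache.shape)  # torch.Size([2, 8, 1, 64])
```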
---
### Why it speeds up decoding
Autoregressive decoding generates one token at a time. Naively, each new token would require recomputing attention projections for the entire prefix, which is wasted work.
With KV cache:
- Each step computes Q,K,V only for the new token (plus the MLP, etc.).
- Attention for the new token uses cached K/V for previous positions.
Complexity intuition per layer:
- **No cache**: at step *t*, recompute K/V projections for all O(t) prefix tokens → total projection work over T generated tokens grows roughly quadratically.
- **With cache**: projections for O(1) tokens (just the new one) each step → roughly linear total projection work; attention still attends over O(t) cached keys, but you avoid redoing the K/V projections for the past.
This usually reduces per-token latency substantially, especially for large prompts and many decoding steps.
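A rough back-of-the-envelope comparison of the projection work alone (all sizes below are assumptions for illustration; it ignores attention-score computation, the MLP, and memory traffic):

```python
# Projection cost for generating T tokens after a prompt of length P,
# counting only Q/K/V projections (~2 * d * d multiply-adds each) in one layer.
d = 4096          # hidden size (assumed)
P, T = 1000, 500  # prompt length and generated tokens (assumed)
per_token_proj = 3 * 2 * d * d  # Q, K, V projections for one token

# Without a cache: step t re-projects the entire prefix of length P + t.
no_cache = sum(per_token_proj * (P + t) for t in range(1, T + 1))

# With a cache: project the prompt once (prefill), then one token per step.
with_cache = per_token_proj * (P + T)

print(f"projection FLOPs, no cache / with cache: {no_cache / with_cache:.1f}x")
```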
---
### Prefill vs decode
**Prefill (a.k.a. prompt processing):**
- Input: the full prompt of length P.
- You run the model on the whole prompt to produce the initial hidden states.
- You also populate the KV cache for all P positions at all layers.
- This phase is often compute-bound (large matrix-matrix multiplications over all prompt positions) and benefits from batching.
**Decode (token-by-token generation):**
- Each step adds 1 token.
- You compute K/V for just that new position and append to the cache.
- This phase is latency-sensitive and typically memory-bandwidth bound: each step does small matrix-vector work but must stream the model weights and the growing KV cache. A sketch of both phases with an explicit cache follows below.
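A hedged sketch of the two phases using a Hugging Face `transformers`-style interface (the `gpt2` checkpoint and greedy decoding are placeholder choices, and the exact `past_key_values` container varies by library version):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

prompt_ids = tok("The KV cache stores", return_tensors="pt").input_ids

with torch.no_grad():
    # Prefill: one pass over the whole prompt populates the cache for every layer.
    out = model(prompt_ids, use_cache=True)
    past = out.past_key_values
    next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)

    generated = [next_id]
    for _ in range(20):
        # Decode: feed only the newest token; reuse cached K/V for the prefix.
        out = model(next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        generated.append(next_id)

print(tok.decode(torch.cat(generated, dim=1)[0]))
```

Prefill is one large batched pass over all prompt positions; each decode call then touches only a single new position per sequence, which is why decode benefits most from the cache.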
---
### Tradeoffs and pitfalls
1. **Memory growth**
- KV cache size scales with `layers × batch × seq_len × heads × head_dim × 2 (K and V) × bytes per element`; see the sizing sketch after this list.
- Long context or large batch sizes can make KV memory, rather than the weights, the bottleneck.
2. **Batching complications**
- Different sequences have different lengths and may finish at different times.
- You need careful bookkeeping to avoid wasting cache space for finished sequences.
3. **Beam search / sampling variants**
- Beam search multiplies the cache by beam width unless you share and branch carefully.
4. **Position encoding interactions**
- Absolute vs rotary/relative encodings: caching itself is straightforward, but positional information must be applied consistently to cached positions; with rotary embeddings, for example, keys are typically cached after the rotation, so position indices must line up across prefill and decode.
5. **Attention masking**
- Causal mask must ensure tokens only attend to earlier positions.
- For packed batches (multiple sequences in one tensor), you need per-sequence masks.
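A quick sizing sketch for the memory-growth point above, with assumed, roughly 7B-class numbers:

```python
# KV bytes = layers * batch * seq_len * heads * head_dim * 2 (K and V) * bytes/elem
layers, heads, head_dim = 32, 32, 128   # assumed 7B-class configuration
batch, seq_len = 8, 4096                # assumed serving scenario
bytes_per_elem = 2                      # FP16/BF16

kv_bytes = layers * batch * seq_len * heads * head_dim * 2 * bytes_per_elem
print(f"KV cache: {kv_bytes / 2**30:.1f} GiB")  # 16.0 GiB for this configuration
```

That is on the same order as the FP16 weights of a 7B model, which is why long contexts and large batches can turn the KV cache into the dominant memory consumer.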
---
### Practical optimizations in production
1. **Paged attention / block-wise KV management**
- Store KV cache in fixed-size memory blocks (“pages”) and map sequences to blocks.
- Reduces fragmentation and makes variable-length batching easier.
- Enables efficient memory reuse when sequences finish.
2. **Quantized KV cache**
- Store K/V in lower precision (e.g., FP8/INT8) while keeping compute in FP16/BF16.
- Saves memory bandwidth and capacity; can improve throughput.
- Requires calibration and careful handling to avoid quality regression.
3. **Sliding window / capped context**
- For very long sequences, keep only the most recent W tokens in cache (or use special long-context methods).
- Trades some long-range dependency for bounded memory; a simplified sketch appears after this list.
4. **KV cache offloading (CPU/NVMe) or multi-GPU partitioning**
- Useful when context is huge; increases latency, so typically paired with chunked attention or retrieval.
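As one concrete illustration of the sliding-window idea, a simplified single-layer sketch (production systems typically use preallocated ring buffers or paged blocks rather than repeated `torch.cat`):

```python
import torch

class SlidingWindowKVCache:
    """Keeps only the most recent `window` positions of K/V for one layer."""

    def __init__(self, window: int):
        self.window = window
        self.k = None  # [batch, heads, seq, head_dim]
        self.v = None

    def append(self, k_new: torch.Tensor, v_new: torch.Tensor):
        if self.k is None:
            self.k, self.v = k_new, v_new
        else:
            self.k = torch.cat([self.k, k_new], dim=2)
            self.v = torch.cat([self.v, v_new], dim=2)
        # Evict the oldest positions once the window is exceeded.
        if self.k.size(2) > self.window:
            self.k = self.k[:, :, -self.window:]
            self.v = self.v[:, :, -self.window:]
        return self.k, self.v

cache = SlidingWindowKVCache(window=4)
for _ in range(6):
    k, v = cache.append(torch.randn(1, 8, 1, 64), torch.randn(1, 8, 1, 64))
print(k.shape)  # torch.Size([1, 8, 4, 64]) -- bounded no matter how many steps run
```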
---
### How to communicate this in an interview
- Define KV cache precisely (cached K/V per layer per position).
- Explain prefill vs decode and why decode benefits most.
- Call out the central tradeoff: **latency vs memory**.
- Mention at least one scaling technique (paged attention, quantization, sliding window) and the operational constraints (variable-length batches, fragmentation, masking).