This question evaluates understanding of key–value (K/V) caching for Transformer decoder inference, in the domain of ML system design. It covers tensor shapes for prefill and per-step decode, per-layer cache APIs, batching and beam-search mechanics, mixed-precision memory management, and complexity analysis.
You are building an autoregressive inference engine for a decoder-only Transformer model. To avoid recomputing self-attention over the full prefix at each decoding step, implement key–value (K/V) caching at the per-layer level.
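To ground the "avoid recomputing" claim, here is a rough per-layer cost comparison at decode step t. This is a sketch under assumed notation (d_model for the model width, t for the current sequence length) and it ignores attention-kernel optimizations:

```latex
% Per decoder layer, at decode step t (sequence currently t tokens long):
\begin{aligned}
\text{no cache (recompute prefix):}\quad
  & \underbrace{O(t \cdot d_{\mathrm{model}}^{2})}_{\text{Q/K/V + output projections}}
  \;+\; \underbrace{O(t^{2} \cdot d_{\mathrm{model}})}_{\text{attention scores and values}} \\
\text{with K/V cache:}\quad
  & \underbrace{O(d_{\mathrm{model}}^{2})}_{\text{projections for the new token only}}
  \;+\; \underbrace{O(t \cdot d_{\mathrm{model}})}_{\text{one query attends to } t \text{ cached keys}}
\end{aligned}
% Summed over T generated tokens, projection work drops from O(T^2 d_model^2) to O(T d_model^2),
% and attention work from O(T^3 d_model) to O(T^2 d_model).
```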
Assume a standard multi-head self-attention decoder with:
- batch size B,
- L decoder layers,
- H attention heads, each with head dimension d_head (model dimension d_model = H · d_head),
- a prefill (prompt) length T_p and a maximum total sequence length T_max.
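One formula worth stating up front is the cache's memory footprint, since it drives the mixed-precision discussion. This is a sketch using the notation above, with s denoting bytes per element (an added symbol; e.g. s = 2 for fp16/bf16) and illustrative, assumed hyperparameter values in the example:

```latex
% K and V caches across all layers, for a batch of B sequences of current length t:
M_{\mathrm{cache}}(t) \approx 2 \cdot L \cdot B \cdot H \cdot t \cdot d_{\mathrm{head}} \cdot s \ \text{bytes}
% Example with assumed values L=32, B=8, H=32, d_head=128, t=4096, s=2 (fp16):
%   2 * 32 * 8 * 32 * 4096 * 128 * 2 bytes = 2^{34} bytes = 16 GiB
```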
Design and implement K/V caching that:
- allocates the cache during prefill and appends exactly one position per layer at each decode step, stating the tensor shapes in both phases,
- exposes a clean per-layer cache API,
- handles batched decoding and beam search, including reordering cache entries when beams are selected or pruned,
- manages memory under mixed precision (e.g., storing K/V in fp16 or bf16),
- and is accompanied by a time/memory complexity analysis with and without caching.
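A minimal sketch of such a per-layer cache in PyTorch follows. The class name LayerKVCache and the prefill/append/read/reorder methods are illustrative assumptions, not an API mandated by the question:

```python
# Minimal per-layer K/V cache sketch in PyTorch (illustrative, not prescriptive).
import torch


class LayerKVCache:
    """Cached keys/values for one decoder layer, shaped [B, H, T, d_head]."""

    def __init__(self, batch_size, num_heads, head_dim, max_len,
                 dtype=torch.float16, device="cpu"):
        shape = (batch_size, num_heads, max_len, head_dim)
        # Preallocate once; decode steps write in place instead of concatenating,
        # which avoids repeated O(t) copies and allocator churn.
        self.k = torch.zeros(shape, dtype=dtype, device=device)
        self.v = torch.zeros(shape, dtype=dtype, device=device)
        self.dtype = dtype
        self.length = 0  # number of positions currently cached

    def prefill(self, k, v):
        """Write the whole prompt at once; k, v are [B, H, T_p, d_head]."""
        t = k.shape[2]
        self.k[:, :, :t] = k.to(self.dtype)
        self.v[:, :, :t] = v.to(self.dtype)
        self.length = t

    def append(self, k_new, v_new):
        """Append one decode step; k_new, v_new are [B, H, 1, d_head]."""
        pos = self.length
        self.k[:, :, pos:pos + 1] = k_new.to(self.dtype)
        self.v[:, :, pos:pos + 1] = v_new.to(self.dtype)
        self.length = pos + 1

    def read(self):
        """Return the valid prefix [B, H, length, d_head] for attention."""
        return self.k[:, :, :self.length], self.v[:, :, :self.length]

    def reorder(self, beam_idx):
        """Beam search: remap the batch/beam dimension after beam selection.

        beam_idx is a LongTensor of shape [B] giving, for each new beam slot,
        the index of the old beam whose history it continues.
        """
        self.k = self.k.index_select(0, beam_idx)
        self.v = self.v.index_select(0, beam_idx)
```

In this sketch, an engine would hold one such cache per layer (a list of L instances): prefill is called once per layer with the prompt's projected K/V, each decode step calls append with the new token's K/V and read before computing attention, and beam search calls reorder on every layer's cache with the same beam_idx after each step.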
Provide a teaching-oriented solution with step-by-step reasoning, formulas where helpful, and code-style pseudocode for the API and critical paths.