Design efficient Transformer inference with KV cache
Company: Startups.Com
Role: Machine Learning Engineer
Category: ML System Design
Difficulty: medium
Interview Round: Onsite
You are implementing autoregressive inference for a decoder-only Transformer.
1) Explain **what the KV cache is**, what tensors are cached per layer, and how it changes computation during incremental decoding (see sketch 1 below).
2) Describe an implementation plan for KV caching (sketch 2 below) that supports:
- Variable sequence lengths in a batch
- Beam search or speculative decoding (where sequences can branch)
- Long contexts (e.g., 32k–128k tokens)
3) Discuss key performance considerations (sketch 3 below):
- Memory layout and writes/reads when appending new K/V
- Avoiding reallocation/copies
- Interaction with fused attention kernels (e.g., FlashAttention-style)
- Precision choices (fp16/bf16/int8) for cache
4) What are common bugs or correctness pitfalls when adding a KV cache (masking, position encodings/RoPE, shape mismatches, etc.)? See sketch 4 below.
Quick Answer: This question evaluates an engineer's ML systems competency: understanding decoder-only Transformer inference, KV cache semantics (which tensors are cached per layer), attention mechanics, memory layout, and correctness under branching and long-context workloads.
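**Sketch 1 (what is cached).** A strong answer to part 1 notes that each decoder layer caches the key and value projections of every past token, so each step computes Q/K/V only for the newest token and attends over the cached prefix. A minimal single-head PyTorch sketch; the function name, weight names, and unbatched-head layout are illustrative assumptions, not a fixed API:

```python
import torch

def attend_with_cache(x_t, w_q, w_k, w_v, cache_k, cache_v):
    """One incremental decode step for one layer (single head for clarity).

    x_t:     (batch, d_model)      hidden state of the newest token
    cache_k: (batch, t, d_model)   keys of the t previous tokens, or None
    cache_v: (batch, t, d_model)   values of the t previous tokens, or None
    """
    q = x_t @ w_q                    # query only for the new token
    k_t = (x_t @ w_k)[:, None, :]    # (batch, 1, d_model)
    v_t = (x_t @ w_v)[:, None, :]
    # Append instead of recomputing K/V for the whole prefix: per-step
    # attention cost for this token drops from O(t^2) to O(t).
    k = k_t if cache_k is None else torch.cat([cache_k, k_t], dim=1)
    v = v_t if cache_v is None else torch.cat([cache_v, v_t], dim=1)
    # No causal mask is needed at decode time: the cache only ever holds
    # positions at or before the current one.
    scores = q[:, None, :] @ k.transpose(1, 2) / k.shape[-1] ** 0.5
    out = torch.softmax(scores, dim=-1) @ v    # (batch, 1, d_model)
    return out[:, 0], k, v                     # caller stores updated k, v
```

In a real multi-head model this is two tensors per layer of roughly shape `(batch, n_heads, seq_len, head_dim)`, each grown by one position per step.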
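**Sketch 2 (implementation plan).** For part 2, one workable design is a preallocated per-layer buffer with per-sequence write positions (handling ragged batches) and an explicit reorder step (handling beam branching). The class below is a sketch under those assumptions, not a library API:

```python
import torch

class KVCache:
    """Preallocated per-layer K/V buffers; an illustrative sketch."""

    def __init__(self, batch, max_len, n_heads, head_dim,
                 dtype=torch.float16, device="cpu"):
        shape = (batch, n_heads, max_len, head_dim)
        self.k = torch.zeros(shape, dtype=dtype, device=device)
        self.v = torch.zeros(shape, dtype=dtype, device=device)
        # Per-sequence write positions make variable-length batches cheap:
        # no re-padding, and attention reads only the first seq_lens[i]
        # positions of row i.
        self.seq_lens = torch.zeros(batch, dtype=torch.long, device=device)

    def append(self, k_t, v_t):
        # k_t, v_t: (batch, n_heads, 1, head_dim). In-place scatter write
        # into the preallocated buffer: no reallocation, no O(seq_len) copy.
        b = torch.arange(k_t.shape[0], device=k_t.device)
        self.k[b, :, self.seq_lens] = k_t[:, :, 0]
        self.v[b, :, self.seq_lens] = v_t[:, :, 0]
        self.seq_lens += 1

    def reorder(self, beam_idx):
        # Beam search: each surviving hypothesis adopts its parent's cache.
        # A paged design would copy block *pointers* here instead of data.
        self.k = self.k.index_select(0, beam_idx)
        self.v = self.v.index_select(0, beam_idx)
        self.seq_lens = self.seq_lens.index_select(0, beam_idx)
```

For 32k–128k contexts, a single contiguous `max_len` buffer wastes memory; production systems instead page the cache into fixed-size blocks indexed by a per-sequence block table (PagedAttention in vLLM), which also lets beams and speculative branches share prefix blocks copy-on-write rather than duplicating them.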
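**Sketch 3 (memory and precision).** For part 3, the dominant long-context cost is cache capacity and bandwidth: cache bytes = 2 (K and V) × layers × heads × head_dim × tokens × bytes/element. For a 7B-class model (32 layers, 32 heads of dim 128) at 128k tokens, that is about 64 GiB per sequence in fp16, which is why int8 (or lower) cache quantization is attractive. A simple per-token absmax scheme, illustrative rather than any specific library's API:

```python
import torch

def quantize_kv(kv: torch.Tensor):
    # kv: (batch, n_heads, seq, head_dim). Per-(token, head) absmax scales
    # keep dequantization cheap and local to each cache row.
    scale = kv.abs().amax(dim=-1, keepdim=True).clamp(min=1e-6) / 127.0
    q = torch.round(kv / scale).clamp(-127, 127).to(torch.int8)
    return q, scale.to(torch.float16)   # int8 halves fp16 cache traffic

def dequantize_kv(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Dequantize on read, ideally fused into the attention kernel so the
    # cache only ever lives in HBM as int8.
    return q.to(torch.float16) * scale
```

Layout matters for fused kernels too: FlashAttention-style kernels stream K/V tiles from HBM, so the cache should keep the kernel's reads contiguous along the sequence (or block) dimension, and appends should be in-place writes into a preallocated or paged buffer rather than `torch.cat`, which reallocates and copies the entire cache every step.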
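**Sketch 4 (a classic correctness pitfall).** For part 4: shape mismatches usually break loudly; position bugs break silently. With RoPE, the new token's query and key must be rotated at their absolute position (the current cache length), not at 0; and the decode step needs no causal mask only because the cache holds exclusively earlier positions, while the prefill pass still does. A sketch using the split-halves (GPT-NeoX-style) rotation convention, which differs from interleaved-pair implementations:

```python
import torch

def rope(x: torch.Tensor, pos: torch.Tensor) -> torch.Tensor:
    """Rotary embedding at absolute positions `pos` (split-halves convention).

    x:   (batch, n_heads, seq, head_dim)
    pos: (seq,) absolute token indices -- NOT always arange(seq) at decode time.
    """
    half = x.shape[-1] // 2
    freqs = 10000.0 ** (-torch.arange(half, device=x.device,
                                      dtype=torch.float32) / half)
    angles = pos.to(x.device)[:, None].float() * freqs   # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin,
                      x1 * sin + x2 * cos], dim=-1).to(x.dtype)

# Pitfall: at decode step t the new token sits at absolute position t
# (the current cache length). Rotating it as position 0 is shape-compatible
# and raises no error, but silently corrupts attention:
#   q_t = rope(q_t, torch.tensor([cache_len]))  # correct
#   q_t = rope(q_t, torch.tensor([0]))          # wrong, runs fine
```

Other pitfalls worth naming in an answer: off-by-one errors in per-sequence lengths when mixing prefill and decode, and forgetting to reorder or invalidate the cache after beam pruning or a rejected speculative draft.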