
Implement KV cache for inference

Last updated: Apr 26, 2026

Quick Overview

This question evaluates understanding of key–value caching for Transformer decoder inference: tensor shapes for prefill and per-step decode, per-layer cache APIs, batching and beam-search mechanics, mixed-precision memory management, and complexity analysis.



Company: Applied Intuition

Role: Machine Learning Engineer

Category: ML System Design

Difficulty: hard

Interview Round: Technical Screen

Design and implement key–value caching for autoregressive inference in a Transformer decoder. Specify tensor shapes and an API for maintaining per-layer caches across decoding steps and batched inputs, handle cache growth and memory limits, and support both greedy and beam search. Provide complexity analysis, expected speedups versus recomputation, and tests for correctness in edge cases (e.g., EOS in the middle of a batch).



Design Task: Key–Value Cache for Transformer Decoder Inference

Context

You are building an autoregressive inference engine for a Transformer decoder-only model. To avoid recomputing self-attention over the full prefix at each decoding step, implement key–value (K/V) caching at the per-layer level.

Assume a standard multi-head self-attention decoder with:

  • Batch size B (or effective batch B_eff = B × beam_width when using beam search)
  • Model dimension d_model, heads H, head dimension d_k = d_model / H
  • Max generation length L_max
  • Mixed precision (float16/bfloat16) support
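
To ground these symbols, here is a minimal sketch of the cache's memory budget, assuming a preallocated layout of [B, H, L_max, d_k] per layer for each of K and V (the layout and the 7B-class numbers below are illustrative assumptions, not mandated by the prompt):

```python
# Minimal sketch: memory budget for a preallocated K/V cache.
# Assumed layout: per layer, K and V each of shape [B, H, L_max, d_k].

def kv_cache_bytes(n_layers: int, B: int, H: int, L_max: int, d_k: int,
                   bytes_per_elem: int = 2) -> int:
    """Total bytes across all layers; the leading 2 counts K plus V,
    and bytes_per_elem=2 corresponds to float16/bfloat16."""
    return 2 * n_layers * B * H * L_max * d_k * bytes_per_elem

# Illustrative 7B-class configuration: 32 layers, H=32, d_k=128 (d_model=4096).
total = kv_cache_bytes(n_layers=32, B=8, H=32, L_max=4096, d_k=128)
print(f"{total / 2**30:.1f} GiB")  # -> 16.0 GiB at B=8 in fp16
```

The budget scales linearly in every factor, which is why long-context serving quickly becomes cache-bound and motivates the paged/block allocation called for in requirement 3 below.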

Requirements

Design and implement K/V caching that:

  1. Specifies clear tensor shapes for prefill (prompt) and per-step decode for both greedy and beam search.
  2. Defines an API to maintain per-layer caches across decoding steps and batched inputs (see the sketch after this list).
  3. Handles cache growth and memory limits (preallocation and an option for paged/block allocation).
  4. Supports greedy search and beam search (expand/reorder caches per step, EOS handling).
  5. Includes complexity analysis and expected speedups versus recomputation.
  6. Includes tests for correctness, including edge cases (e.g., EOS in the middle of a batch, variable prompt lengths, beam reorder correctness).
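
As a reference point for requirements 1, 2, and 4, a minimal sketch of a per-layer cache API (PyTorch-style; the class and method names are illustrative assumptions, not a required interface):

```python
import torch

class KVCache:
    """Preallocated per-layer K/V cache; K and V each [n_layers, B, H, L_max, d_k].

    Minimal sketch: assumes left-padded prompts so every row shares one
    fill length. Paged/block allocation (requirement 3) is out of scope here.
    """

    def __init__(self, n_layers, B, H, L_max, d_k,
                 dtype=torch.float16, device="cuda"):
        shape = (n_layers, B, H, L_max, d_k)
        self.k = torch.empty(shape, dtype=dtype, device=device)
        self.v = torch.empty(shape, dtype=dtype, device=device)
        self.seq_len = 0  # number of positions currently cached

    def prefill(self, layer, k, v):
        """Write the prompt's keys/values for one layer. k, v: [B, H, T, d_k]."""
        T = k.size(2)
        self.k[layer, :, :, :T] = k
        self.v[layer, :, :, :T] = v
        self.seq_len = T

    def append(self, layer, k_step, v_step):
        """Write one decode step's K/V ([B, H, 1, d_k]) and return views of
        the full prefix to attend over: [B, H, seq_len + 1, d_k]."""
        t = self.seq_len
        self.k[layer, :, :, t:t + 1] = k_step
        self.v[layer, :, :, t:t + 1] = v_step
        return self.k[layer, :, :, :t + 1], self.v[layer, :, :, :t + 1]

    def advance(self):
        """Call once per decode step, after every layer has appended."""
        self.seq_len += 1

    def reorder(self, beam_idx):
        """Beam search: make each row follow its selected parent beam.
        beam_idx: LongTensor [B_eff] of parent indices for this step."""
        self.k = self.k[:, beam_idx].contiguous()
        self.v = self.v[:, beam_idx].contiguous()
```

Greedy decoding never calls reorder; beam search typically tiles the prefilled cache to B_eff = B × beam_width and then calls reorder once per step with the chosen parent indices. A common choice for EOS mid-batch is to keep the finished row in place and mask its outputs (forcing pad/EOS thereafter) rather than shrinking the batch, which keeps shapes static.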

Provide a teaching-oriented solution with step-by-step reasoning, formulas where helpful, and code-style pseudocode for the API and critical paths.
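For requirement 5, a back-of-the-envelope comparison: with the cache, decode step t projects Q/K/V for a single token (O(d_model²) per layer) and attends over t cached positions (O(t · d_model)); without the cache, the whole prefix is re-run, costing O(t² · d_model) for attention plus O(t · d_model²) for projections and the MLP. Summed over L steps, caching turns O(L³ · d_model + L² · d_model²) total work into O(L² · d_model + L · d_model²), roughly an L-fold FLOP reduction; measured wall-clock speedups are smaller because cached decode is typically memory-bandwidth bound rather than compute bound.

For requirement 6, the anchor test is equivalence with full recomputation. A hedged sketch, where new_cache, forward_cached, and forward_full are hypothetical model hooks (the first two consume the cache, the last re-encodes the whole sequence):

```python
import torch

def test_cached_matches_recompute(model, prompt_ids, steps=16):
    """Cached greedy decode must match full recomputation step-by-step.

    Hypothetical hooks (assumptions, not a real API):
      model.new_cache(batch)      -> fresh KVCache
      model.forward_cached(x, c)  -> logits, reading/writing cache c
      model.forward_full(x)       -> logits, no cache
    Loose tolerances: fp16 reduction order differs between the two paths.
    """
    cache = model.new_cache(prompt_ids.size(0))
    ids = prompt_ids
    logits = model.forward_cached(ids, cache)            # prefill
    for _ in range(steps):
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=1)
        ref = model.forward_full(ids)                    # recompute baseline
        logits = model.forward_cached(next_id, cache)    # one cached step
        torch.testing.assert_close(logits[:, -1], ref[:, -1],
                                   rtol=1e-2, atol=1e-2)
```

Further edge-case tests worth adding: EOS mid-batch (finished rows keep producing masked outputs and never corrupt neighboring rows), variable prompt lengths (left padding plus an attention mask reproduces unpadded per-sequence outputs), and beam reorder correctness (after reorder(idx), row i's cache equals the pre-reorder cache at idx[i]).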
