LLM Inference Serving, Batching, And KV Cache
Asked of: Software Engineer
Last updated
What's being tested
Interviewers are probing whether you can reason about high-throughput, low-latency serving systems for autoregressive models under real production constraints. The core skill is mapping user-facing metrics like TTFT, TPOT, p95, and p99 to concrete infrastructure mechanisms: batching, scheduling, GPU memory management, KV cache reuse, and backpressure. OpenAI cares because small serving inefficiencies compound dramatically at scale: wasted GPU memory, poor batching, or bad admission control can turn into large cost increases and degraded user experience. A strong Software Engineer answer should stay systems-focused: throughput, latency, fault tolerance, capacity planning, observability, and tradeoffs.
Core knowledge
-
Autoregressive inference has two phases: prefill, where the prompt tokens are processed in parallel, and decode, where new tokens are generated one step at a time. Prefill is compute-heavy; decode is often memory-bandwidth and scheduling-heavy because every request advances token-by-token.
-
Time to first token (
TTFT) is dominated by queueing plus prefill; time per output token (TPOT) is dominated by decode efficiency. A good serving design tracks both separately, because optimizing batching may improve throughput while hurtingTTFTfor interactive users. -
Static batching waits to collect requests before running a batch, improving GPU utilization but adding queueing delay. Continuous batching admits and removes requests at token boundaries, allowing long and short generations to coexist without waiting for the longest sequence in a batch.
-
KV cache stores attention keys and values from prior tokens so the model does not recompute the full prefix at every decode step. Approximate memory is:
whereLis layers,H_kvKV heads,Dhead dimension, andTcached tokens. -
KV cache memory, not raw FLOPs, often limits serving concurrency. Long contexts and many simultaneous requests can exhaust GPU memory even when compute is available. Production systems need eviction, admission control, paging, or request rejection before out-of-memory failures occur.
-
PagedAttention, popularized by
vLLM, treats KV cache like virtual memory pages instead of requiring contiguous allocation. This reduces fragmentation, improves cache sharing, and allows higher batch concurrency for workloads with variable prompt and generation lengths. -
Prefix caching reuses KV cache entries for shared prompts, system messages, retrieval boilerplate, or repeated conversation prefixes. It is most useful when exact prefixes recur; it introduces correctness concerns around cache keys, tokenizer versioning, model versioning, and tenant isolation.
-
Speculative decoding uses a smaller draft model or heuristic proposer to generate candidate tokens, then verifies them with the target model. From a serving perspective, it trades extra compute and implementation complexity for lower latency when acceptance rates are high.
-
Scheduling policy matters. First-come-first-served is simple but can let long prompts block short ones. Alternatives include priority queues, max-token budgets, shortest-remaining-work approximations, per-tenant fairness, and separate pools for interactive versus batch workloads.
-
Admission control should be based on estimated token cost, not request count alone. A single 128k-token prompt can consume more memory and compute than hundreds of short requests. Practical systems estimate prompt tokens, max output tokens, cache footprint, and current GPU pressure before admitting work.
-
Tail latency is amplified by stragglers, oversized prompts, overloaded GPUs, cache misses, and retries. Track
p50,p95, andp99for queue time, prefill time, decode step time, tokens/sec, OOM count, batch size, and KV cache utilization separately. -
Failure handling should distinguish retryable infrastructure failures from deterministic capacity failures. Retrying an OOM request on the same saturated worker can create a retry storm; better options include routing to a larger-memory pool, reducing max tokens, or returning a clear capacity error.
Worked example
For “Design an LLM inference serving system”, a strong candidate would first clarify the workload: interactive chat or offline batch, expected QPS, average and max prompt length, output length distribution, latency SLOs, model size, GPU type, and multi-tenant requirements. They would state assumptions, for example: “I’ll optimize for interactive traffic with low TTFT, bounded p99, streaming responses, and high GPU utilization.” The answer should be organized around four pillars: request routing and admission control, GPU worker execution with continuous batching, KV cache memory management, and observability/failure handling.
The serving path could be: an API gateway authenticates and rate-limits, a scheduler estimates token cost, requests are routed to a worker pool with compatible model replicas, and each worker uses continuous batching across prefill and decode. The candidate should explicitly call out the prefill/decode distinction: prefill can be batched by prompt length buckets, while decode benefits from token-level scheduling and fast cache access. A key design decision is whether to prioritize throughput or latency: larger batches improve GPU utilization but can increase TTFT, so the system may use separate pools or different batching windows for interactive versus batch traffic. They should discuss KV cache allocation as a first-class resource, using paged allocation or eviction rather than assuming memory is infinite. A good close would be: “If I had more time, I’d add model-version rollout, prefix-cache correctness, cross-region routing, and a load-testing plan based on synthetic token distributions.”
A second angle
For “How would you reduce latency and cost for an existing LLM API?”, the framing shifts from greenfield design to diagnosis and incremental optimization. Start with metrics decomposition: queue time, prefill time, decode time, average batch size, GPU utilization, KV cache hit rate, and OOM/retry rate. If queue time is high but GPU utilization is low, batching or scheduling is likely inefficient; if GPU memory is saturated, KV cache pressure or long-context admission may be the bottleneck. The same concepts apply, but the best answer is more empirical: propose instrumentation first, then targeted changes like continuous batching, prefix caching, separate traffic classes, or max-token-aware admission control. The tradeoff to emphasize is that cost improvements often come from higher utilization, while latency improvements may require reserving headroom.
Common pitfalls
Pitfall: Treating batching as “always good.”
Batching improves throughput only up to the point where queueing delay, padding waste, or decode-step contention hurts user-visible latency. A stronger answer distinguishes static batching from continuous batching and ties the decision to TTFT, TPOT, and traffic class.
Pitfall: Ignoring KV cache as a capacity constraint.
A tempting but incomplete answer focuses on GPUs, replicas, and load balancers while assuming requests are just CPU-style jobs. In LLM serving, cache memory scales with active tokens, so admission control must account for prompt length, output cap, model dimensions, precision, and fragmentation.
Pitfall: Over-indexing on model-level techniques.
It is fine to mention speculative decoding or quantization, but a Software Engineer interview usually wants serving-system reasoning. Spend more time on scheduling, backpressure, cache allocation, observability, failure modes, and rollout safety than on architecture details of transformers.
Connections
This topic often pivots into distributed rate limiting, load balancing, GPU resource scheduling, streaming APIs, and tail-latency debugging. Interviewers may also connect it to caching correctness, capacity planning, multi-tenant fairness, or designing a resilient gRPC/HTTP service around long-lived streaming responses.
Further reading
-
vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention — foundational paper for paged KV cache management and high-throughput continuous batching.
-
Orca: A Distributed Serving System for Transformer-Based Generative Models — introduces iteration-level scheduling for efficient generative model serving.
-
FlashAttention: Fast and Memory-Efficient Exact Attention — useful background on attention memory bottlenecks, though lower-level than most SWE system design discussions.
Related concepts
- LLM Serving, Inference Scaling, KV Cache, and Latency-Cost Tradeoffs
- LLM Inference Optimization And KV CacheSoftware Engineering Fundamentals
- LLM Chat Applications, RAG, And ML EvaluationML System Design
- LLM Architecture, Tuning, And EvaluationMachine Learning
- LLM Product Understanding, RAG, And Fine-Tuning
- LLM Decoding, Sampling, And Logit Controls