LLM Serving, Inference Scaling, KV Cache, and Latency-Cost Tradeoffs
Asked of: ML Engineer
What's being tested
Interviewers are probing whether you can design and reason about low-latency LLM inference systems that scale cost-effectively while preserving model correctness. Expect to demonstrate knowledge of KV cache mechanics, batching and scheduling strategies, tradeoffs between throughput and tail latency (e.g., `p99`), and practical choices around hardware, quantization, and sharding. For an ML Engineer role, the focus is on deployment and infrastructure decisions, instrumentation, and measurable mitigations, not on developing new model architectures.
Core knowledge
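- KV cache internals: for Transformers, the cache stores per-token keys and values for each layer. With hidden_size = H, layers = L, bytes per value = b, and cached tokens = S, KV_bytes ≈ S * L * H * 2 * b per sequence, where the factor of 2 covers keys plus values. This assumes standard multi-head attention; grouped-query or multi-query attention stores fewer KV heads and shrinks the cache proportionally.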
- Why KV cache matters: without a cache, every decode step re-runs the model over the entire S-token prefix, so per-step cost grows with S. With a cache, each step runs a forward pass over only the new token and attends to the stored keys and values, which saves compute and reduces latency for multi-token generation.
- Batching & latency tradeoffs: larger batches increase GPU utilization (throughput) but add queueing latency. Define the batching policy from the SLO: choose max_batch and max_wait (milliseconds). Throughput grows roughly in proportion to batch size until the GPU becomes memory-bound or compute-bound.
- Tail latency (`p99`) sources: queueing (batcher), kernel startup (`CUDA` launches), memory swapping (cold KV evictions), and network serialization. Each source calls for a different optimization.
- Memory vs compute tradeoffs: offloading KV cache to CPU/NVMe reduces GPU memory pressure but adds PCIe or storage transfer latency; warm vs cold cache has large `p99` effects. Consider a compact KV format (fp16 or quantized) to save memory.
- Quantization & kernel support: `int8` and 4-bit quantization reduce memory and often increase throughput, but can increase latency for short requests if specialized kernels are unavailable. Profile accuracy vs latency for your workload.
- Sharding strategies: data-parallel (replicas) for throughput; tensor or pipeline sharding to host very large models across GPUs. Sharding reduces per-GPU KV requirements but increases cross-GPU communication for each token.
- Scheduling algorithms: static micro-batching (fixed interval), adaptive batching (adjusts with load), and priority preemption (short jobs interrupt long jobs). Let the latency SLO and percentile SLIs drive the choice.
- Cost-per-token estimation: approximate cost per token ≈ instance cost per hour / (tokens per second * 3600). Throughput improvements or mixed-instance routing decrease this value.
- Autoscaling & warm pools: cold starts are costly; keep GPUs warm or provision a small warm pool for the fast path. Balance hourly cost against request latency SLOs.
- Observability metrics to instrument: `p50/p90/p99` latency, batch sizes, queue depth, GPU utilization, KV cache hit/miss ratio, memory pressure, and tail kernel execution times.
- Client vs server KV cache: client-side cache reduces server memory and network I/O but increases client complexity and potential staleness; server-side cache simplifies correctness and multi-device coordination.
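As a rough sizing aid for the KV cache formula in the list above, the sketch below computes per-sequence cache bytes and how many sessions fit on one GPU. The function names, the 7B-class shape (32 layers, hidden size 4096), the 80 GiB GPU, and the 14 GiB weight footprint are illustrative assumptions, and the estimate ignores activations, fragmentation, and attention variants that store fewer KV heads.

```python
GIB = 1024 ** 3


def kv_cache_bytes(seq_len: int, num_layers: int, hidden_size: int,
                   bytes_per_value: int = 2) -> int:
    """KV_bytes ~= S * L * H * 2 * b for one sequence (2 = keys + values)."""
    return seq_len * num_layers * hidden_size * 2 * bytes_per_value


def max_sessions(gpu_mem_bytes: int, weight_bytes: int,
                 per_session_bytes: int) -> int:
    """Upper bound on concurrent sessions once weights are resident."""
    free = max(gpu_mem_bytes - weight_bytes, 0)
    return free // per_session_bytes


# Hypothetical 7B-class model, fp16 KV (2 bytes per value), 4096 cached tokens.
per_session = kv_cache_bytes(seq_len=4096, num_layers=32, hidden_size=4096)
print(per_session / GIB)                                  # GiB per session
print(max_sessions(80 * GIB, 14 * GIB, per_session))      # sessions per GPU
```

Under these assumptions each full-length session costs 2 GiB, so a single 80 GiB GPU holds about 33 of them; this is the kind of number worth stating out loud before proposing offload or quantized KV.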
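One common way to approximate cost per token is to divide the hourly instance price by the tokens generated in an hour. The sketch below assumes that approximation; the function name, the $4.00/hour price, and the throughput figures are hypothetical, and it ignores idle capacity and non-GPU costs.

```python
def cost_per_million_tokens(instance_cost_per_hour: float,
                            tokens_per_second: float) -> float:
    """Approximate serving cost per one million generated tokens."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return instance_cost_per_hour / tokens_per_hour * 1_000_000


# Hypothetical instance: $4.00/hour at 2,000 tokens/s, then with doubled throughput.
baseline = cost_per_million_tokens(4.00, 2_000)
improved = cost_per_million_tokens(4.00, 4_000)
print(f"baseline ${baseline:.3f}/M tokens, improved ${improved:.3f}/M tokens")
```

Doubling sustained throughput at the same hourly price halves the cost per token, which is why batching and routing decisions show up directly in the budget.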
Worked example — "Design an LLM inference serving architecture with KV cache and low `p99` for streaming chat"
Frame the problem: ask for workload characteristics such as QPS, the distribution of tokens per request, the latency SLO (e.g., `p99 < 200ms` per token), model size, budget, and multi-tenancy constraints. Declare assumptions (steady-state load; whether the model fits on a single GPU or needs sharding).
Organize the answer into three pillars: (1) request path and router (short vs long request routing); (2) batching & scheduler (adaptive micro-batching with max_wait tuned to the SLO); (3) KV cache management (per-session server-side cache in GPU memory with an eviction policy and CPU-backed overflow).
Call out a key tradeoff: keeping the full KV cache on GPU minimizes per-token latency but raises cost because fewer sessions fit per GPU; offloading to CPU/NVMe lowers cost but adds milliseconds of latency, so choose based on the `p99` budget. Describe mitigation steps: enable `FP16` KV storage, use dynamic batching with deadline-aware scheduling, and maintain warm pools to avoid cold-start kernel overhead. Close by saying: with more time, I'd prototype with representative traces, benchmark `p99` with and without cache offload, and add priority queuing for short interactive sessions.
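The batching pillar can be illustrated with a minimal dynamic batcher that dispatches when either max_batch requests are collected or max_wait has elapsed. This is a single-threaded sketch using only the standard library; `collect_batch` and its parameters are hypothetical names, not the API of any serving framework, and a production scheduler would also account for per-request deadlines.

```python
import queue
import time
from typing import Any, List


def collect_batch(requests: "queue.Queue[Any]", max_batch: int,
                  max_wait_s: float) -> List[Any]:
    """Block for one request, then gather more until the batch is full
    or max_wait_s has passed since the first request was dequeued."""
    batch = [requests.get()]                      # wait for the first request
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break                                 # waited long enough: dispatch
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break                                 # nothing else arrived in time
    return batch


q: "queue.Queue[str]" = queue.Queue()
for i in range(5):
    q.put(f"req-{i}")

print(collect_batch(q, max_batch=4, max_wait_s=0.005))  # fills to max_batch
print(collect_batch(q, max_batch=4, max_wait_s=0.005))  # partial batch after the wait
```

Here max_wait_s bounds the queueing delay a request can pick up from batching, so it should be a small fraction of the per-token latency budget.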
A second angle — "Scale inference for massive concurrent short requests (1–2 tokens each)"
The same concepts shift: per-request token counts are tiny, so kernel launch overhead, context switching, and queueing dominate. Emphasize low-latency single-token kernels: use optimized runtimes (`Triton`/`TensorRT`) and consider CPU-optimized quantized models for ultra-short requests to avoid GPU scheduling overhead. Use an extremely small max_wait and favor immediate dispatch or tiny micro-batches; implement smart routing that sends short, low-cost requests to cheaper instances and long-generation streams to GPU-backed, KV-cached instances. Here, client-side caching (reusing session context tokens on the client and sending only deltas) can drastically reduce server load. Also consider model distillation or smaller specialist models for classification and short answers to reduce cost while meeting latency SLOs.
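The routing idea above can be sketched as a simple length-based rule. The `Request` fields, the pool names, and the 2-token threshold (taken from the 1–2 token workload in the question) are illustrative assumptions; a real router would also weigh load, tenancy, and observed latency.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int
    max_new_tokens: int
    streaming: bool = False


def route(req: Request, short_token_limit: int = 2) -> str:
    """Pick a (hypothetical) serving pool from the requested output length."""
    if req.streaming or req.max_new_tokens > short_token_limit:
        # Long or streaming generation benefits from a resident KV cache.
        return "gpu_kv_cached"
    # Ultra-short requests go to the cheaper, immediately dispatched path.
    return "low_cost_fast_path"


print(route(Request(prompt_tokens=20, max_new_tokens=1)))
print(route(Request(prompt_tokens=20, max_new_tokens=256, streaming=True)))
```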
Common pitfalls
Pitfall: Underestimating KV memory. People forget the num_layers factor and the doubling for keys plus values; the consequence is running out of GPU memory under realistic session counts.
Pitfall: Treating throughput as the only metric. Optimizing for tokens/sec without measuring `p99` latency will break interactive experiences; always present both metrics.
Pitfall: Ignoring kernel and PCIe latency for offloaded KV. Saying “offload to CPU” without quantifying the additional milliseconds per token is a shallow answer; measure end to end.
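To put a number on the offload pitfall, a back-of-the-envelope estimate divides the bytes to move by the link bandwidth, and measured latencies can then be summarized at `p99`. The helper names, the 2 GiB session size, and the 25 GB/s effective host-to-GPU bandwidth are assumptions for illustration; real transfers should be measured on the target hardware.

```python
import statistics
from typing import Sequence


def kv_transfer_ms(kv_bytes: int, bandwidth_bytes_per_s: float,
                   fixed_overhead_ms: float = 0.0) -> float:
    """Estimated time to move kv_bytes across a link, in milliseconds."""
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return kv_bytes / bandwidth_bytes_per_s * 1000.0 + fixed_overhead_ms


def p99(samples_ms: Sequence[float]) -> float:
    """99th percentile of measured latencies (needs at least two samples)."""
    return statistics.quantiles(samples_ms, n=100)[98]


# Assumed: restoring a cold 2 GiB session at 25 GB/s effective bandwidth.
restore_ms = kv_transfer_ms(2 * 1024 ** 3, 25e9)
print(f"cold restore of one session: {restore_ms:.1f} ms")
```

With these assumed numbers a cold restore takes roughly 86 ms, a large share of a 200 ms `p99` budget, which is why warm vs cold cache behavior needs to be benchmarked rather than asserted.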
Connections
Adjacent interviewer pivots may include model parallelism and checkpointing (how sharding interacts with KV placement), autoscaling and SRE practices (operational playbooks for cost vs SLA), and model compression or distillation (reducing hardware needs vs fidelity).
Further reading
- NVIDIA Triton Inference Server: practical deployment patterns and batching features.
- FlashAttention (paper): efficient attention kernels that reduce memory and improve throughput for many implementations.
Related concepts
- LLM Inference Serving, Batching, And KV Cache
- LLM Inference Optimization And KV Cache (Software Engineering Fundamentals)
- LLM Chat Applications, RAG, And ML Evaluation (ML System Design)
- LLM Eval Data Slicing and Debugging
- ML Evaluation, Uncertainty, And Safety Guardrails (ML System Design)
- LLM Architecture, Tuning, And Evaluation (Machine Learning)