LLM Serving, Inference Scaling, KV Cache, and Latency-Cost Tradeoffs

What's being tested

Interviewers are probing whether you can design and reason about low-latency LLM inference systems that scale cost-effectively while preserving model correctness. Expect to demonstrate knowledge of KV cache mechanics, batching and scheduling strategies, tradeoffs between throughput and tail latency (e.g., `p99`), and practical choices around hardware, quantization, and sharding. For a ML Engineer role, the focus is on deployment/infra decisions, instrumentation, and measurable mitigations—NOT on developing new model architectures.

Core knowledge

KV cache internals: For Transformers, the cache stores per-token keys and values for each layer; size ≈ seq_len * num_layers * hidden_size * 2 * bytes_per_float. If hidden_size = H, layers = L, bytes = b, cached tokens = S, then KV_bytes ≈ S * L * H * 2 * b.
Why KV cache matters: Recomputing past tokens is O(S * model_compute); using a cache turns autoregressive decoding from re-evaluating prefix to a single forward pass per new token, saving compute and reducing latency for multi-token generation.
Batching & latency tradeoffs: Larger batches increase GPU utilization (throughput) but add queueing latency. Define batching policy by SLO: choose max_batch and max_wait (milliseconds). Effective throughput ∝ batch_size until kernel memory/compute bottleneck.
Tail latency (`p99`) sources: queueing (batcher), kernel startup (`CUDA` launches), memory swapping (cold KV evictions), and network serialization. Optimizations target each source differently.
Memory vs compute tradeoffs: Offloading KV cache to CPU/NVMe reduces GPU memory pressure but adds network/PCIe latency; warm vs cold cache has large `p99` effects. Consider compressed KV (quantized, fp16) to save memory.
Quantization & kernel support: `int8` and 4-bit quantization reduce memory and often increase throughput, but can increase latency for short requests if specialized kernels unavailable. Profile accuracy vs latency for your workload.
Sharding strategies: data-parallel (replicas) for throughput, tensor/pipeline-sharding to host very large models across GPUs. Sharding reduces per-GPU KV requirements but increases cross-GPU communication for each token.
Scheduling algorithms: static micro-batching (fixed interval), adaptive batching (adjusts with load), and priority preemption (short jobs interrupt long jobs). Use latency SLO and percentile SLIs to drive choice.
Cost-per-token estimation: approximate $\text{cost/token} \approx \frac{\text{GPU\_hour\_cost}}{\text{tokens\_generated\_per\_hour}}.$ Throughput improvements or mixed-instance routing decrease this value.
Autoscaling & warm pools: cold-starts are costly; maintain warm GPUs or provision small warm pools for fast path. Balance hourly cost vs request latency SLOs.
Observability metrics to instrument: `p50/p90/p99` latency, batch sizes, queue depth, GPU utilization, KV cache hit/miss ratio, memory pressure, and tail-kernel times.
Client vs server KV cache: client-side cache reduces server memory and network I/O but increases client complexity and potential staleness; server-side cache simplifies correctness and multi-device coordination.

Worked example — "Design an LLM inference serving architecture with KV cache and low `p99` for streaming chat"

Frame the problem: ask for workload characteristics — QPS, tokens per request distribution, latency SLO (e.g., `p99 < 200ms` per token), model size, budget, multi-tenancy constraints. Declare assumptions (steady-state load, model fits on single GPU or needs sharding). Organize the answer into three pillars: (1) request path and router (short vs long request routing); (2) batching & scheduler (adaptive micro-batching with max_wait tuned to SLO); (3) KV cache management (per-session server-side cache in GPU memory with eviction policy and CPU-backed overflow). Call out a key tradeoff: storing full KV on GPU ensures minimal per-token latency but increases cost due to fewer sessions per-GPU; offloading to CPU/NVMe lowers cost but adds ~ms of latency—choose based on `p99` budget. Describe mitigation steps: enable `FP16` KV storage, use dynamic batching with deadline-aware scheduling, and maintain warm pools to avoid cold-start kernel overhead. Close by saying: if more time, I'd prototype with representative traces, benchmark `p99` with and without cache offload, and add priority queuing for short interactive sessions.

A second angle — "Scale inference for massive concurrent short requests (1–2 tokens each)"

The same concepts shift: per-request token counts are tiny so kernel launch overhead, context-switching, and queueing dominate. Emphasize low-latency single-token kernels: use optimized runtimes (`Triton`/`TensorRT`) and consider CPU-optimized quantized models for ultra-short requests to avoid GPU scheduling overhead. Use extremely small max_wait and favor immediate dispatching or tiny micro-batches; implement smart routing that sends short-low-cost requests to cheaper instances and long-generation streams to GPU-backed, KV-cached instances. Here, client-side caching (reusing session context tokens on client and sending only deltas) can drastically reduce server load. Also consider model distillation or smaller specialist models for classification/short answers to reduce cost while meeting latency SLOs.

Common pitfalls

Pitfall: Underestimating KV memory. People forget the factor of num_layers and double for keys+values; consequence is running out of GPU memory under realistic session counts.

Pitfall: Treating throughput as the only metric. Optimizing for tokens/sec without measuring `p99` latency will break interactive experiences—always present both metrics.

Pitfall: Ignoring kernel and PCIe latency for offloaded KV. Saying “offload to CPU” without quantifying additional ms per token is a shallow fix; measure end-to-end.

Connections

Adjacent interviewer pivots may include model parallelism & checkpointing (how sharding interacts with KV placement), autoscaling/ SRE practices (cost vs SLA ops playbooks), and model compression or distillation (reducing hardware needs vs fidelity).