# Design a high-concurrency LLM inference service
Company: Anthropic
Role: Software Engineer
Category: ML System Design
Difficulty: hard
Interview Round: Onsite
You are designing an LLM inference platform that serves interactive user requests (chat/completions) on GPUs.
## Goals
- Support **high concurrency** with predictable **tail latency** (p95/p99) while maintaining good **throughput**.
- Optimize **GPU utilization** under real constraints: limited GPU memory, compute saturation, multi-tenant workloads.
- Support **streaming** token output (server-sent events / websockets) and **non-streaming** responses.
- Support **multiple models and multiple versions** of the same model (A/B, canary, rollback).
- Handle **cold start** and **hot model** lifecycle management.
- Be **cost-aware** (e.g., $/token) and able to trade off latency vs cost.
## Must-discuss topics
1. Sketch the **end-to-end inference pipeline**, explicitly separating **prefill** vs **decode** phases (a minimal loop sketch follows this list).
2. Explain what the **KV cache** is, what problem it solves, and its impact on memory/latency (a sizing sketch follows this list).
3. Batching strategy:
   - **Static batching** vs **dynamic batching** vs **micro-batching**.
   - What goes wrong when batch size is too large vs too small (see the continuous-batching sketch after this list).
4. Scheduling under mixed request sizes:
   - Long context vs short context; how that affects latency and GPU memory.
   - How to prevent **tail latency** explosions and head-of-line blocking (see the chunked-prefill sketch after this list).
5. Request management:
   - When/how to **split and merge** requests (e.g., chunking long prompts, speculative approaches; see the speculative-decoding sketch after this list).
6. Multi-model/version routing:
   - How requests get routed to the right model/version.
   - Rollout/rollback and warmup considerations (see the routing sketch after this list).
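## Reference sketches
The sketches below are illustrative, not a specific engine's implementation; any API names (e.g., `model.prefill`) are hypothetical stand-ins.

For topic 1, a minimal sketch of the two-phase loop. The split matters because prefill is compute-bound (it processes the whole prompt at once and sets time-to-first-token), while decode is memory-bandwidth-bound (one token per step, reusing the KV cache) and sets inter-token latency.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Request:
    prompt_tokens: list[int]
    max_new_tokens: int
    output_tokens: list[int] = field(default_factory=list)
    kv_cache: Any = None  # populated during prefill, reused by every decode step


def run_request(model: Any, req: Request, eos_id: int) -> list[int]:
    # Prefill: one compute-bound pass over the whole prompt. It builds the
    # KV cache and yields the first token; it dominates time-to-first-token.
    first_token, req.kv_cache = model.prefill(req.prompt_tokens)
    req.output_tokens.append(first_token)

    # Decode: one memory-bandwidth-bound step per token, reusing (and growing)
    # the KV cache; it dominates inter-token latency.
    while len(req.output_tokens) < req.max_new_tokens:
        if req.output_tokens[-1] == eos_id:
            break
        next_token, req.kv_cache = model.decode_step(req.output_tokens[-1], req.kv_cache)
        req.output_tokens.append(next_token)
    return req.output_tokens
```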
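For topic 2, the KV cache stores each layer's attention keys and values so decode steps avoid recomputing attention over the entire prefix; the price is memory that grows linearly with context length and batch size. A back-of-envelope sizing helper (the config below is illustrative, not a specific model):

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, dtype_bytes: int = 2) -> int:
    """KV-cache size for one sequence: keys AND values (factor 2), per layer,
    per KV head, per token, at dtype_bytes per element (2 for fp16)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes


# Illustrative GQA-style config: 80 layers, 8 KV heads, head_dim 128, fp16.
per_token = kv_cache_bytes(80, 8, 128, seq_len=1)       # 327,680 B = 320 KiB/token
per_seq = kv_cache_bytes(80, 8, 128, seq_len=32_768)    # 10 GiB at 32k context
print(f"{per_token / 2**10:.0f} KiB/token, {per_seq / 2**30:.1f} GiB per 32k sequence")
```

At roughly 10 GiB per 32k-token sequence in fp16, KV memory rather than weights usually caps concurrency, which motivates paged/block allocators and quantized caches.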
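For topic 3, static batching assembles a fixed batch up front (good throughput, poor tail latency), while continuous, iteration-level batching (in the spirit of Orca and vLLM) admits and retires requests between decode steps. A sketch of one admission tick, assuming requests expose `prompt_len` and `max_new_tokens`:

```python
from collections import deque


def admit(running: list, waiting: deque, max_batch: int, kv_free_tokens: int) -> int:
    """One scheduler tick of continuous batching: finished requests were already
    removed from `running`; admit waiting requests while batch slots and
    KV-cache memory allow, so new work joins between decode steps instead of
    waiting for the whole batch to drain. Returns the remaining KV budget."""
    while waiting and len(running) < max_batch:
        req = waiting[0]
        worst_case = req.prompt_len + req.max_new_tokens  # reserve worst-case KV
        if worst_case > kv_free_tokens:
            break  # admitting now would risk OOM or preemption; keep it queued
        waiting.popleft()
        kv_free_tokens -= worst_case
        running.append(req)
    return kv_free_tokens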
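Too large a batch inflates every request's per-step latency and risks KV-cache exhaustion; too small a batch leaves the GPU underutilized, which is why admission is bounded by both a batch cap and a memory budget.

For topic 4, one way to stop a long prompt's prefill from stalling everyone else's decode steps is chunked prefill: cap the tokens processed per step and interleave bounded prompt chunks with active decodes. A sketch, assuming waiting requests track `remaining_prompt`:

```python
def plan_step(decodes: list, prefills: list, chunk_tokens: int,
              step_budget: int) -> list[tuple]:
    """Chunked prefill: spend the per-step token budget on active decodes first
    (one token each), then on bounded prompt chunks. No single long prompt can
    monopolize a step, which keeps inter-token latency flat for everyone."""
    work = [(req, 1) for req in decodes]  # every active decode advances
    budget = step_budget - len(decodes)
    for req in prefills:
        if budget <= 0:
            break
        take = min(chunk_tokens, req.remaining_prompt, budget)
        work.append((req, take))
        budget -= take
    return work
```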
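For topic 5, speculative decoding is one merge-style optimization: a cheap draft model proposes several tokens and the target model verifies them in a single batched pass, so accepted tokens cost roughly one target step each round. The greedy variant below is a sketch (production systems use rejection sampling to preserve the target distribution exactly); `draft.propose` and `target.greedy_tokens` are hypothetical APIs.

```python
def speculative_step(draft, target, ctx: list[int], k: int) -> list[int]:
    """One greedy speculative-decoding round. `draft.propose(ctx, k)` returns k
    cheap candidate tokens; `target.greedy_tokens(ctx, proposal)` is one batched
    target pass returning the target's greedy choice at each of the k+1
    positions."""
    proposal = draft.propose(ctx, k)
    verified = target.greedy_tokens(ctx, proposal)  # length k + 1
    accepted: list[int] = []
    for i in range(k):
        if proposal[i] == verified[i]:
            accepted.append(proposal[i])   # target agrees: token is nearly free
        else:
            accepted.append(verified[i])   # first disagreement: correct and stop
            break
    else:
        accepted.append(verified[k])       # all k accepted: one bonus token
    return accepted
```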
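For topic 6, routing can be a weighted table consulted at the gateway; hashing a stable id keeps a conversation pinned to one version during a canary. The table and names below are illustrative:

```python
import hashlib

# Illustrative routing table: model -> [(version, traffic weight)], weights sum to 1.
ROUTES = {"assistant-large": [("v3", 0.95), ("v4-canary", 0.05)]}


def pick_version(model: str, session_id: str) -> str:
    """Weighted, sticky version selection: hashing a stable session id pins a
    conversation to one version, so a canary never mixes versions mid-chat."""
    digest = hashlib.sha256(session_id.encode()).digest()
    point = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    cumulative = 0.0
    for version, weight in ROUTES[model]:
        cumulative += weight
        if point < cumulative:
            return version
    return ROUTES[model][-1][0]  # float-rounding guard
```

Rollback then reduces to setting the canary weight to zero, which is why the previous version should stay warm (weights resident, caches primed) until the canary has proven itself.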
## Deliverables
- A proposed architecture (components and responsibilities).
- Key algorithms/policies for batching and scheduling.
- Observability: the metrics you’d track and how you’d debug performance regressions (an illustrative metric set follows this list).
- Clear tradeoffs and failure/overload behavior.
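As a starting point for the observability deliverable, an illustrative metric set (names are hypothetical; a real deployment would export these as Prometheus-style histograms and counters and alert on p95/p99 of the latency metrics):

```python
METRICS = {
    # Latency, split by phase:
    "ttft_seconds":          "time to first token (queueing + prefill)",
    "inter_token_seconds":   "gap between streamed tokens (decode health)",
    "e2e_seconds":           "end-to-end request latency",
    # Saturation signals that usually explain a regression:
    "queue_depth":           "requests waiting for admission",
    "step_batch_tokens":     "tokens processed per scheduler step",
    "kv_cache_utilization":  "fraction of KV-cache memory in use",
    "preemptions_total":     "requests evicted or restarted under memory pressure",
    # Cost:
    "tokens_per_gpu_second": "throughput basis for $/token accounting",
}
```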
Quick Answer: This question evaluates a candidate's ability to design a high-concurrency LLM inference platform. It probes GPU utilization and memory management (the KV cache), batching and scheduling strategies, request splitting and merging, multi-model and multi-version routing, streaming versus non-streaming output handling, lifecycle management (cold start, hot models), and cost-versus-latency trade-offs. Common in ML System Design rounds, it tests both practical application (architecture, algorithms, scheduling policies, observability) and conceptual understanding of the tensions between latency, throughput, memory, and cost.