You are designing an LLM inference platform that serves interactive user requests (chat/completions) on GPUs.
Goals
- Support **high concurrency** with predictable **tail latency** (p95/p99) while maintaining good **throughput**.
- Optimize **GPU utilization** under real constraints: limited GPU memory, compute saturation, multi-tenant workloads.
- Support **streaming** token output (server-sent events / websockets) and **non-streaming** responses.
- Support **multiple models and multiple versions** of the same model (A/B, canary, rollback).
- Handle **cold start** and **hot model** lifecycle management.
- Be **cost-aware** (e.g., $/token) and able to trade off latency vs cost (see the sketch after this list).
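For the cost-awareness goal, a back-of-envelope sketch of how $/token falls out of GPU price and throughput; both input numbers below are assumptions, not measurements.

```python
# Back-of-envelope $/token estimate; every number here is a hypothetical assumption.
GPU_HOURLY_COST_USD = 2.50           # assumed on-demand price for one GPU
TOKENS_PER_SECOND_PER_GPU = 2_000    # assumed aggregate decode throughput at the target batch size

tokens_per_hour = TOKENS_PER_SECOND_PER_GPU * 3600
cost_per_million_tokens = GPU_HOURLY_COST_USD / tokens_per_hour * 1_000_000
print(f"~${cost_per_million_tokens:.2f} per 1M generated tokens")   # ~$0.35 with these numbers

# Larger batches raise tokens/s (lowering $/token) but also raise per-token latency,
# which is exactly the latency-vs-cost tradeoff this goal asks you to expose as a knob.
```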
Must-discuss topics
- Sketch the **end-to-end inference pipeline**, explicitly separating **prefill** vs **decode** phases (pipeline sketch after this list).
- Explain what the **KV cache** is, what problem it solves, and its impact on memory/latency (memory estimate after this list).
- Batching strategy (batching sketch after this list):
  - **Static batching** vs **dynamic batching** vs **micro-batching**.
  - What goes wrong when batch size is too large vs too small.
- Scheduling under mixed request sizes (queueing sketch after this list):
  - Long context vs short context; how that affects latency and GPU memory.
  - How to prevent **tail latency** explosions and head-of-line blocking.
- Request management (chunking sketch after this list):
  - When/how to **split and merge** requests (e.g., chunking long prompts, speculative approaches).
- Multi-model/version routing (routing sketch after this list):
  - How requests get routed to the right model/version.
  - Rollout/rollback and warmup considerations.
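For the pipeline topic, a minimal sketch of the prefill/decode split inside the generation loop; `prefill` and `decode_step` are hypothetical stand-ins for the real model forward passes, and the placeholder return values only mark where logits and the KV cache would flow.

```python
def prefill(prompt_tokens: list[int]) -> tuple[list[float], dict]:
    """Process the whole prompt in one parallel forward pass, building the KV cache.
    Compute-bound; its cost dominates time-to-first-token."""
    kv_cache = {"cached_positions": len(prompt_tokens)}   # placeholder for real K/V tensors
    last_logits = [0.0]                                    # placeholder logits
    return last_logits, kv_cache

def decode_step(token: int, kv_cache: dict) -> tuple[list[float], dict]:
    """Process exactly one new token, attending over the cached K/V.
    Memory-bandwidth-bound; its cost sets the steady inter-token latency."""
    kv_cache["cached_positions"] += 1
    return [0.0], kv_cache

def generate(prompt_tokens: list[int], max_new_tokens: int) -> list[int]:
    logits, kv = prefill(prompt_tokens)        # phase 1: prefill
    output: list[int] = []
    next_token = 0                             # placeholder for sampling from logits
    for _ in range(max_new_tokens):            # phase 2: autoregressive decode
        output.append(next_token)
        logits, kv = decode_step(next_token, kv)
        next_token = 0                         # placeholder sample; streaming would emit here
    return output
```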
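For the KV-cache topic, a small memory-footprint calculator. The 7B-class configuration in the example (32 layers, 32 KV heads, head dim 128, fp16) is illustrative only; models with grouped-query attention have far fewer KV heads and a proportionally smaller cache.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_elem: int = 2) -> int:
    """Total KV-cache size: keys + values for every layer, head, and position."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem   # 2 = K and V
    return per_token * seq_len * batch_size

# Illustrative 7B-class config: 32 layers, 32 KV heads, head_dim 128, fp16 (2 bytes).
per_seq = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=1)
print(per_seq / 2**30, "GiB per 4k-token sequence")   # ~2 GiB
# At batch size 16 that is ~32 GiB of KV cache alone, which is why paged/blocked
# cache allocation and admission control against a memory budget matter.
```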
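For the batching topic, a sketch of continuous (iteration-level) batching under a token budget; the budget, FIFO admission order, and the omission of the actual model step are all simplifying assumptions.

```python
import collections
from dataclasses import dataclass

@dataclass
class Request:
    prompt_len: int
    max_new_tokens: int
    generated: int = 0

class ContinuousBatcher:
    """Minimal sketch of continuous batching: new requests join the running batch
    between decode iterations instead of waiting for the whole batch to drain."""
    def __init__(self, max_batch_tokens: int):
        self.max_batch_tokens = max_batch_tokens    # KV-memory / compute budget per step
        self.waiting = collections.deque()
        self.active: list[Request] = []

    def submit(self, req: Request) -> None:
        self.waiting.append(req)

    def _batch_tokens(self) -> int:
        return sum(r.prompt_len + r.generated for r in self.active)

    def step(self) -> None:
        # Admit waiting requests while the token budget allows it.
        while self.waiting and self._batch_tokens() + self.waiting[0].prompt_len <= self.max_batch_tokens:
            self.active.append(self.waiting.popleft())
        # One decode iteration for every active request (model call omitted in this sketch).
        for r in self.active:
            r.generated += 1
        # Retire finished requests so their KV memory and batch slots free up immediately.
        self.active = [r for r in self.active if r.generated < r.max_new_tokens]
```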
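For the mixed-request-size topic, one possible admission policy that limits head-of-line blocking: order waiting requests by estimated work (prompt length here), but give waiting requests an aging credit so long prompts cannot starve. The aging rate is an assumption to tune.

```python
import time
from dataclasses import dataclass

@dataclass
class Waiting:
    request_id: str
    prompt_len: int
    enqueued_at: float

class SizeAwareQueue:
    """Sketch: pop the request with the lowest effective cost, where cost is estimated
    work minus a waiting-time credit. Short requests usually go first; long ones age up."""
    def __init__(self, aging_tokens_per_sec: float = 50.0):
        self.aging = aging_tokens_per_sec
        self.items: list[Waiting] = []

    def push(self, request_id: str, prompt_len: int) -> None:
        self.items.append(Waiting(request_id, prompt_len, time.monotonic()))

    def pop(self) -> str:
        now = time.monotonic()
        def effective_cost(w: Waiting) -> float:
            return w.prompt_len - (now - w.enqueued_at) * self.aging
        best = min(self.items, key=effective_cost)
        self.items.remove(best)
        return best.request_id
```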
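For the request-management topic, one common way to split a long prompt is chunked prefill; the 512-token chunk size is an assumption to tune against the decode latency you are protecting.

```python
def chunk_prompt(prompt_tokens: list[int], chunk_size: int = 512) -> list[list[int]]:
    """Split a long prompt into fixed-size chunks so its prefill can be interleaved
    with other requests' decode steps instead of monopolizing a scheduler iteration;
    each chunk appends to the same KV cache."""
    return [prompt_tokens[i:i + chunk_size] for i in range(0, len(prompt_tokens), chunk_size)]

# e.g. a 2,000-token prompt becomes three 512-token chunks plus a 464-token tail,
# capping how long any single prompt can stall decode for the rest of the batch.
```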
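For the routing topic, a sketch of weighted, sticky model/version selection; the version identifiers and the 95/5 canary split are assumptions, and rollback amounts to setting the canary weight back to zero.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class Version:
    name: str        # e.g. "llama-chat:v2-canary" (hypothetical identifier)
    weight: float    # fraction of traffic, 0..1

class Router:
    """Sketch of weighted, sticky routing: the same user hashes to the same version,
    so a canary rollout does not flip a conversation between model versions."""
    def __init__(self, table: dict[str, list[Version]]):
        self.table = table   # model name -> weighted list of versions

    def route(self, model: str, user_id: str) -> str:
        versions = self.table[model]
        bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 10_000 / 10_000
        cumulative = 0.0
        for v in versions:
            cumulative += v.weight
            if bucket < cumulative:
                return v.name
        return versions[-1].name   # guard against rounding in the weights

router = Router({"llama-chat": [Version("llama-chat:v1", 0.95),
                                Version("llama-chat:v2-canary", 0.05)]})
print(router.route("llama-chat", "user-123"))
```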
Deliverables
- A proposed architecture (components and responsibilities).
- Key algorithms/policies for batching and scheduling.
- Observability: the metrics you’d track and how you’d debug performance regressions (example metric set after this list).
- Clear tradeoffs and failure/overload behavior.
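For the observability deliverable, one plausible starting metric set; the names are suggestions rather than a standard, and each would typically be sliced per model, version, and tenant.

```python
# Illustrative metric set (names and types are suggestions, not a standard).
METRICS = {
    "ttft_seconds":            "histogram",  # time to first token (queueing + prefill)
    "time_per_output_token":   "histogram",  # inter-token latency during decode
    "e2e_latency_seconds":     "histogram",  # end-to-end latency for p50/p95/p99
    "queue_depth":             "gauge",      # waiting requests per model/version
    "active_batch_size":       "gauge",      # requests in the running batch
    "kv_cache_utilization":    "gauge",      # fraction of the KV memory pool in use
    "gpu_utilization":         "gauge",
    "tokens_generated_total":  "counter",
    "requests_rejected_total": "counter",    # admission control / overload shedding
    "cost_per_million_tokens": "gauge",      # derived from GPU-hours and token counters
}
```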