You are designing an LLM inference platform that serves interactive user requests (chat/completions) on GPUs.
Goals
- Support **high concurrency** with predictable **tail latency** (p95/p99) while maintaining good **throughput**.
- Optimize **GPU utilization** under real constraints: limited GPU memory, compute saturation, multi-tenant workloads.
- Support **streaming** token output (server-sent events / websockets) and **non-streaming** responses.
- Support **multiple models and multiple versions** of the same model (A/B, canary, rollback).
- Handle **cold start** and **hot model** lifecycle management.
- Be **cost-aware** (e.g., $/token) and able to trade off latency vs cost (see the sketch after this list).
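For the cost-awareness goal, a back-of-envelope sketch of how $/token falls out of GPU price and throughput; both input numbers below are assumptions, not measurements.

```python
# Back-of-envelope $/token estimate; every number here is a hypothetical assumption.
GPU_HOURLY_COST_USD = 2.50           # assumed on-demand price for one GPU
TOKENS_PER_SECOND_PER_GPU = 2_000    # assumed aggregate decode throughput at the target batch size

tokens_per_hour = TOKENS_PER_SECOND_PER_GPU * 3600
cost_per_million_tokens = GPU_HOURLY_COST_USD / tokens_per_hour * 1_000_000
print(f"~${cost_per_million_tokens:.2f} per 1M generated tokens")   # ~$0.35 with these numbers

# Larger batches raise tokens/s (lowering $/token) but also raise per-token latency,
# which is exactly the latency-vs-cost tradeoff this goal asks you to expose as a knob.
```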
Must-discuss topics
- Sketch the **end-to-end inference pipeline**, explicitly separating **prefill** vs **decode** phases (pipeline sketch after this list).
- Explain what the **KV cache** is, what problem it solves, and its impact on memory/latency (memory estimate after this list).
- Batching strategy (batching sketch after this list):
  - **Static batching** vs **dynamic batching** vs **micro-batching**.
  - What goes wrong when batch size is too large vs too small.
- Scheduling under mixed request sizes (queueing sketch after this list):
  - Long context vs short context; how that affects latency and GPU memory.
  - How to prevent **tail latency** explosions and head-of-line blocking.
- Request management (chunking sketch after this list):
  - When/how to **split and merge** requests (e.g., chunking long prompts, speculative approaches).
- Multi-model/version routing (routing sketch after this list):
  - How requests get routed to the right model/version.
  - Rollout/rollback and warmup considerations.
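For the pipeline topic, a minimal sketch of the prefill/decode split inside the generation loop; `prefill` and `decode_step` are hypothetical stand-ins for the real model forward passes, and the placeholder return values only mark where logits and the KV cache would flow.

```python
def prefill(prompt_tokens: list[int]) -> tuple[list[float], dict]:
    """Process the whole prompt in one parallel forward pass, building the KV cache.
    Compute-bound; its cost dominates time-to-first-token."""
    kv_cache = {"cached_positions": len(prompt_tokens)}   # placeholder for real K/V tensors
    last_logits = [0.0]                                    # placeholder logits
    return last_logits, kv_cache

def decode_step(token: int, kv_cache: dict) -> tuple[list[float], dict]:
    """Process exactly one new token, attending over the cached K/V.
    Memory-bandwidth-bound; its cost sets the steady inter-token latency."""
    kv_cache["cached_positions"] += 1
    return [0.0], kv_cache

def generate(prompt_tokens: list[int], max_new_tokens: int) -> list[int]:
    logits, kv = prefill(prompt_tokens)        # phase 1: prefill
    output: list[int] = []
    next_token = 0                             # placeholder for sampling from logits
    for _ in range(max_new_tokens):            # phase 2: autoregressive decode
        output.append(next_token)
        logits, kv = decode_step(next_token, kv)
        next_token = 0                         # placeholder sample; streaming would emit here
    return output
```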
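For the KV-cache topic, a small memory-footprint calculator. The 7B-class configuration in the example (32 layers, 32 KV heads, head dim 128, fp16) is illustrative only; models with grouped-query attention have far fewer KV heads and a proportionally smaller cache.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_elem: int = 2) -> int:
    """Total KV-cache size: keys + values for every layer, head, and position."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem   # 2 = K and V
    return per_token * seq_len * batch_size

# Illustrative 7B-class config: 32 layers, 32 KV heads, head_dim 128, fp16 (2 bytes).
per_seq = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=1)
print(per_seq / 2**30, "GiB per 4k-token sequence")   # ~2 GiB
# At batch size 16 that is ~32 GiB of KV cache alone, which is why paged/blocked
# cache allocation and admission control against a memory budget matter.
```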
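For the batching topic, a sketch of continuous (iteration-level) batching under a token budget; the budget, FIFO admission order, and the omission of the actual model step are all simplifying assumptions.

```python
import collections
from dataclasses import dataclass

@dataclass
class Request:
    prompt_len: int
    max_new_tokens: int
    generated: int = 0

class ContinuousBatcher:
    """Minimal sketch of continuous batching: new requests join the running batch
    between decode iterations instead of waiting for the whole batch to drain."""
    def __init__(self, max_batch_tokens: int):
        self.max_batch_tokens = max_batch_tokens    # KV-memory / compute budget per step
        self.waiting = collections.deque()
        self.active: list[Request] = []

    def submit(self, req: Request) -> None:
        self.waiting.append(req)

    def _batch_tokens(self) -> int:
        return sum(r.prompt_len + r.generated for r in self.active)

    def step(self) -> None:
        # Admit waiting requests while the token budget allows it.
        while self.waiting and self._batch_tokens() + self.waiting[0].prompt_len <= self.max_batch_tokens:
            self.active.append(self.waiting.popleft())
        # One decode iteration for every active request (model call omitted in this sketch).
        for r in self.active:
            r.generated += 1
        # Retire finished requests so their KV memory and batch slots free up immediately.
        self.active = [r for r in self.active if r.generated < r.max_new_tokens]
```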
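For the mixed-request-size topic, one possible admission policy that limits head-of-line blocking: order waiting requests by estimated work (prompt length here), but give waiting requests an aging credit so long prompts cannot starve. The aging rate is an assumption to tune.

```python
import time
from dataclasses import dataclass

@dataclass
class Waiting:
    request_id: str
    prompt_len: int
    enqueued_at: float

class SizeAwareQueue:
    """Sketch: pop the request with the lowest effective cost, where cost is estimated
    work minus a waiting-time credit. Short requests usually go first; long ones age up."""
    def __init__(self, aging_tokens_per_sec: float = 50.0):
        self.aging = aging_tokens_per_sec
        self.items: list[Waiting] = []

    def push(self, request_id: str, prompt_len: int) -> None:
        self.items.append(Waiting(request_id, prompt_len, time.monotonic()))

    def pop(self) -> str:
        now = time.monotonic()
        def effective_cost(w: Waiting) -> float:
            return w.prompt_len - (now - w.enqueued_at) * self.aging
        best = min(self.items, key=effective_cost)
        self.items.remove(best)
        return best.request_id
```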
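For the request-management topic, one common way to split a long prompt is chunked prefill; the 512-token chunk size is an assumption to tune against the decode latency you are protecting.

```python
def chunk_prompt(prompt_tokens: list[int], chunk_size: int = 512) -> list[list[int]]:
    """Split a long prompt into fixed-size chunks so its prefill can be interleaved
    with other requests' decode steps instead of monopolizing a scheduler iteration;
    each chunk appends to the same KV cache."""
    return [prompt_tokens[i:i + chunk_size] for i in range(0, len(prompt_tokens), chunk_size)]

# e.g. a 2,000-token prompt becomes three 512-token chunks plus a 464-token tail,
# capping how long any single prompt can stall decode for the rest of the batch.
```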
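For the routing topic, a sketch of weighted, sticky model/version selection; the version identifiers and the 95/5 canary split are assumptions, and rollback amounts to setting the canary weight back to zero.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class Version:
    name: str        # e.g. "llama-chat:v2-canary" (hypothetical identifier)
    weight: float    # fraction of traffic, 0..1

class Router:
    """Sketch of weighted, sticky routing: the same user hashes to the same version,
    so a canary rollout does not flip a conversation between model versions."""
    def __init__(self, table: dict[str, list[Version]]):
        self.table = table   # model name -> weighted list of versions

    def route(self, model: str, user_id: str) -> str:
        versions = self.table[model]
        bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 10_000 / 10_000
        cumulative = 0.0
        for v in versions:
            cumulative += v.weight
            if bucket < cumulative:
                return v.name
        return versions[-1].name   # guard against rounding in the weights

router = Router({"llama-chat": [Version("llama-chat:v1", 0.95),
                                Version("llama-chat:v2-canary", 0.05)]})
print(router.route("llama-chat", "user-123"))
```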
Deliverables
- A proposed architecture (components and responsibilities).
- Key algorithms/policies for batching and scheduling.
- Observability: the metrics you’d track and how you’d debug performance regressions (example metric set after this list).
- Clear tradeoffs and failure/overload behavior.
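For the observability deliverable, one plausible starting metric set; the names are suggestions rather than a standard, and each would typically be sliced per model, version, and tenant.

```python
# Illustrative metric set (names and types are suggestions, not a standard).
METRICS = {
    "ttft_seconds":            "histogram",  # time to first token (queueing + prefill)
    "time_per_output_token":   "histogram",  # inter-token latency during decode
    "e2e_latency_seconds":     "histogram",  # end-to-end latency for p50/p95/p99
    "queue_depth":             "gauge",      # waiting requests per model/version
    "active_batch_size":       "gauge",      # requests in the running batch
    "kv_cache_utilization":    "gauge",      # fraction of the KV memory pool in use
    "gpu_utilization":         "gauge",
    "tokens_generated_total":  "counter",
    "requests_rejected_total": "counter",    # admission control / overload shedding
    "cost_per_million_tokens": "gauge",      # derived from GPU-hours and token counters
}
```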