This question evaluates expertise in designing scalable machine-learning inference systems, covering chat-completion serving architecture, GPU capacity planning for large transformer models, and stateful KV-cache design, including layout, latency, and consistency considerations.
Design a ChatGPT-like system for inference serving.
Your design discussion should cover:
- The end-to-end chat-completion serving architecture
- GPU capacity planning for large transformer models
- Stateful KV-cache design, including layout, latency, and consistency considerations
Assume modern datacenter GPUs (e.g., 80GB class) and high-throughput networking. State any assumptions you make (context length, throughput targets, replication, etc.).
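A good answer to the capacity-planning portion usually starts with a back-of-envelope KV-cache sizing estimate. The sketch below shows one such calculation; the model parameters (80 layers, 8 grouped-query KV heads, head dimension 128, fp16 activations, 8K context) are illustrative assumptions, not values fixed by the question.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int = 2) -> int:
    """KV-cache bytes stored per generated token.

    Factor of 2 accounts for both the K and V tensors at each layer.
    dtype_bytes=2 assumes fp16/bf16 storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Hypothetical 70B-class model with grouped-query attention (all values assumed):
per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
per_sequence = per_token * 8192  # assumed 8K-token context length

print(f"{per_token} bytes/token")            # 327680 bytes/token (~320 KiB)
print(f"{per_sequence / 2**30:.1f} GiB/seq") # 2.5 GiB per full-context sequence
```

Under these assumptions, a full-context sequence consumes about 2.5 GiB of cache, so after subtracting model weights from an 80 GB GPU, only a modest number of concurrent full-length sequences fit per device; this motivates the batching, paging, and cache-eviction discussion the question asks for.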