This question evaluates understanding of scalable, highly available generative AI inference platforms and associated competencies in distributed systems, ML model serving, autoscaling and GPU scheduling, global request routing, model/version management, stateful dependency handling, observability, and rate limiting.
Design a production-grade deployment for a generative AI text model (decoder-only Transformer, 7B–70B parameters) serving enterprise, multi-tenant traffic. The platform must scale elastically, remain highly available across regions, and absorb unpredictable traffic spikes.
You may make minimal, explicit assumptions to ground your design (e.g., target SLOs for time-to-first-token and throughput, typical prompt/output lengths, GPU types).
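As an example of what such grounding assumptions might look like, here is a minimal back-of-envelope sizing sketch in Python. Every figure in it (per-GPU decode throughput, mean output length, peak request rate, the 60% utilization headroom, the A100 reference) is a hypothetical assumption chosen for illustration, not part of the question.

```python
import math

# Illustrative capacity math under assumed numbers (all values hypothetical):
# a mid-size model on A100-class GPUs with continuous batching, decode-bound.

DECODE_TOK_PER_S_PER_GPU = 1500  # assumed aggregate decode throughput per GPU
AVG_OUTPUT_TOKENS = 300          # assumed mean output length per request
PEAK_RPS = 200                   # assumed regional peak request rate

def gpus_for_decode(peak_rps: float, out_tokens: float,
                    tok_per_s: float, headroom: float = 0.6) -> int:
    """GPUs needed so peak decode demand uses only `headroom` of capacity,
    leaving slack for spikes and for prefill work sharing the same GPUs."""
    demand = peak_rps * out_tokens  # tokens/second the fleet must generate
    return math.ceil(demand / (tok_per_s * headroom))

if __name__ == "__main__":
    n = gpus_for_decode(PEAK_RPS, AVG_OUTPUT_TOKENS, DECODE_TOK_PER_S_PER_GPU)
    print(f"~{n} GPUs per region at peak (before HA replication)")
```

A strong answer would state assumptions like these explicitly and show how they drive fleet sizing, autoscaling targets, and SLO budgets.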
Describe and justify your design for the following (a short sketch for the rate-limiting item follows this list):

- Model serving architecture and inference runtime
- Autoscaling and GPU scheduling under bursty load
- Global request routing and cross-region failover
- Model and version management (rollouts, canaries, rollback)
- Stateful dependency handling (e.g., session or KV-cache affinity)
- Observability (metrics, logs, traces, SLO monitoring)
- Rate limiting and multi-tenant isolation

Provide a clear end-to-end flow and the key trade-offs behind your choices.
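To make the rate-limiting item concrete, below is a minimal per-tenant token-bucket sketch in Python. The class name, the token-denominated accounting, and all limits are illustrative assumptions; a production system would typically back this state with a shared store (e.g., Redis) rather than in-process dictionaries.

```python
import time
from dataclasses import dataclass, field

@dataclass
class TokenBucket:
    """Per-tenant token bucket; capacity and refill rate are assumed values."""
    capacity: float      # burst budget, in request "tokens"
    refill_per_s: float  # sustained rate
    tokens: float = 0.0
    last: float = field(default_factory=time.monotonic)

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_s)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# Hypothetical usage: cost is proportional to the requested output length,
# so large generations draw down a tenant's budget faster than small ones.
buckets: dict[str, TokenBucket] = {}

def admit(tenant_id: str, est_output_tokens: int) -> bool:
    bucket = buckets.setdefault(
        tenant_id, TokenBucket(capacity=10_000, refill_per_s=500, tokens=10_000)
    )
    return bucket.allow(cost=est_output_tokens)
```

Charging admission by estimated output tokens rather than by request count is one way to keep a few long-generation tenants from starving the fleet; an answer could equally argue for concurrency caps or weighted fair queuing instead.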