System Design: Highly Scalable, Highly Available Generative AI Inference Platform
Context
Design a production-grade deployment for a generative AI text model (decoder-only Transformer, 7B–70B parameters) serving enterprise, multi-tenant traffic. The platform must remain highly scalable and highly available across regions while absorbing unpredictable traffic spikes.
You may make minimal, explicit assumptions to ground your design (e.g., target SLOs for time-to-first-token and throughput, typical prompt/output lengths, GPU types).
Requirements
Describe and justify your design for each of the following areas. The short code sketches that follow the areas are illustrative starting points under stated assumptions, not prescribed implementations.

Inference serving architecture
- Components and data/control planes
- Streaming vs. non-streaming; batching; cache usage
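
To ground the streaming/batching discussion, here is a minimal sketch of a dynamic batcher, assuming an asyncio frontend (Python 3.10+): it groups concurrent requests up to a size or time bound, runs a stand-in decode loop, and streams tokens back per request. MAX_BATCH, MAX_WAIT_MS, and the fake decode step are illustrative assumptions, not a real engine API; production engines typically do continuous batching internally (e.g., vLLM-style scheduling).

```python
# Sketch only: dynamic batching with per-request token streaming (assumes Python 3.10+,
# where asyncio.Queue binds its event loop lazily).
import asyncio
from dataclasses import dataclass, field

MAX_BATCH = 8      # assumed cap on requests decoded together
MAX_WAIT_MS = 10   # assumed max time to wait for the batch to fill

@dataclass
class Request:
    prompt: str
    max_new_tokens: int
    out: asyncio.Queue = field(default_factory=asyncio.Queue)  # per-request token stream

queue: asyncio.Queue = asyncio.Queue()  # shared admission queue

async def batching_loop() -> None:
    while True:
        batch = [await queue.get()]  # block until the first request arrives
        deadline = asyncio.get_running_loop().time() + MAX_WAIT_MS / 1000
        while len(batch) < MAX_BATCH:
            timeout = deadline - asyncio.get_running_loop().time()
            if timeout <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), timeout))
            except asyncio.TimeoutError:
                break
        await decode_batch(batch)

async def decode_batch(batch: list) -> None:
    # Stand-in for one forward pass per decode step; a real engine runs the model
    # here and evicts finished sequences early (continuous batching).
    for step in range(max(r.max_new_tokens for r in batch)):
        for r in batch:
            if step < r.max_new_tokens:
                await r.out.put(f"tok{step}")  # stream the token back to the caller
        await asyncio.sleep(0)  # yield so callers can consume the stream
    for r in batch:
        await r.out.put(None)  # end-of-stream marker

async def generate(prompt: str, max_new_tokens: int = 4):
    req = Request(prompt, max_new_tokens)
    await queue.put(req)
    while (tok := await req.out.get()) is not None:
        yield tok

async def main() -> None:
    asyncio.create_task(batching_loop())
    async for tok in generate("hello"):
        print(tok)

asyncio.run(main())
```
A real deployment would expose this over SSE or gRPC streaming and evict finished sequences from the batch between steps rather than padding to the longest request.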

Request routing
- Global and regional routing, session affinity, retries/hedging
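
For the retries/hedging item, below is a minimal sketch of a hedged call, assuming an asyncio client: a second attempt is fired only if the first has not answered within a hedge delay, and whichever finishes first wins. call_replica, the replica names, and HEDGE_DELAY_S are hypothetical stand-ins.

```python
# Sketch only: hedged requests against two replicas of the same regional pool.
import asyncio
import random

HEDGE_DELAY_S = 0.05  # assumed; in practice derived from the observed tail (p95/p99)

async def call_replica(name: str, payload: str) -> str:
    await asyncio.sleep(random.uniform(0.01, 0.2))  # stand-in for the network call
    return f"{name}: {payload}"

async def hedged_call(payload: str, replicas: list) -> str:
    first = asyncio.create_task(call_replica(replicas[0], payload))
    done, _ = await asyncio.wait({first}, timeout=HEDGE_DELAY_S)
    if done:
        return first.result()  # fast path: no hedge needed
    second = asyncio.create_task(call_replica(replicas[1], payload))
    done, pending = await asyncio.wait({first, second}, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()  # cancel the loser to avoid duplicate decode work
    return done.pop().result()

print(asyncio.run(hedged_call("prompt", ["replica-a", "replica-b"])))
```
Hedging is typically paired with idempotency keys so a duplicated generation can be cancelled or deduplicated downstream, and the hedge delay is tuned from the observed tail latency.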

Autoscaling (including GPU scheduling)
- Replica scaling signals, node autoscaling, bin-packing/MIG, warm pools
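
As a sketch of one replica scaling signal, the function below follows the shape of the Kubernetes HPA target-tracking formula, using queue depth per replica as the metric; the target, bounds, and example numbers are assumptions. Tokens per second or KV-cache utilization could be substituted as the metric.

```python
# Sketch only: target-tracking replica count from queue depth, HPA-style.
import math

TARGET_QUEUE_PER_REPLICA = 4        # assumed healthy backlog per GPU replica
MIN_REPLICAS, MAX_REPLICAS = 2, 64  # assumed bounds (warm pool sits below the min)

def desired_replicas(current: int, queued_requests: int) -> int:
    # Same shape as the Kubernetes HPA formula: desired = ceil(current * metric / target)
    per_replica = queued_requests / max(current, 1)
    desired = math.ceil(current * per_replica / TARGET_QUEUE_PER_REPLICA)
    return max(MIN_REPLICAS, min(MAX_REPLICAS, desired))

print(desired_replicas(current=8, queued_requests=96))  # -> 24
```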

Multi-region strategy
- Active-active vs. active-passive, failover triggers, data/control plane considerations
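
A minimal sketch of health-based region selection for an active-active layout, with the Region fields and the error-rate threshold as assumptions: prefer the lowest-latency region whose probes pass, and fall back to whatever remains rather than failing closed.

```python
# Sketch only: health- and latency-based region selection for active-active serving.
from dataclasses import dataclass

MAX_ERROR_RATE = 0.05  # assumed failover trigger

@dataclass
class Region:
    name: str
    healthy: bool      # synthetic probes passing
    error_rate: float  # rolling 5xx ratio
    rtt_ms: float      # client-observed latency

def pick_region(regions: list) -> Region:
    eligible = [r for r in regions if r.healthy and r.error_rate <= MAX_ERROR_RATE]
    if not eligible:
        eligible = regions  # last resort: degrade rather than fail closed
    return min(eligible, key=lambda r: r.rtt_ms)

regions = [Region("us-east", True, 0.01, 12.0), Region("eu-west", True, 0.09, 85.0)]
print(pick_region(regions).name)  # -> us-east
```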

Model versioning and rollout
- Registry, artifact management, canary/blue-green, rollback, compatibility (tokenizer/adapters)
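
A minimal sketch of weighted canary routing between model versions with an automatic rollback trigger on elevated error rate; the version names, weights, and threshold are illustrative assumptions.

```python
# Sketch only: weighted canary routing with an error-rate rollback trigger.
import random

weights = {"model-v1": 0.95, "model-v2-canary": 0.05}  # assumed traffic split
errors = {"model-v1": 0.004, "model-v2-canary": 0.0}   # rolling per-version error rates
ROLLBACK_MARGIN = 0.02                                  # assumed allowance over baseline

def choose_version() -> str:
    versions, probs = zip(*weights.items())
    return random.choices(versions, weights=probs, k=1)[0]

def maybe_rollback() -> None:
    if errors["model-v2-canary"] > errors["model-v1"] + ROLLBACK_MARGIN:
        weights["model-v2-canary"] = 0.0  # route everything back to the stable version
        weights["model-v1"] = 1.0

print(choose_version())
maybe_rollback()
```
In practice the weights would live in the gateway or service-mesh config (control plane) and the rollback decision in the rollout controller, keyed to per-version error and latency metrics.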

Stateful dependency management
- Tokenizer/embeddings versioning, KV/prompt caches, locality/affinity, external stores
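
A minimal sketch of cache locality via consistent hashing, assuming sessions (or prompt-prefix hashes) are the affinity key: repeated turns land on the replica that already holds the KV/prompt cache. Replica names and the virtual-node count are assumptions.

```python
# Sketch only: consistent-hash affinity so a session keeps hitting the replica
# that already holds its KV/prompt cache.
import bisect
import hashlib

VNODES = 64  # assumed virtual nodes per replica for smoother balance

def _h(key: str) -> int:
    return int(hashlib.sha256(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, replicas: list):
        self._points = sorted((_h(f"{r}#{i}"), r) for r in replicas for i in range(VNODES))
        self._keys = [p for p, _ in self._points]

    def replica_for(self, session_id: str) -> str:
        idx = bisect.bisect(self._keys, _h(session_id)) % len(self._keys)
        return self._points[idx][1]

ring = Ring(["gpu-pod-0", "gpu-pod-1", "gpu-pod-2"])
print(ring.replica_for("tenant-42/session-7"))  # stable across calls
```
A ring, rather than modulo hashing, keeps most sessions on their existing replica when replicas are added or removed during scaling.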

Observability
- Metrics/traces/logs at model/tenant/version levels; GPU health; SLO dashboards and alerting
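
A minimal sketch of the SLO math behind the dashboards and alerts, with the SLO target and sample values as assumptions: a nearest-rank p95 over time-to-first-token samples and a simple error-budget burn rate.

```python
# Sketch only: nearest-rank p95 for time-to-first-token and an error-budget burn rate.
import math

def p95(samples_ms: list) -> float:
    ordered = sorted(samples_ms)
    idx = math.ceil(0.95 * len(ordered)) - 1
    return ordered[max(idx, 0)]

def burn_rate(bad: int, total: int, slo_target: float = 0.999) -> float:
    # 1.0 means the error budget is being consumed exactly at the allowed pace
    error_budget = 1.0 - slo_target
    observed = bad / total if total else 0.0
    return observed / error_budget

ttft_ms = [180, 210, 250, 900, 230, 205, 260, 195, 240, 1200]  # assumed samples
print(p95(ttft_ms), burn_rate(bad=12, total=10_000))  # alert when burn rate runs well above 1
```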

Rate limiting and fairness
- Per-tenant budgets, token-based limits, concurrency caps, overload protection
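
A minimal sketch of per-tenant, token-based limiting: a token bucket refilled in tokens per second combined with a concurrency cap, checked at admission. All limits and the tenant name are illustrative assumptions.

```python
# Sketch only: per-tenant token bucket (prompt + completion tokens) plus a concurrency cap.
import time
from dataclasses import dataclass, field

@dataclass
class TenantBudget:
    tokens_per_s: float    # refill rate
    burst: float           # bucket capacity
    max_concurrency: int
    level: float = 0.0
    in_flight: int = 0
    last: float = field(default_factory=time.monotonic)

    def __post_init__(self):
        self.level = self.burst  # start with a full bucket

    def admit(self, estimated_tokens: int) -> bool:
        now = time.monotonic()
        self.level = min(self.burst, self.level + (now - self.last) * self.tokens_per_s)
        self.last = now
        if self.in_flight >= self.max_concurrency or self.level < estimated_tokens:
            return False  # caller returns 429 with a retry-after hint
        self.level -= estimated_tokens
        self.in_flight += 1
        return True

    def release(self) -> None:
        self.in_flight -= 1

budgets = {"tenant-a": TenantBudget(tokens_per_s=2_000, burst=20_000, max_concurrency=8)}
print(budgets["tenant-a"].admit(estimated_tokens=1_500))  # -> True
```
Estimated tokens can be reconciled against actual usage after completion so tenants are limited (and billed) on what they really consumed.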

Meeting latency/throughput SLOs under spikes and failures
- Admission control, dynamic batching, speculative decoding, degradation and fallbacks
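
A minimal sketch of admission control against a time-to-first-token budget, with all constants as assumptions: estimate queue wait from backlog and measured prefill throughput, then accept, degrade (for example, a smaller or quantized model, or a trimmed context), or shed with a fast 429/503.

```python
# Sketch only: accept / degrade / shed against a time-to-first-token budget.
TTFT_BUDGET_S = 1.0            # assumed SLO for time-to-first-token
PREFILL_TOKENS_PER_S = 20_000  # assumed measured prefill throughput of one replica

def admit(queued_prompt_tokens: int, new_prompt_tokens: int) -> str:
    est_wait = (queued_prompt_tokens + new_prompt_tokens) / PREFILL_TOKENS_PER_S
    if est_wait <= TTFT_BUDGET_S:
        return "accept"
    if est_wait <= 2 * TTFT_BUDGET_S:
        return "degrade"  # e.g. smaller/quantized model, trimmed context, no speculation
    return "shed"         # fast 429/503 so clients retry another replica or region

print(admit(queued_prompt_tokens=15_000, new_prompt_tokens=3_000))  # -> accept
```
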
Provide a clear end-to-end flow and the key trade-offs behind your choices.