This question evaluates understanding of scalable, highly available generative AI inference platforms and associated competencies in distributed systems, ML model serving, autoscaling and GPU scheduling, global request routing, model/version management, stateful dependency handling, observability, and rate limiting.
Design a production-grade deployment for a generative AI text model (decoder-only Transformer, 7B–70B parameters) serving enterprise, multi-tenant traffic. The platform must scale elastically, remain highly available across regions, and absorb unpredictable traffic spikes.
You may make minimal, explicit assumptions to ground your design (e.g., target SLOs for time-to-first-token and throughput, typical prompt/output lengths, GPU types).
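As an example of what such grounding assumptions might look like, here is a minimal back-of-envelope sizing sketch in Python. Every figure in it (per-GPU decode throughput, mean output length, peak request rate, the 60% utilization headroom, the A100 reference) is a hypothetical assumption chosen for illustration, not part of the question.

```python
import math

# Illustrative capacity math under assumed numbers (all values hypothetical):
# a mid-size model on A100-class GPUs with continuous batching, decode-bound.

DECODE_TOK_PER_S_PER_GPU = 1500  # assumed aggregate decode throughput per GPU
AVG_OUTPUT_TOKENS = 300          # assumed mean output length per request
PEAK_RPS = 200                   # assumed regional peak request rate

def gpus_for_decode(peak_rps: float, out_tokens: float,
                    tok_per_s: float, headroom: float = 0.6) -> int:
    """GPUs needed so peak decode demand uses only `headroom` of capacity,
    leaving slack for spikes and for prefill work sharing the same GPUs."""
    demand = peak_rps * out_tokens  # tokens/second the fleet must generate
    return math.ceil(demand / (tok_per_s * headroom))

if __name__ == "__main__":
    n = gpus_for_decode(PEAK_RPS, AVG_OUTPUT_TOKENS, DECODE_TOK_PER_S_PER_GPU)
    print(f"~{n} GPUs per region at peak (before HA replication)")
```

A strong answer would state assumptions like these explicitly and show how they drive fleet sizing, autoscaling targets, and SLO budgets.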
Describe and justify your design for the following (a short sketch for the rate-limiting item follows this list):

- Model serving architecture and inference runtime
- Autoscaling and GPU scheduling under bursty load
- Global request routing and cross-region failover
- Model and version management (rollouts, canaries, rollback)
- Stateful dependency handling (e.g., session or KV-cache affinity)
- Observability (metrics, logs, traces, SLO monitoring)
- Rate limiting and multi-tenant isolation

Provide a clear end-to-end flow and the key trade-offs behind your choices.
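To make the rate-limiting item concrete, below is a minimal per-tenant token-bucket sketch in Python. The class name, the token-denominated accounting, and all limits are illustrative assumptions; a production system would typically back this state with a shared store (e.g., Redis) rather than in-process dictionaries.

```python
import time
from dataclasses import dataclass, field

@dataclass
class TokenBucket:
    """Per-tenant token bucket; capacity and refill rate are assumed values."""
    capacity: float      # burst budget, in request "tokens"
    refill_per_s: float  # sustained rate
    tokens: float = 0.0
    last: float = field(default_factory=time.monotonic)

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_s)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# Hypothetical usage: cost is proportional to the requested output length,
# so large generations draw down a tenant's budget faster than small ones.
buckets: dict[str, TokenBucket] = {}

def admit(tenant_id: str, est_output_tokens: int) -> bool:
    bucket = buckets.setdefault(
        tenant_id, TokenBucket(capacity=10_000, refill_per_s=500, tokens=10_000)
    )
    return bucket.allow(cost=est_output_tokens)
```

Charging admission by estimated output tokens rather than by request count is one way to keep a few long-generation tenants from starving the fleet; an answer could equally argue for concurrency caps or weighted fair queuing instead.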