System Design: GPU-Backed Inference Platform and API
You are designing a production inference platform to serve deep learning models (vision, ranking, and LLMs) at scale. Traffic is bursty and multi-tenant. The GPU fleet is heterogeneous (e.g., T4, A10, A100/H100), and you must support multiple models per GPU while meeting latency SLOs.
Design the system and explain key trade-offs. Specifically cover:
- API shape
  - Synchronous vs asynchronous semantics
  - Streaming responses
  - Backwards-compatible schema and API versioning
- Request schema and versioning (schema sketch below)
  - Request/response schemas, idempotency, and model version pinning vs aliases
- Request routing
  - Regional routing, capacity-aware routing, and per-model queues
- Batching strategy to maximize GPU utilization (batching sketch below)
  - Micro/continuous batching, padding, admission control; handling LLM prefill vs decode
- Model loading and cold-start mitigation (weight-cache sketch below)
  - Model registry, artifact formats, weight caching/warm pools
- Multi-model hosting on shared GPUs
  - Memory management, MIG/MPS, LoRA/adapters, isolation vs utilization trade-offs
- Autoscaling across heterogeneous nodes
  - Metrics to scale on, predictive vs reactive, bin-packing/placement
- Placement strategy (placement sketch below)
  - Scheduling by GPU type, memory, model affinity, anti-affinity for HA
- Observability
  - Latency (end-to-end and queueing), throughput, GPU metrics (utilization, memory, SM occupancy)
- Rate limiting (token-bucket sketch below)
  - Per-tenant quotas, concurrency limits, fairness
- Multi-tenant isolation
  - Noisy-neighbor controls, security, priority tiers
- Failure handling
  - Retries, timeouts, idempotency, circuit breakers
  - Canary and A/B rollouts, shadowing
- High-level architecture
  - Control plane vs data plane, main components and request flows
Assume typical production SLOs (e.g., p95 latency < 300 ms for small models; streamed responses for LLMs), payloads up to ~10 MB, and that the platform must operate across multiple regions.
Provide a concise, well-structured design with diagrams described in words and call out the most important trade-offs.
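
To make a few of the requested items concrete, the sketches below are illustrative only: every name, field, constant, and weight in them is an assumption for discussion, not a prescribed implementation. First, one possible request/response envelope covering sync vs async semantics, streaming, idempotency, and model version pinning vs aliases:

```python
from dataclasses import dataclass, field
from typing import Literal, Optional
import uuid

# Illustrative request/response envelope for a JSON-over-HTTP inference API.
# Field names are assumptions for discussion, not a prescribed schema.

@dataclass
class InferenceRequest:
    model: str                                   # e.g. "ranker-large"
    # Pin an immutable version for reproducibility, or use a mutable alias
    # ("prod", "canary") that the control plane resolves at admission time.
    model_version: Optional[str] = None          # e.g. "v37"
    model_alias: Optional[str] = None            # e.g. "prod"
    inputs: dict = field(default_factory=dict)   # model-specific payload, <= ~10 MB
    mode: Literal["sync", "async"] = "sync"      # async returns a job handle instead
    stream: bool = False                         # chunked / server-sent-event output
    # Client-supplied key so retried submissions of the same logical call dedupe.
    idempotency_key: str = field(default_factory=lambda: str(uuid.uuid4()))
    timeout_ms: int = 300
    priority: Literal["batch", "standard", "interactive"] = "standard"

@dataclass
class InferenceResponse:
    request_id: str
    model: str
    resolved_version: str        # concrete version the alias resolved to
    outputs: Optional[dict]      # present for sync calls
    job_id: Optional[str]        # present for async calls; poll or subscribe
    queue_ms: float              # time spent queued, for SLO attribution
    compute_ms: float
```

Returning the concretely resolved version makes alias-based routing debuggable, and the client-supplied idempotency key lets the gateway deduplicate retried submissions.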
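
For the batching item, a minimal micro-batching loop for a single model replica, assuming an asyncio service; `run_model`, `MAX_BATCH_SIZE`, and `MAX_WAIT_MS` are placeholders. Continuous batching for LLM decode (interleaving prefill and decode of different requests by re-forming the batch at every decode step) follows the same pattern but operates per step rather than per request:

```python
import asyncio
import time

# Minimal micro-batching loop for one model replica: flush a batch when it is
# full, or MAX_WAIT_MS after the first queued request is picked up. run_model
# stands in for the padded GPU forward pass and must return one output per input.

MAX_BATCH_SIZE = 16
MAX_WAIT_MS = 5.0

async def micro_batcher(queue: asyncio.Queue, run_model) -> None:
    while True:
        first = await queue.get()                 # block until there is work
        batch = [first]
        deadline = time.monotonic() + MAX_WAIT_MS / 1000
        while len(batch) < MAX_BATCH_SIZE:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), timeout=remaining))
            except asyncio.TimeoutError:
                break
        inputs = [req for req, _ in batch]
        outputs = await run_model(inputs)         # one batched forward pass
        for (_, fut), out in zip(batch, outputs):
            fut.set_result(out)                   # wake the waiting request handler

async def submit(queue: asyncio.Queue, request) -> object:
    # Admission control (queue depth limits, per-tenant quotas) would reject here.
    fut = asyncio.get_running_loop().create_future()
    await queue.put((request, fut))
    return await fut
```

The core trade-off is latency vs utilization: a larger `MAX_WAIT_MS` yields fuller batches and better GPU utilization at the cost of queueing latency, which is why queue time should be tracked separately under observability.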
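
For cold-start mitigation, a toy per-GPU weight cache with LRU eviction under a memory budget; `load_weights` and `unload` stand in for real engine calls (reading artifacts fetched from the model registry into GPU memory, freeing buffers), and the policy itself is an assumption:

```python
from collections import OrderedDict

# Toy per-GPU weight cache: keep loaded models in LRU order and evict cold ones
# when a new load would exceed the memory budget. load_weights/unload stand in
# for real engine calls (fetching registry artifacts, freeing GPU buffers).

class WeightCache:
    def __init__(self, budget_gb: float, load_weights, unload):
        self.budget_gb = budget_gb
        self.load_weights = load_weights
        self.unload = unload
        self.loaded = OrderedDict()    # model_id -> size_gb, oldest first

    def acquire(self, model_id: str, size_gb: float) -> None:
        if model_id in self.loaded:
            self.loaded.move_to_end(model_id)      # cache hit: refresh recency
            return
        while self.loaded and sum(self.loaded.values()) + size_gb > self.budget_gb:
            victim, _ = self.loaded.popitem(last=False)
            self.unload(victim)                    # evict least recently used
        self.load_weights(model_id)                # cold load: registry -> disk -> GPU
        self.loaded[model_id] = size_gb
```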
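
For placement, a toy scorer that applies hard constraints (free memory, allowed GPU type) and then trades off warm-weight affinity, node anti-affinity for HA, and best-fit packing; the weights are arbitrary and only illustrate the shape of the decision:

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Toy placement scorer: filter GPUs on hard constraints (free memory, allowed
# type), then prefer warm-weight affinity, node anti-affinity for HA, and
# best-fit packing. The weights are arbitrary and purely illustrative.

@dataclass
class Gpu:
    node: str
    gpu_type: str                       # "T4", "A10", "A100", ...
    free_mem_gb: float
    hosted_models: set = field(default_factory=set)

@dataclass
class ModelSpec:
    name: str
    mem_gb: float
    allowed_gpu_types: set
    replica_nodes: set                  # nodes already running a replica

def score(gpu: Gpu, model: ModelSpec) -> Optional[float]:
    if model.mem_gb > gpu.free_mem_gb or gpu.gpu_type not in model.allowed_gpu_types:
        return None                                           # hard constraints fail
    s = 10.0 if model.name in gpu.hosted_models else 0.0      # warm weights / affinity
    s -= 5.0 if gpu.node in model.replica_nodes else 0.0      # anti-affinity for HA
    s += 3.0 * (model.mem_gb / gpu.free_mem_gb)               # best-fit packing
    return s

def place(gpus: List[Gpu], model: ModelSpec) -> Optional[Gpu]:
    scored = [(score(g, model), g) for g in gpus]
    scored = [(s, g) for s, g in scored if s is not None]
    return max(scored, key=lambda sg: sg[0])[1] if scored else None
```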
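
For rate limiting, a per-tenant token bucket that caps sustained request rate while absorbing bursts; the tenant IDs and limits are examples, and per-tenant concurrency caps plus fair queueing between tenants would sit behind this first admission check:

```python
import time

# Per-tenant token bucket: `rate` tokens per second refill up to `burst`. A
# request is admitted only if enough tokens remain, capping sustained QPS per
# tenant while absorbing short bursts. Tenants and limits are illustrative.

class TokenBucket:
    def __init__(self, rate: float, burst: float):
        self.rate = rate
        self.burst = burst
        self.tokens = burst
        self.updated = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# One bucket per (tenant, model); hypothetical limits for two priority tiers.
buckets = {
    "tenant-a": TokenBucket(rate=50.0, burst=100.0),
    "tenant-b": TokenBucket(rate=5.0, burst=20.0),
}

def admit(tenant: str, cost: float = 1.0) -> bool:
    bucket = buckets.get(tenant)
    return bucket.allow(cost) if bucket else False
```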