System Design: GPU-Backed Multi-Model Inference API
Context
Design a production-grade inference platform for serving multiple ML models (e.g., LLMs, vision, and classic DL models) backed by GPUs. The platform must meet strict latency SLOs for online traffic while achieving high throughput via dynamic batching. It should support model versioning with A/B routing, autoscale across heterogeneous GPU nodes, provide isolation and quotas for multiple tenants, and remain fault-tolerant.
Assume global deployment in at least two regions, gRPC/HTTP-based clients, and a mix of streaming and unary requests.
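The request/response surface below is a minimal sketch, assuming hypothetical message and service names (`InferenceRequest`, `InferenceService`, etc.); in practice this would be a protobuf/gRPC definition, but the same shapes are shown as Python dataclasses to make the unary vs. streaming split and the per-request deadline explicit.

```python
from dataclasses import dataclass
from typing import Iterator, Optional

@dataclass
class InferenceRequest:
    tenant_id: str
    model: str                      # logical model name, e.g. "chat-llm" (illustrative)
    version: Optional[str] = None   # pinned version; None lets the router choose
    payload: bytes = b""            # serialized input (tokens, image, feature vector)
    deadline_ms: int = 500          # client latency budget, used for admission and batching
    stream: bool = False            # request chunk/token streaming instead of a unary reply

@dataclass
class InferenceChunk:
    data: bytes
    is_final: bool = False

class InferenceService:
    """Hypothetical service surface mirroring a gRPC definition with unary and
    server-streaming methods."""

    def infer(self, request: InferenceRequest) -> InferenceChunk:
        """Unary RPC: one request, one complete response."""
        raise NotImplementedError

    def infer_stream(self, request: InferenceRequest) -> Iterator[InferenceChunk]:
        """Server-streaming RPC: yields partial results (e.g. tokens) as they are produced."""
        raise NotImplementedError
```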
Requirements
- SLOs and Latency: Low-latency online inference with clear SLOs (e.g., P95 time-to-first-byte/token and P95 completion latency), plus high availability.
- Throughput: Dynamic batching that raises GPU utilization while still respecting per-request SLOs (see the batching sketch after this list).
- Models: Multiple models and versions, with A/B testing and weighted routing (see the routing sketch after this list).
- Autoscaling: Across heterogeneous GPU node types (e.g., A10, A100, H100), with pre-warming and scale-to-zero support (see the scaling sketch after this list).
- Queueing and Backpressure: Fairness and SLO-aware admission control (see the quota sketch after this list).
- Multi-tenant Isolation: Per-tenant quotas, budgets, and safety limits (the same quota sketch applies).
- Lifecycle: Model loading, warmup, and cache priming (see the warmup sketch after this list).
- GPU Memory Management: Weights residency, KV-cache planning, and eviction (see the KV-cache sketch after this list).
- Fault Tolerance: Retries, draining, canarying, and rollbacks.
- Architecture: Describe the API gateway, scheduler/router, batching layer, GPU workers, model registry, and control plane.
- Protocols: Streaming vs. unary RPC.
- Observability: Metrics, traces, and logs.
- Cost Controls: Efficiency, scaling policies, and spend guardrails.
- Security: AuthN/Z, isolation, artifact integrity, and data protection.
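Referenced from the Throughput requirement: a minimal sketch of an SLO-aware dynamic batcher, under the assumption that each request carries a latency budget and that batch execution time can be estimated. All names (`DynamicBatcher`, `est_exec_ms`) and thresholds are illustrative, not a prescribed implementation.

```python
import time
from collections import deque
from dataclasses import dataclass

@dataclass
class Queued:
    payload: bytes
    enqueue_ts: float   # time.monotonic() at enqueue
    deadline_ms: int    # per-request latency budget

class DynamicBatcher:
    """Collects requests into batches bounded by size and by the tightest deadline."""

    def __init__(self, max_batch: int = 32, est_exec_ms: float = 40.0):
        self.max_batch = max_batch
        self.est_exec_ms = est_exec_ms   # assumed execution time for a full batch
        self.queue: deque[Queued] = deque()

    def submit(self, item: Queued) -> None:
        self.queue.append(item)

    def next_batch(self) -> list[Queued]:
        """Block until a batch should be dispatched, then return it."""
        while True:
            if len(self.queue) >= self.max_batch:
                break
            if self.queue and self._slack_ms(self.queue[0]) <= 0:
                break   # waiting any longer would blow the oldest request's SLO
            time.sleep(0.001)
        return [self.queue.popleft() for _ in range(min(self.max_batch, len(self.queue)))]

    def _slack_ms(self, item: Queued) -> float:
        waited = (time.monotonic() - item.enqueue_ts) * 1000.0
        return item.deadline_ms - waited - self.est_exec_ms
```

The key design choice is that a batch closes either when it is full or when the oldest request's remaining slack would be exhausted by waiting longer; that is the lever that trades throughput against the latency SLO.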
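Referenced from the Models requirement: a sketch of weighted A/B routing between model versions, assuming a hash of a stable routing key (e.g. tenant + session) so a given client consistently lands on the same version during an experiment. Version names and weights are illustrative.

```python
import hashlib

class VersionRouter:
    """Routes a request to a model version using weighted splits, with sticky
    assignment per routing key so an experiment sees consistent traffic."""

    def __init__(self, weights: dict[str, float]):
        # e.g. {"v3": 0.9, "v4-canary": 0.1}; weights are assumed to sum to 1.0
        self.weights = weights

    def pick(self, routing_key: str) -> str:
        # Hash the key into [0, 1) so the same key always maps to the same version.
        digest = hashlib.sha256(routing_key.encode()).digest()
        point = int.from_bytes(digest[:8], "big") / 2**64
        cumulative = 0.0
        for version, weight in self.weights.items():
            cumulative += weight
            if point < cumulative:
                return version
        return next(reversed(self.weights))   # guard against floating-point rounding

router = VersionRouter({"v3": 0.9, "v4-canary": 0.1})
print(router.pick("tenant-a:session-123"))
```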
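Referenced from the Autoscaling requirement: a sketch of a scale-up decision across heterogeneous GPU node types, under the simplifying assumption that per-GPU throughput for the target model has been benchmarked. The throughput and price figures below are placeholders, not real benchmarks.

```python
import math
from dataclasses import dataclass

@dataclass
class NodeType:
    name: str               # e.g. "A10", "A100", "H100"
    tokens_per_sec: float   # measured throughput for the target model on this GPU type
    usd_per_hour: float

def plan_scale_up(backlog_tokens_per_sec: float, candidates: list[NodeType]) -> tuple[NodeType, int]:
    """Pick the node type with the best cost per unit of throughput, then size the
    replica count to absorb the backlog. A real policy would also weigh warm-pool
    availability, scale-to-zero timers, and per-model placement constraints."""
    best = min(candidates, key=lambda n: n.usd_per_hour / n.tokens_per_sec)
    replicas = max(1, math.ceil(backlog_tokens_per_sec / best.tokens_per_sec))
    return best, replicas

# Illustrative (assumed) throughput and price figures.
fleet = [NodeType("A10", 2_500, 1.2), NodeType("A100", 9_000, 3.7), NodeType("H100", 20_000, 8.0)]
print(plan_scale_up(backlog_tokens_per_sec=30_000, candidates=fleet))
```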
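Referenced from the Queueing/Backpressure and Multi-tenant Isolation requirements: a sketch of per-tenant admission control using a token bucket, assuming quotas are expressed as a sustained request rate plus a burst allowance. Rejected requests surface as backpressure (e.g. HTTP 429 or gRPC RESOURCE_EXHAUSTED) rather than unbounded queue growth.

```python
import time

class TenantQuota:
    """Per-tenant token bucket used for admission control, so one tenant's burst
    cannot starve others of GPU time."""

    def __init__(self, rate_per_sec: float, burst: float):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = burst
        self.last = time.monotonic()

    def admit(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

quotas = {"tenant-a": TenantQuota(rate_per_sec=50, burst=100)}
if not quotas["tenant-a"].admit():
    print("reject with 429: tenant over quota")
```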
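Referenced from the Lifecycle requirement: a sketch of a warmup pass that exercises the batch sizes the batcher will actually produce before the worker reports Ready, so kernel autotuning and caches are primed off the critical path. `run_batch` is a stand-in for the real engine call, and the batch sizes are assumptions.

```python
import time
from typing import Callable

def warm_up(run_batch: Callable[[int], None],
            batch_sizes: tuple[int, ...] = (1, 4, 16, 32),
            iters: int = 3) -> float:
    """Run dummy batches at representative sizes before flipping the readiness probe."""
    start = time.monotonic()
    for bs in batch_sizes:
        for _ in range(iters):
            run_batch(bs)
    return time.monotonic() - start

# Example with a fake execution function; a real worker would call its inference engine.
elapsed = warm_up(lambda bs: time.sleep(0.001 * bs))
print(f"warmup took {elapsed:.3f}s; now mark the worker Ready")
```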
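Referenced from the GPU Memory Management requirement: a back-of-the-envelope KV-cache planner. The per-token footprint is 2 (K and V) × layers × KV heads × head dim × bytes per element; dividing the memory left after weights and headroom by that figure bounds how many tokens of context can be resident at once. The model and GPU numbers in the example are assumed for illustration.

```python
def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """Per-token KV-cache footprint: K and V tensors across all layers."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def max_concurrent_tokens(gpu_mem_gb: float, weights_gb: float, per_token_bytes: int,
                          headroom: float = 0.9) -> int:
    """Tokens of KV cache that fit after reserving weights and activation headroom."""
    free_bytes = (gpu_mem_gb * headroom - weights_gb) * 1024**3
    return int(free_bytes // per_token_bytes)

# Assumed config: a 70B-class model with grouped-query attention, fp16 KV cache.
per_tok = kv_cache_bytes_per_token(layers=80, kv_heads=8, head_dim=128, dtype_bytes=2)
print(per_tok)   # 327,680 bytes, i.e. ~320 KiB per token for this assumed config

# Assumed 80 GB GPU holding a 40 GB weights shard: roughly 100k resident context tokens.
print(max_concurrent_tokens(gpu_mem_gb=80, weights_gb=40, per_token_bytes=per_tok))
```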