System Design: Cloud AI Inference Platform for Real-Time and Batch
Context
Design a multi-tenant cloud platform that serves machine learning models for both real-time (online) and batch (offline) workloads. The platform should support multiple model frameworks and versions, meet latency and throughput targets, and provide strong isolation, observability, and cost control.
Requirements
- Model Packaging and Versioning (sketch below)
  - How will models (artifacts + code + environment) be packaged, versioned, and promoted across environments?
- Hardware Selection (sketch below)
  - When should CPU vs. GPU be used for different models and workloads? Consider quantization, compilation, and GPU sharing.
- Autoscaling Strategy (sketch below)
  - Horizontal/vertical autoscaling for online and batch; scale-to-zero for idle models.
- Request Routing (sketch below)
  - Routing by model/version/tenant; consistent hashing; priority and rate limiting; token streaming for generative models.
- Multi-Tenant Isolation (sketch below)
  - Compute/storage/network isolation; quotas; fairness; GPU partitioning.
- SLO Targets (sketch below)
  - Propose latency and throughput targets for representative model types (e.g., small classifiers, embeddings, LLM text generation) and batch SLAs.
- Cost Controls (sketch below)
  - Right-sizing; spot/preemptible instances; model optimization (quantization, distillation); bin packing; budgets/chargeback.
- Observability (sketch below)
  - Tracing, metrics, and logs; SLIs/SLOs; GPU metrics; model quality and drift.
- Safe Rollouts (sketch below)
  - Canary and shadow deployments; rollback criteria; guardrails.
- Failure Handling (sketch below)
  - Retries/backoff; circuit breaking; regional failover; GPU out-of-memory handling; degraded modes.
- Kubernetes Integration
  - Ingress, scheduling (GPUs/MIG), operators/CRDs, HPA/KEDA, service mesh, persistent caches.
- Model Registry Integration
  - Registry choices, signatures/schemas, lineage; CI/CD from registry to serving.
- Industry Comparison
  - Briefly compare to managed offerings and open-source stacks.
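
Illustrative Sketches
The sketches below make several of the requirements above concrete. Each is a minimal example under stated assumptions; names, thresholds, and prices are placeholders, not recommendations.

For Model Packaging and Versioning, one option is a per-version manifest that pins the model artifact, the serving image, and the I/O schemas together, so promotion across environments moves an immutable, verifiable unit. The field names below are illustrative assumptions, not a standard format.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ModelManifest:
    """Hypothetical manifest pinning everything needed to serve one model version."""
    name: str                # logical model name, e.g. "fraud-classifier"
    version: str             # immutable version (semantic or content-addressed)
    framework: str           # e.g. "onnx", "torch", "tensorrt"
    artifact_uri: str        # object-store location of the weights/artifact
    artifact_sha256: str     # integrity check so promotion verifies bytes, not tags
    image_digest: str        # serving container pinned by digest, not by mutable tag
    input_schema: dict = field(default_factory=dict)   # request schema for validation
    output_schema: dict = field(default_factory=dict)  # response schema for validation
    stage: str = "staging"   # staging -> canary -> production promotion path

manifest = ModelManifest(
    name="fraud-classifier",
    version="2024-06-01-3f9ab2",
    framework="onnx",
    artifact_uri="s3://models/fraud-classifier/2024-06-01-3f9ab2/model.onnx",
    artifact_sha256="c0ffee0placeholder",
    image_digest="sha256:abc123placeholder",
)
```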
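
For Hardware Selection, a first-order check is cost per million requests at each target's measured throughput; GPUs only win when batching keeps them busy. The prices and throughputs below are placeholder assumptions.

```python
def cost_per_million(requests_per_second: float, hourly_price_usd: float) -> float:
    """Cost to serve one million requests on one instance at steady throughput."""
    seconds_needed = 1_000_000 / requests_per_second
    return hourly_price_usd * seconds_needed / 3600.0

# Placeholder numbers: a small quantized model on CPU vs. the same model batched on a GPU.
cpu_cost = cost_per_million(requests_per_second=200, hourly_price_usd=0.40)    # ~$0.56/M
gpu_cost = cost_per_million(requests_per_second=4000, hourly_price_usd=2.50)   # ~$0.17/M
# The GPU number assumes sustained high utilization; at low or bursty traffic,
# CPU serving (or scale-to-zero) is usually cheaper.
```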
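
For the Autoscaling Strategy, online serving commonly scales on a load metric with the proportional rule used by HPA/KEDA-style autoscalers (desired = ceil(current metric / per-replica target)), plus scale-to-zero for idle models. The metric choice and thresholds below are assumptions.

```python
import math

def desired_replicas(in_flight_requests: int,
                     target_per_replica: int,
                     current_replicas: int,
                     idle_seconds: float,
                     scale_to_zero_after_s: float = 300.0,
                     max_replicas: int = 50) -> int:
    """Proportional scaling on an in-flight-request metric (illustrative)."""
    if in_flight_requests == 0 and idle_seconds >= scale_to_zero_after_s:
        return 0                               # scale-to-zero for idle models
    desired = max(1, math.ceil(in_flight_requests / target_per_replica))
    if desired < current_replicas:             # dampen scale-down to avoid thrashing
        desired = max(desired, current_replicas // 2)
    return min(desired, max_replicas)

replicas = desired_replicas(in_flight_requests=120, target_per_replica=8,
                            current_replicas=10, idle_seconds=0.0)   # -> 15
```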
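
For Request Routing, a consistent-hash ring keeps a given model/version (or tenant) pinned to the same replicas as the fleet changes, preserving warm weights and caches. This is a minimal ring with virtual nodes, not a production router.

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring with virtual nodes (illustrative)."""
    def __init__(self, nodes: list[str], vnodes: int = 128):
        self._ring: list[tuple[int, str]] = []
        for node in nodes:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()
        self._hashes = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def route(self, key: str) -> str:
        """Pick the first node clockwise from the key's position on the ring."""
        idx = bisect.bisect(self._hashes, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["replica-a", "replica-b", "replica-c"])
backend = ring.route("tenant-42/llm-chat/v3")   # stable while the replica set is stable
```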
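
For Multi-Tenant Isolation, per-tenant quotas are commonly enforced with a token bucket at admission time, before requests reach shared GPU queues. The tiers and rates below are placeholders.

```python
import time

class TokenBucket:
    """Per-tenant admission control: refill at `rate` tokens/s up to `burst` (illustrative)."""
    def __init__(self, rate: float, burst: float):
        self.rate, self.burst = rate, burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False    # reject or queue at lower priority rather than starving other tenants

buckets = {"tenant-a": TokenBucket(rate=100, burst=200),   # paid tier (placeholder numbers)
           "tenant-b": TokenBucket(rate=10, burst=20)}     # free tier
admitted = buckets["tenant-b"].allow()
```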
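
For SLO Targets, whatever numbers are chosen should translate into an explicit error budget and a percentile check against live measurements. The example targets here (p99 latency per model class, 99.9% availability) are assumptions, not recommendations.

```python
def error_budget_minutes(availability_slo: float, window_days: int = 30) -> float:
    """Allowed downtime within the window for a given availability SLO."""
    return (1.0 - availability_slo) * window_days * 24 * 60

def p99_ms(latencies_ms: list[float]) -> float:
    """Nearest-rank p99 over a sample of request latencies."""
    ordered = sorted(latencies_ms)
    return ordered[max(0, round(0.99 * len(ordered)) - 1)]

# Placeholder targets: small classifier p99 <= 50 ms, embeddings p99 <= 100 ms,
# LLM generation time-to-first-token p99 <= 1000 ms; availability 99.9%.
budget = error_budget_minutes(0.999)                      # ~43.2 minutes per 30-day window
meets_slo = p99_ms([12.0, 18.0, 24.0, 47.0]) <= 50.0      # True for this tiny sample
```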
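
For Cost Controls, the bin-packing part can be approximated with first-fit decreasing over per-model memory footprints, so several small models share one accelerator. The GPU size and footprints below are placeholders, and headroom for activations/KV cache is ignored.

```python
def first_fit_decreasing(model_mem_gb: dict[str, float], gpu_mem_gb: float) -> list[list[str]]:
    """Pack models onto as few GPUs as possible by memory footprint (illustrative heuristic)."""
    gpus: list[tuple[float, list[str]]] = []          # (free memory, models placed)
    for name, mem in sorted(model_mem_gb.items(), key=lambda kv: -kv[1]):
        for i, (free, placed) in enumerate(gpus):
            if mem <= free:
                gpus[i] = (free - mem, placed + [name])
                break
        else:
            gpus.append((gpu_mem_gb - mem, [name]))   # open a new GPU
    return [placed for _, placed in gpus]

packing = first_fit_decreasing(
    {"llm-7b-int4": 6.0, "embedder": 2.0, "reranker": 4.0, "classifier": 1.0},
    gpu_mem_gb=24.0,
)   # -> [["llm-7b-int4", "reranker", "embedder", "classifier"]]
```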
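
For Observability, the starting point is per-model, per-version request and latency metrics alongside GPU and quality metrics. The snippet below uses the prometheus_client library; the metric names and labels are assumptions.

```python
import time
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("inference_requests_total", "Inference requests",
                   ["model", "version", "tenant", "status"])
LATENCY = Histogram("inference_latency_seconds", "End-to-end inference latency",
                    ["model", "version"],
                    buckets=(0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0))

def handle(model: str, version: str, tenant: str, predict):
    """Wrap a prediction call with request counting and latency observation."""
    start = time.perf_counter()
    try:
        result = predict()
        REQUESTS.labels(model, version, tenant, "ok").inc()
        return result
    except Exception:
        REQUESTS.labels(model, version, tenant, "error").inc()
        raise
    finally:
        LATENCY.labels(model, version).observe(time.perf_counter() - start)

start_http_server(9090)   # expose /metrics for scraping
```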
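
For Safe Rollouts, rollback criteria should be numeric and automated. This gate compares a canary against the baseline on error rate and p99 latency; the thresholds are placeholders.

```python
def canary_verdict(baseline: dict, canary: dict,
                   max_error_rate_delta: float = 0.005,
                   max_p99_ratio: float = 1.2,
                   min_requests: int = 1000) -> str:
    """Return 'promote', 'rollback', or 'wait' for a canary release (illustrative gate)."""
    if canary["requests"] < min_requests:
        return "wait"                                   # not enough traffic to decide
    if canary["error_rate"] > baseline["error_rate"] + max_error_rate_delta:
        return "rollback"
    if canary["p99_ms"] > baseline["p99_ms"] * max_p99_ratio:
        return "rollback"
    return "promote"

verdict = canary_verdict(
    baseline={"requests": 50_000, "error_rate": 0.002, "p99_ms": 180.0},
    canary={"requests": 2_500, "error_rate": 0.003, "p99_ms": 190.0},
)   # -> "promote"
```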
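
For Failure Handling, the client-facing half is typically capped, jittered exponential backoff combined with a circuit breaker, so a struggling backend is not hammered while it recovers or fails over. All constants below are placeholders, and retries assume idempotent requests.

```python
import random
import time

class CircuitBreaker:
    """Open after `threshold` consecutive failures; allow a probe after `cooldown_s` (illustrative)."""
    def __init__(self, threshold: int = 5, cooldown_s: float = 30.0):
        self.threshold, self.cooldown_s = threshold, cooldown_s
        self.failures = 0
        self.opened_at = 0.0

    def allow(self) -> bool:
        return (self.failures < self.threshold
                or time.monotonic() - self.opened_at >= self.cooldown_s)

    def record(self, ok: bool) -> None:
        self.failures = 0 if ok else self.failures + 1
        if self.failures >= self.threshold:
            self.opened_at = time.monotonic()

def call_with_retries(send, breaker: CircuitBreaker, attempts: int = 4, base_delay_s: float = 0.05):
    """Capped, jittered exponential backoff around an idempotent request."""
    for attempt in range(attempts):
        if not breaker.allow():
            raise RuntimeError("circuit open: shed load or fail over to another region")
        try:
            result = send()
            breaker.record(ok=True)
            return result
        except Exception:
            breaker.record(ok=False)
            if attempt == attempts - 1:
                raise
            time.sleep(min(2.0, base_delay_s * 2 ** attempt) * random.uniform(0.5, 1.5))
```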