System Design Review: Machine-Learning Inference API (Distributed Systems Focus)
Background
You are reviewing a teammate’s design document for a production machine-learning inference API that serves text-generation models (e.g., chat/completions) with token streaming. The service is multi-tenant and must run across multiple availability zones with GPUs/accelerators.
Assume typical LLM workloads (prompt prefill + token-by-token decode), dynamic batching, and a mix of small and large model SKUs. The system must support safe model rollouts, strong SLOs, and cost controls.
What to Deliver
Critique the design and propose concrete improvements addressing the following areas:
- Product and SLOs
  - Clarify product scope and APIs (sync vs. streaming, embeddings vs. generations).
  - Define latency SLOs (e.g., time-to-first-token, per-token latency) and availability SLOs with an explicit error budget; see the worked example below.
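For calibration, a 99.9% monthly availability target leaves roughly 43 minutes of error budget. A minimal sketch of the arithmetic; all target numbers are assumptions for illustration, not values from the design doc:

```python
# Illustrative SLO targets and error-budget arithmetic. The numbers are
# assumptions for the example, not values from the design under review.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200

slo = {
    "availability": 0.999,       # successful requests / total requests
    "ttft_p95_ms": 500,          # time-to-first-token, 95th percentile
    "inter_token_p95_ms": 60,    # per-token decode latency, 95th percentile
}

error_budget_minutes = (1 - slo["availability"]) * MINUTES_PER_MONTH
print(f"Monthly error budget: {error_budget_minutes:.1f} minutes")  # 43.2
```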
- Throughput and Capacity Planning
  - Estimate QPS and token throughput given model characteristics (prefill/decode tokens/s per GPU) and average request sizes; a back-of-the-envelope model follows this list.
  - Size headroom, concurrency, and regional capacity.
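A capacity estimate can start from per-GPU prefill and decode throughput. Every figure below is an assumed placeholder; in practice, benchmark real values per model SKU and batch size:

```python
# Back-of-the-envelope capacity sizing. Every throughput number here is an
# assumed placeholder; benchmark real values per model SKU and batch size.
prefill_tokens_per_s_per_gpu = 20_000   # prompt ingestion rate
decode_tokens_per_s_per_gpu = 1_500     # aggregate across the dynamic batch
avg_prompt_tokens = 800
avg_output_tokens = 300

# GPU-seconds consumed by one average request.
secs_per_request = (avg_prompt_tokens / prefill_tokens_per_s_per_gpu
                    + avg_output_tokens / decode_tokens_per_s_per_gpu)

target_qps = 50
utilization_ceiling = 0.6               # headroom for bursts and AZ loss

gpus_needed = target_qps * secs_per_request / utilization_ceiling
print(f"{secs_per_request:.2f} GPU-s/request -> {gpus_needed:.0f} GPUs/region")
```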
- Autoscaling, Batching, and Accelerator Scheduling
  - Propose request-driven autoscaling signals.
  - Describe dynamic batching windows and batching policies; see the sketch after this list.
  - Plan GPU/accelerator scheduling (MIG, packing, preemption) and warm pools.
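One common batching policy flushes when either a size cap or a time window is reached, trading a small queueing delay for much higher GPU utilization. A minimal asyncio sketch; the cap and window values are assumptions:

```python
import asyncio

MAX_BATCH = 8     # flush once this many requests are queued (assumed cap)
WINDOW_MS = 10    # or once the oldest request has waited this long

async def batcher(queue: asyncio.Queue, run_batch) -> None:
    """Collect requests until MAX_BATCH or WINDOW_MS, then dispatch."""
    loop = asyncio.get_running_loop()
    while True:
        batch = [await queue.get()]            # block for the first request
        deadline = loop.time() + WINDOW_MS / 1000
        while len(batch) < MAX_BATCH:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), remaining))
            except asyncio.TimeoutError:
                break
        await run_batch(batch)                 # hand the batch to a GPU worker
```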
- Model Loading, Versioning, and Rollback
  - Immutable model versions and a model registry.
  - Preload/warm mechanisms, safe rolling updates, canaries, and fast rollback; an illustrative registry shape follows.
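If versions are immutable and content-addressed, rollback reduces to flipping an alias pointer. The registry shape below is hypothetical (field names invented for illustration, not an existing API), but captures that property:

```python
# Hypothetical registry entry; the field names are invented for illustration.
# Artifacts are immutable; aliases are mutable pointers to them.
registry = {
    "chat-large": {
        "stable": "chat-large@sha256:9f2c...",  # immutable, content-addressed
        "canary": "chat-large@sha256:4ab1...",
        "canary_weight": 0.05,                  # 5% of traffic to the canary
    }
}

def resolve(model: str, in_canary: bool) -> str:
    """Rollback = repoint 'stable'; no artifact is ever mutated in place."""
    entry = registry[model]
    return entry["canary"] if in_canary else entry["stable"]
```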
- Multi-tenant Isolation and Rate Limiting
  - Per-tenant quotas, concurrency caps, and weighted-fair queuing; a token-bucket sketch follows this list.
  - Isolation strategies across CPU/GPU/memory.
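A per-tenant token bucket is a simple baseline for quota enforcement; weighted-fair queuing can then arbitrate among admitted tenants. The rates below are assumptions:

```python
import time

class TokenBucket:
    """Per-tenant limiter; rate and burst values are illustrative."""

    def __init__(self, rate_per_s: float, burst: float):
        self.rate, self.capacity = rate_per_s, burst
        self.tokens, self.last = burst, time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at burst capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

buckets = {"tenant-a": TokenBucket(rate_per_s=10, burst=20)}
```

Charging `cost` in estimated tokens rather than requests aligns the limiter with actual GPU consumption, since a 10-token and a 10,000-token request are far from equivalent.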
- Overload and Resilience
  - Backpressure, admission control, bounded queues, and circuit breakers.
  - Queue TTLs, shedding policy, and graceful degradation; see the sketch below.
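Bounding the queue and checking a TTL at dequeue time avoids spending GPU cycles on requests whose callers have already given up. A minimal sketch with assumed limits:

```python
import collections
import time

MAX_QUEUE = 64        # bounded depth (assumed); reject beyond this
QUEUE_TTL_S = 5.0     # entries older than this are dropped at dequeue

queue = collections.deque()

def admit(request) -> bool:
    """Admission control: fail fast (e.g., HTTP 429) instead of queueing
    unboundedly when the system is saturated."""
    if len(queue) >= MAX_QUEUE:
        return False
    queue.append((time.monotonic(), request))
    return True

def next_request():
    """Shed stale work: skip entries whose wait already exceeds the TTL."""
    while queue:
        enqueued_at, request = queue.popleft()
        if time.monotonic() - enqueued_at <= QUEUE_TTL_S:
            return request
    return None
```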
- Idempotency, Retries, and Timeouts
  - Idempotency keys and duplicate suppression.
  - Retry policies with deadlines and jitter; cancellation propagation (backoff sketch below).
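Retries are only safe when paired with an idempotency key so duplicates are suppressed server-side. A backoff helper with capped exponential delay, full jitter, and a hard deadline; the defaults are illustrative:

```python
import random
import time

def retry_with_deadline(op, deadline_s: float,
                        base_s: float = 0.1, cap_s: float = 2.0):
    """Retry op() with capped exponential backoff and full jitter, never
    sleeping past the caller's deadline. Defaults are illustrative."""
    start, attempt = time.monotonic(), 0
    while True:
        try:
            return op()
        except Exception:
            attempt += 1
            sleep = random.uniform(0, min(cap_s, base_s * 2 ** attempt))
            if time.monotonic() + sleep - start > deadline_s:
                raise                    # budget exhausted: propagate failure
            time.sleep(sleep)
```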
- Cold-start Mitigation
  - Weight caching, prewarming, snapshotting/restore, and warm pools; see the prewarm sketch below.
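Prewarming means paying the expensive first-request costs (weight load, kernel/graph compilation, allocator growth) before an instance is marked ready. A hedged sketch; `engine` and both of its methods are placeholders for whatever serving runtime is actually in use:

```python
def prewarm(engine, model_id: str) -> None:
    """Run at startup, before the readiness probe passes. `engine` and its
    methods are placeholders, not a real runtime's API."""
    engine.load_weights(model_id)                    # hit the local weight cache
    engine.generate(prompt="warmup", max_tokens=1)   # trigger JIT/graph capture
```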
- Caching and Streaming
  - Caching for model weights and KV/prompt prefixes.
  - Response/token streaming protocol and flush policy; an SSE framing sketch follows.
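If the streaming protocol is server-sent events (one option; WebSockets or gRPC streams are alternatives), framing and flush policy can be as simple as one event per token or per small token group. The `[DONE]` sentinel is a common convention, not a requirement:

```python
# Minimal SSE framing for token streaming. Flushing every token minimizes
# perceived latency; coalescing a few tokens per event reduces overhead.
def sse_stream(token_iter):
    for token in token_iter:
        yield f"data: {token}\n\n"    # one SSE event per token
    yield "data: [DONE]\n\n"          # explicit end-of-stream sentinel
```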
- Traffic Shaping and Rollouts
  - Canary, A/B, and shadow traffic; a deterministic split sketch follows this list.
  - Safe rollback plans and blast-radius limits.
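Hash-based splitting keeps a given session sticky to one arm of the rollout, which makes canary/stable comparisons cleaner than random per-request assignment. A small sketch; the 5% default is an assumed starting weight:

```python
import hashlib

def in_canary(session_id: str, canary_pct: float = 5.0) -> bool:
    """Deterministic, sticky traffic split: the same session id always
    lands on the same side of the rollout."""
    h = int(hashlib.sha256(session_id.encode()).hexdigest(), 16)
    return (h % 10_000) < canary_pct * 100
```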
- Monitoring, Alerting, and Error Budgets
  - SLIs, dashboards, burn-rate alerts, and per-tenant and per-model views; burn-rate arithmetic below.
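Burn rate is the ratio of the observed error rate to the rate that would exactly exhaust the budget over the SLO window. The multi-window thresholds below follow the widely used SRE-workbook pattern; the exact numbers are a policy choice, shown here as assumptions:

```python
SLO = 0.999
BUDGET = 1 - SLO      # fraction of requests allowed to fail

def burn_rate(error_ratio: float) -> float:
    """1.0 = burning exactly the budget; 14.4 = budget gone in ~2 days
    of a 30-day window."""
    return error_ratio / BUDGET

def should_page(err_1h: float, err_5m: float) -> bool:
    """Fast-burn page: a long and a short window must both agree, so a
    brief blip or an already-recovered incident does not page anyone."""
    return burn_rate(err_1h) >= 14.4 and burn_rate(err_5m) >= 14.4
```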
- Privacy, Safety, Audit, and Cost Controls
  - Data retention, encryption, safety filters, and audit logs.
  - Cost budgets, spend alerts, and efficiency levers; cost-per-token arithmetic below.
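A useful cost lens is dollars per million tokens rather than dollars per instance-hour, since batching efficiency shows up directly in that number. The figures here are assumed for illustration:

```python
# Cost-per-token arithmetic; dollar and throughput figures are assumptions.
gpu_cost_per_hour = 4.00
effective_tokens_per_s = 1_500        # batched decode throughput per GPU

tokens_per_hour = effective_tokens_per_s * 3600
cost_per_million = gpu_cost_per_hour / tokens_per_hour * 1_000_000
print(f"${cost_per_million:.2f} per 1M output tokens")   # ~$0.74
```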
- High-level Architecture and Trade-offs
  - Provide a logical architecture and discuss key trade-offs (latency vs. throughput, isolation vs. utilization, complexity vs. operability).