System Design: Routing Layer for Heterogeneous Inference Backends (GPU/CPU)
Context
You are asked to design a routing layer that sits between a user-facing API service and a fleet of heterogeneous inference backends (GPU and CPU). The system must serve multiple tenants and request classes while optimizing for latency, throughput, and cost.
Assume the fleet runs a mix of model types/versions and hardware tiers. Requests may be streaming (token-by-token) or non-streaming. Responses are deterministic under certain sampling settings (e.g., temperature=0), which makes a query result cache feasible for those requests.
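For concreteness, a minimal sketch of how such a cache might key deterministic requests; the field names (model_id, params) and the greedy-decoding assumption are illustrative, not part of the spec:

```python
import hashlib
import json

def cache_key(model_id: str, model_version: str, prompt: str, params: dict) -> str | None:
    """Return a cache key for deterministic requests, else None.

    Assumes temperature=0 (greedy decoding) makes the response a pure
    function of (model, version, prompt, decoding parameters).
    """
    if params.get("temperature", 1.0) != 0:
        return None  # sampling is stochastic; result is not cacheable
    canonical = json.dumps(
        {"model": model_id, "version": model_version, "prompt": prompt, "params": params},
        sort_keys=True,  # canonical ordering so identical requests hash identically
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```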
Requirements
Design an end-to-end architecture that supports:
- Traffic prioritization across tenants and request classes
- Dynamic batching
- A query result cache
- Credit-based fairness (similar to GPU credits); see the sketch after this list
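As one reading of the prioritization and credit requirements, a hedged sketch of a per-tenant credit bucket feeding a priority queue; TenantBucket, estimated_cost, and the (priority, arrival order) ordering are assumptions chosen for illustration:

```python
import heapq
import itertools
import time
from dataclasses import dataclass, field

@dataclass
class TenantBucket:
    """Per-tenant credits that refill at a fixed rate up to a cap (token bucket)."""
    refill_rate: float   # credits per second
    capacity: float      # maximum stored credits
    credits: float = 0.0
    last_refill: float = field(default_factory=time.monotonic)

    def try_spend(self, cost: float) -> bool:
        # Refill lazily based on elapsed time, then spend if enough credits remain.
        now = time.monotonic()
        self.credits = min(self.capacity,
                           self.credits + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.credits >= cost:
            self.credits -= cost
            return True
        return False

class FairPriorityQueue:
    """Admit a request only if its tenant has credits; dispatch by (class priority, arrival)."""
    def __init__(self, buckets: dict[str, TenantBucket]):
        self.buckets = buckets
        self._heap: list[tuple[int, int, dict]] = []
        self._seq = itertools.count()  # FIFO tie-breaker within a priority class

    def submit(self, request: dict) -> bool:
        if not self.buckets[request["tenant"]].try_spend(request["estimated_cost"]):
            return False  # tenant out of credits: reject, or park in a best-effort queue
        heapq.heappush(self._heap, (request["priority"], next(self._seq), request))
        return True

    def next_request(self) -> dict | None:
        return heapq.heappop(self._heap)[2] if self._heap else None
```

A production design would likely charge credits against measured GPU time after completion rather than an upfront estimate; the estimate here only keeps the sketch simple.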
Describe:
- Architecture and request lifecycle
- External and internal APIs
- Key data structures
- Algorithms for prioritization, batching, and cache admission/eviction (a batching sketch follows this list)
- Handling of SLAs, backpressure, hot keys, heterogeneous model sizes, multi-tenant isolation, and failures (node loss, stragglers, retries)
- Capacity planning, scaling strategy, and monitoring/alerting signals
- Trade-offs among latency, throughput, and cost, and how you would run experiments to tune batching and caching
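For the batching item above, one common policy is size-or-deadline batching: dispatch when the batch is full or when the oldest request has waited a maximum time. A minimal sketch, assuming a per-backend queue.Queue of requests and illustrative max_batch_size / max_wait_s parameters:

```python
import queue
import time

def collect_batch(requests: queue.Queue, max_batch_size: int, max_wait_s: float) -> list:
    """Drain up to max_batch_size requests, waiting at most max_wait_s
    after the first request arrives (size-or-deadline policy)."""
    batch = [requests.get()]                  # block until at least one request exists
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break                             # deadline reached: ship a partial batch
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break
    return batch
```

Sweeping max_batch_size and max_wait_s against tail latency, throughput, and accelerator utilization is the kind of experiment the final item asks you to design.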