This question evaluates system-design and production machine learning engineering skills, specifically distributed systems, routing and scheduling, multi-tenant isolation, caching, dynamic batching, and capacity planning for heterogeneous GPU/CPU inference backends.

You are asked to design a routing layer that sits between a user-facing API service and a fleet of heterogeneous inference backends (GPU and CPU). The system must serve multiple tenants and request classes while optimizing for latency, throughput, and cost.
Assume the fleet runs a mix of model types/versions and hardware tiers. Requests may be streaming (token-by-token) or non-streaming. For certain settings (e.g., temperature=0) results are deterministic, which enables a query result cache.
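Because a temperature=0 request always yields the same output for the same input and model version, the cache key can be derived from the request contents themselves. Below is a minimal sketch, assuming a JSON-like request dict; the field names (model, model_version, prompt, top_p, max_tokens) are illustrative, not a fixed API.

```python
import hashlib
import json

def cache_key(request: dict) -> str | None:
    """Return a result-cache key for deterministic requests, or None if not cacheable."""
    # Only temperature=0 requests are treated as deterministic and cacheable.
    if request.get("temperature", 1.0) != 0:
        return None
    # Key on every field that affects the output, including the model version,
    # so a model rollout never serves stale results.
    material = json.dumps(
        {
            "model": request["model"],
            "model_version": request["model_version"],
            "prompt": request["prompt"],
            "max_tokens": request.get("max_tokens"),
            "top_p": request.get("top_p"),
        },
        sort_keys=True,
    )
    return hashlib.sha256(material.encode("utf-8")).hexdigest()
```

In this sketch the cache lookup would sit in the routing layer, in front of scheduling, so that hits never consume backend capacity.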
Design an end-to-end architecture that supports:
- routing and scheduling of requests across heterogeneous GPU/CPU backends and model versions (a minimal sketch follows below)
- multi-tenant isolation across tenants and request classes
- a query result cache for deterministic (e.g., temperature=0) requests
- dynamic batching for both streaming and non-streaming requests
- capacity planning across hardware tiers
Describe the main components, how requests flow through them, and the trade-offs among latency, throughput, and cost.
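As a starting point for the routing and scheduling piece, here is a minimal sketch of a scorer that trades estimated queueing delay against cost when picking a backend. The Backend fields, the scoring formula, and the weights are assumptions for illustration, not part of the question.

```python
from dataclasses import dataclass

@dataclass
class Backend:
    name: str
    hardware: str            # e.g. "gpu-a100" or "cpu"
    models: set[str]         # model versions this backend can serve
    queue_depth: int         # requests currently waiting
    est_latency_ms: float    # recent per-request latency estimate
    cost_per_request: float  # relative cost of serving one request here

def pick_backend(backends: list[Backend], model: str,
                 latency_weight: float = 1.0,
                 cost_weight: float = 100.0) -> Backend:
    """Choose the backend with the lowest combined latency/cost score.

    Score = estimated time to drain the queue plus this request,
    plus a weighted cost penalty. The weights are knobs a real router
    would tune per request class (e.g. interactive vs. batch).
    """
    candidates = [b for b in backends if model in b.models]
    if not candidates:
        raise LookupError(f"no backend serves model {model!r}")
    return min(
        candidates,
        key=lambda b: (
            latency_weight * (b.queue_depth + 1) * b.est_latency_ms
            + cost_weight * b.cost_per_request
        ),
    )
```

For example, an interactive streaming request might use a high latency_weight so it lands on a lightly loaded GPU, while a batch request with a low latency_weight would drift toward cheaper CPU backends.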