System Design: Routing Layer for Heterogeneous Inference Backends (GPU/CPU)
Context
You are asked to design a routing layer that sits between a user-facing API service and a fleet of heterogeneous inference backends (GPU and CPU). The system must serve multiple tenants and request classes while optimizing for latency, throughput, and cost.
Assume the fleet runs a mix of model types/versions and hardware tiers. Requests may be streaming (token-by-token) or non-streaming. Responses are deterministic under certain sampling settings (e.g., temperature=0), which makes a query result cache feasible for those requests.
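For concreteness, a minimal sketch of how such a cache might key deterministic requests; the field names (model_id, params) and the greedy-decoding assumption are illustrative, not part of the spec:

```python
import hashlib
import json

def cache_key(model_id: str, model_version: str, prompt: str, params: dict) -> str | None:
    """Return a cache key for deterministic requests, else None.

    Assumes temperature=0 (greedy decoding) makes the response a pure
    function of (model, version, prompt, decoding parameters).
    """
    if params.get("temperature", 1.0) != 0:
        return None  # sampling is stochastic; result is not cacheable
    canonical = json.dumps(
        {"model": model_id, "version": model_version, "prompt": prompt, "params": params},
        sort_keys=True,  # canonical ordering so identical requests hash identically
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```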
Requirements
Design an end-to-end architecture that supports:
- Traffic prioritization across tenants and request classes
- Dynamic batching
- A query result cache
- Credit-based fairness (similar to GPU credits); see the sketch after this list
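As one reading of the prioritization and credit requirements, a hedged sketch of a per-tenant credit bucket feeding a priority queue; TenantBucket, estimated_cost, and the (priority, arrival order) ordering are assumptions chosen for illustration:

```python
import heapq
import itertools
import time
from dataclasses import dataclass, field

@dataclass
class TenantBucket:
    """Per-tenant credits that refill at a fixed rate up to a cap (token bucket)."""
    refill_rate: float   # credits per second
    capacity: float      # maximum stored credits
    credits: float = 0.0
    last_refill: float = field(default_factory=time.monotonic)

    def try_spend(self, cost: float) -> bool:
        # Refill lazily based on elapsed time, then spend if enough credits remain.
        now = time.monotonic()
        self.credits = min(self.capacity,
                           self.credits + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.credits >= cost:
            self.credits -= cost
            return True
        return False

class FairPriorityQueue:
    """Admit a request only if its tenant has credits; dispatch by (class priority, arrival)."""
    def __init__(self, buckets: dict[str, TenantBucket]):
        self.buckets = buckets
        self._heap: list[tuple[int, int, dict]] = []
        self._seq = itertools.count()  # FIFO tie-breaker within a priority class

    def submit(self, request: dict) -> bool:
        if not self.buckets[request["tenant"]].try_spend(request["estimated_cost"]):
            return False  # tenant out of credits: reject, or park in a best-effort queue
        heapq.heappush(self._heap, (request["priority"], next(self._seq), request))
        return True

    def next_request(self) -> dict | None:
        return heapq.heappop(self._heap)[2] if self._heap else None
```

A production design would likely charge credits against measured GPU time after completion rather than an upfront estimate; the estimate here only keeps the sketch simple.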
Describe:
- Architecture and request lifecycle
- External and internal APIs
- Key data structures
- Algorithms for prioritization, batching, and cache admission/eviction (a batching sketch follows this list)
- Handling of SLAs, backpressure, hot keys, heterogeneous model sizes, multi-tenant isolation, and failures (node loss, stragglers, retries)
- Capacity planning, scaling strategy, and monitoring/alerting signals
- Trade-offs among latency, throughput, and cost, and how you would run experiments to tune batching and caching
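For the batching item above, one common policy is size-or-deadline batching: dispatch when the batch is full or when the oldest request has waited a maximum time. A minimal sketch, assuming a per-backend queue.Queue of requests and illustrative max_batch_size / max_wait_s parameters:

```python
import queue
import time

def collect_batch(requests: queue.Queue, max_batch_size: int, max_wait_s: float) -> list:
    """Drain up to max_batch_size requests, waiting at most max_wait_s
    after the first request arrives (size-or-deadline policy)."""
    batch = [requests.get()]                  # block until at least one request exists
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break                             # deadline reached: ship a partial batch
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break
    return batch
```

Sweeping max_batch_size and max_wait_s against tail latency, throughput, and accelerator utilization is the kind of experiment the final item asks you to design.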