Design an inference routing and scheduling layer
Company: Anthropic
Role: Machine Learning Engineer
Category: ML System Design
Difficulty: hard
Interview Round: Onsite
Design a routing layer between an API service and heterogeneous inference backends (GPU/CPU) that supports:
1) traffic prioritization across tenants and request classes,
2) dynamic batching,
3) a query result cache, and
4) credit-based fairness across tenants (in the spirit of GPU-credit accounting).

Describe the end-to-end architecture, request lifecycle, APIs, data structures, and algorithms for prioritization, batching, and cache admission/eviction. Explain how you would handle SLAs, backpressure, hot keys, heterogeneous model sizes, multi-tenant isolation, and failure scenarios (e.g., node loss, stragglers, retries). Provide capacity planning, a scaling strategy, and monitoring/alerting signals. Discuss the trade-offs among latency, throughput, and cost, and how you would run experiments to tune batching and caching.
Quick Answer: This question evaluates system-design and production machine-learning engineering skills: distributed systems, routing and scheduling, multi-tenant isolation, caching, dynamic batching, and capacity planning for heterogeneous GPU/CPU inference backends. Illustrative sketches of the core mechanisms (credit-based scheduling, dynamic batching, and cache admission/eviction) follow below.
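One way to approach requirements 1 and 4 together is a per-tenant credit bucket combined with strict priority classes: tenants accrue credits at their quota rate, and dispatching a request spends credits proportional to its estimated GPU-seconds. The sketch below is minimal and hypothetical; `CreditScheduler`, the quota rates, and the cost model are illustrative names and simplifications, not a prescribed design.

```python
import time
from collections import defaultdict, deque

class CreditScheduler:
    """Illustrative credit-based fair scheduler (hypothetical API).

    Each tenant accrues credits at its quota rate; dispatching a request
    spends credits proportional to its estimated GPU-seconds, so heavy
    tenants self-throttle while light tenants stay responsive. Priority
    classes are strict: class 0 always drains before class 1.
    """

    def __init__(self, quotas, max_balance=10.0):
        self.quotas = quotas                  # tenant -> credits per second
        self.balance = {t: max_balance for t in quotas}
        self.max_balance = max_balance
        self.last_refill = time.monotonic()
        # queues[priority][tenant] is a FIFO of (request, estimated_cost)
        self.queues = defaultdict(lambda: defaultdict(deque))

    def submit(self, tenant, priority, request, est_cost):
        self.queues[priority][tenant].append((request, est_cost))

    def _refill(self):
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.last_refill = now
        for tenant, rate in self.quotas.items():
            self.balance[tenant] = min(self.max_balance,
                                       self.balance[tenant] + rate * elapsed)

    def next_request(self):
        """Pick from the highest-priority class with a dispatchable
        tenant; within a class, prefer the tenant holding the most
        credits. Balances may go negative, so oversized requests cannot
        starve; the tenant simply waits longer to refill."""
        self._refill()
        for priority in sorted(self.queues):      # 0 = most urgent
            candidates = [t for t, q in self.queues[priority].items()
                          if q and self.balance[t] > 0]
            if not candidates:
                continue
            tenant = max(candidates, key=lambda t: self.balance[t])
            request, cost = self.queues[priority][tenant].popleft()
            self.balance[tenant] -= cost
            return request
        return None
```

A production version would also add starvation guards for low-priority classes (e.g., priority aging) and reconcile estimated against actual GPU-seconds after completion.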
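For requirement 2, a common pattern is a size-or-deadline batching loop: close a batch when it reaches a maximum size or when the oldest request has waited long enough, whichever comes first. A minimal sketch, assuming requests expose a `model_id` attribute and `dispatch` runs one fused forward pass (both assumptions for illustration):

```python
import queue
import time

def batch_loop(req_queue, dispatch, max_batch_size=32, max_wait_s=0.005):
    """Illustrative size-or-deadline dynamic batcher. Blocks for the
    first request, then fills the batch until it is full or the oldest
    request has waited max_wait_s. Trades a small, bounded queueing
    delay for much higher accelerator utilization."""
    while True:
        first = req_queue.get()                   # block for the first item
        batch = [first]
        deadline = time.monotonic() + max_wait_s
        while len(batch) < max_batch_size:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                nxt = req_queue.get(timeout=remaining)
            except queue.Empty:
                break
            if nxt.model_id != first.model_id:
                # Incompatible model/shape: requeue at the tail (this
                # slightly reorders; a real batcher keeps per-model queues)
                req_queue.put(nxt)
                break
            batch.append(nxt)
        dispatch(batch)                           # one fused forward pass
```

`max_wait_s` bounds the added tail latency and `max_batch_size` bounds memory; sweeping these two knobs against replayed production traffic is the natural tuning experiment the question asks for.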
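For requirement 3, the main risks are one-hit-wonder queries polluting the cache and hot keys dominating eviction decisions. Below is a minimal sketch of LRU eviction with a second-chance admission filter (in the spirit of TinyLFU's doorkeeper); the key normalization shown is illustrative only and would need to respect the model's actual input semantics.

```python
import hashlib
from collections import OrderedDict

class ResultCache:
    """Illustrative result cache: LRU eviction plus a second-chance
    admission filter so one-hit-wonder queries do not evict hot
    entries. Keys are content hashes of (model, normalized input)."""

    def __init__(self, capacity=10_000):
        self.capacity = capacity
        self.data = OrderedDict()       # key -> result, in LRU order
        self.seen_once = set()          # doorkeeper for admission

    @staticmethod
    def make_key(model_id, prompt):
        # Hypothetical normalization; real systems must be careful
        # that normalization never merges semantically distinct inputs.
        canon = f"{model_id}\x00{prompt.strip().lower()}"
        return hashlib.sha256(canon.encode()).hexdigest()

    def get(self, key):
        if key in self.data:
            self.data.move_to_end(key)  # refresh recency
            return self.data[key]
        return None

    def put(self, key, result):
        if key in self.data:
            self.data.move_to_end(key)
            self.data[key] = result
            return
        if key not in self.seen_once:
            self.seen_once.add(key)     # first sighting: don't admit yet
            if len(self.seen_once) > 8 * self.capacity:
                self.seen_once.clear()  # crude periodic reset
            return
        self.seen_once.discard(key)
        if len(self.data) >= self.capacity:
            self.data.popitem(last=False)   # evict least recently used
        self.data[key] = result
```

Hot keys are served from cache once admitted; for skewed multi-node deployments, a small per-router LRU in front of the shared cache can absorb the hottest entries.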