
Design an inference routing and scheduling layer

Last updated: Mar 29, 2026

Quick Overview

This question evaluates system-design and production machine learning engineering skills, specifically distributed systems, routing and scheduling, multi-tenant isolation, caching, dynamic batching, and capacity planning for heterogeneous GPU/CPU inference backends.

Company: Anthropic

Role: Machine Learning Engineer

Category: ML System Design

Difficulty: hard

Interview Round: Onsite

Design a routing layer between an API service and heterogeneous inference backends (GPU/CPU) that supports: 1) traffic prioritization across tenants and request classes, 2) dynamic batching, 3) a query result cache, and 4) credit-based fairness similar to GPU credits. Describe the end-to-end architecture, request lifecycle, APIs, data structures, and algorithms for prioritization, batching, and cache admission/eviction. Explain how you would handle SLAs, backpressure, hot keys, heterogeneous model sizes, multi-tenant isolation, and failure scenarios (e.g., node loss, stragglers, retries). Provide capacity planning, scaling strategy, and monitoring/alerting signals. Discuss trade-offs among latency, throughput, and cost, and how you would run experiments to tune batching and caching.

System Design: Routing Layer for Heterogeneous Inference Backends (GPU/CPU)

Context

You are asked to design a routing layer that sits between a user-facing API service and a fleet of heterogeneous inference backends (GPU and CPU). The system must serve multiple tenants and request classes while optimizing for latency, throughput, and cost.

Assume the fleet runs a mix of model types/versions and hardware tiers. Requests may be streaming (token-by-token) or non-streaming. Determinism is achievable for specific settings (e.g., temperature=0), enabling a query result cache.
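
To make the caching premise concrete, here is a minimal sketch of cache-key derivation for deterministic requests. The function and field names (cache_key, model, params, and so on) are illustrative assumptions, not part of the question:

```python
import hashlib
import json

def cache_key(model: str, version: str, prompt: str, params: dict) -> str | None:
    """Return a stable cache key, or None if the request is not cacheable.

    Only deterministic requests (temperature == 0) are safe to cache,
    because sampled outputs differ between calls.
    """
    if params.get("temperature", 1.0) != 0:
        return None
    # Canonical JSON (sorted keys, fixed separators) so logically equal
    # requests always hash to the same key.
    payload = json.dumps(
        {"model": model, "version": version, "prompt": prompt, "params": params},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Including the model version in the key prevents stale hits after a rollout, and counting lookups per key is one way to detect the hot keys mentioned below.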

Requirements

Design an end-to-end architecture that supports:

  1. Traffic prioritization across tenants and request classes
  2. Dynamic batching
  3. A query result cache
  4. Credit-based fairness (similar to GPU credits; a sketch follows this list)
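
As one concrete reading of requirement 4, here is a minimal sketch of credit-based fairness as a per-tenant token bucket. The class name, rates, and cost model are assumptions for illustration:

```python
import time

class CreditBucket:
    """Per-tenant credits: refill at a steady rate up to a burst cap;
    each request spends credits proportional to its estimated cost
    (e.g., predicted GPU-seconds or output tokens)."""

    def __init__(self, refill_per_s: float, burst: float):
        self.refill_per_s = refill_per_s
        self.burst = burst
        self.credits = burst
        self.last = time.monotonic()

    def try_spend(self, cost: float) -> bool:
        # Refill credits for the elapsed time, capped at the burst size.
        now = time.monotonic()
        self.credits = min(self.burst, self.credits + (now - self.last) * self.refill_per_s)
        self.last = now
        if self.credits >= cost:
            self.credits -= cost
            return True
        return False  # out of credits: demote to best-effort or shed
```

A scheduler would call try_spend before dispatching a tenant's request; tenants that exhaust their credits can be demoted to a best-effort class rather than rejected outright, which preserves isolation without leaving capacity idle.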

Describe:

  • Architecture and request lifecycle
  • External and internal APIs
  • Key data structures
  • Algorithms for prioritization, batching, and cache admission/eviction (a batching sketch follows this list)
  • Handling SLAs, backpressure, hot keys, heterogeneous model sizes, multi-tenant isolation, and failures (node loss, stragglers, retries)
  • Capacity planning, scaling strategy, and monitoring/alerting signals
  • Trade-offs among latency, throughput, and cost, and how to run experiments to tune batching and caching
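
For the batching algorithm specifically, a common shape is a window that flushes on whichever comes first, a size cap or a deadline. The sketch below is one such loop under assumed names (run_batch stands in for padding/concatenating inputs and dispatching to a backend), not the expected answer:

```python
import queue
import time

def batch_loop(requests: queue.Queue, run_batch, max_batch: int = 16, max_wait_s: float = 0.010):
    """Collect requests until the batch is full or the oldest request
    has waited max_wait_s, then dispatch the whole batch."""
    while True:
        batch = [requests.get()]  # block until work arrives
        deadline = time.monotonic() + max_wait_s
        while len(batch) < max_batch:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break  # oldest request has waited long enough
            try:
                batch.append(requests.get(timeout=remaining))
            except queue.Empty:
                break  # no more work arrived within the window
        run_batch(batch)
```

The two knobs, max_batch and max_wait_s, are precisely what the tuning experiments should sweep: larger values raise GPU utilization and throughput at the cost of added tail latency, which is the central latency/throughput/cost trade-off above.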

