PracHub

Design an inference routing and scheduling layer

Last updated: Mar 29, 2026

Quick Overview

This question evaluates system-design and production machine learning engineering skills, specifically distributed systems, routing and scheduling, multi-tenant isolation, caching, dynamic batching, and capacity planning for heterogeneous GPU/CPU inference backends.



Company: Anthropic

Role: Machine Learning Engineer

Category: ML System Design

Difficulty: hard

Interview Round: Onsite

Design a routing layer between an API service and heterogeneous inference backends (GPU/CPU) that supports: 1) traffic prioritization across tenants and request classes, 2) dynamic batching, 3) a query result cache, and 4) credit-based fairness similar to GPU credits. Describe the end-to-end architecture, request lifecycle, APIs, data structures, and algorithms for prioritization, batching, and cache admission/eviction. Explain how you would handle SLAs, backpressure, hot keys, heterogeneous model sizes, multi-tenant isolation, and failure scenarios (e.g., node loss, stragglers, retries). Provide capacity planning, scaling strategy, and monitoring/alerting signals. Discuss trade-offs among latency, throughput, and cost, and how you would run experiments to tune batching and caching.


Sep 6, 2025, 12:00 AM

System Design: Routing Layer for Heterogeneous Inference Backends (GPU/CPU)

Context

You are asked to design a routing layer that sits between a user-facing API service and a fleet of heterogeneous inference backends (GPU and CPU). The system must serve multiple tenants and request classes while optimizing for latency, throughput, and cost.

Assume the fleet runs a mix of model types/versions and hardware tiers. Requests may be streaming (token-by-token) or non-streaming. Determinism is achievable for specific settings (e.g., temperature=0), enabling a query result cache.
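As a rough illustration of how determinism could gate the query result cache, the sketch below derives a cache key only for requests with temperature=0. The function name `cache_key`, the parameter layout, and the choice to key on model, version, prompt, and the remaining sampling parameters are all assumptions made for this example, not part of the question.

```python
import hashlib
import json
from typing import Optional


def cache_key(model: str, version: str, prompt: str, params: dict) -> Optional[str]:
    """Return a cache key for a deterministic request, or None if not cacheable.

    Assumption: a request is treated as deterministic only when its
    temperature is explicitly 0. A missing temperature is not cacheable.
    """
    if params.get("temperature") != 0:
        return None
    # Temperature is always 0 here, so drop it to avoid 0 vs 0.0 key mismatches.
    rest = {k: v for k, v in params.items() if k != "temperature"}
    # Canonical JSON: sorted keys so parameter ordering does not change the key.
    payload = json.dumps(
        {"model": model, "version": version, "prompt": prompt, "params": rest},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Including the model version in the key is one way to keep entries from an older model from being served after a rollout; whether tenant identity also belongs in the key is an isolation decision the design would need to argue.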

Requirements

Design an end-to-end architecture that supports:

  1. Traffic prioritization across tenants and request classes
  2. Dynamic batching
  3. A query result cache
  4. Credit-based fairness (similar to GPU credits)
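One possible way to realize credit-based fairness is deficit round robin (DRR), sketched below. This is a swapped-in textbook technique, not something the question prescribes. The names `Request`, `TenantQueue`, and `CreditScheduler`, and the use of an abstract integer `cost` (for example, estimated tokens), are hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, List


@dataclass
class Request:
    tenant: str
    cost: int  # hypothetical cost unit, e.g. estimated tokens


@dataclass
class TenantQueue:
    quantum: int                 # credits granted per round (the tenant's weight)
    deficit: int = 0             # unspent credits carried to the next round
    pending: Deque[Request] = field(default_factory=deque)


class CreditScheduler:
    """Deficit round robin over per-tenant FIFO queues."""

    def __init__(self, quanta: Dict[str, int]):
        self.tenants = {name: TenantQueue(quantum=q) for name, q in quanta.items()}

    def submit(self, req: Request) -> None:
        self.tenants[req.tenant].pending.append(req)

    def next_round(self) -> List[Request]:
        """Run one round and return the requests released for dispatch."""
        released: List[Request] = []
        for tq in self.tenants.values():
            if not tq.pending:
                tq.deficit = 0  # idle tenants do not bank credits
                continue
            tq.deficit += tq.quantum
            # Release head-of-line requests while the tenant can afford them.
            while tq.pending and tq.pending[0].cost <= tq.deficit:
                req = tq.pending.popleft()
                tq.deficit -= req.cost
                released.append(req)
            if not tq.pending:
                tq.deficit = 0
        return released
```

With quanta of 100 and 50 credits, a backlogged pair of tenants is served at roughly a 2:1 cost ratio per round. Request-class priority could be layered on top by running one such scheduler per class, which is a further assumption rather than a requirement.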

Describe:

  • Architecture and request lifecycle
  • External and internal APIs
  • Key data structures
  • Algorithms for prioritization, batching, and cache admission/eviction
  • Handling SLAs, backpressure, hot keys, heterogeneous model sizes, multi-tenant isolation, and failures (node loss, stragglers, retries)
  • Capacity planning, scaling strategy, and monitoring/alerting signals
  • Trade-offs among latency, throughput, and cost, and how to run experiments to tune batching and caching
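The latency versus throughput trade-off in dynamic batching often reduces to two knobs: a maximum batch size and a maximum wait time. The sketch below shows one minimal policy that flushes when either limit is reached. The class name `Batcher`, its fields, and the explicit `now` argument (used instead of a real clock so the behavior is easy to test) are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Batcher:
    max_batch_size: int          # flush as soon as this many items are queued
    max_wait_s: float            # flush once the oldest item has waited this long
    _items: List[str] = field(default_factory=list)
    _oldest: Optional[float] = None

    def add(self, item: str, now: float) -> Optional[List[str]]:
        """Queue an item; return a batch if this addition triggers a flush."""
        if not self._items:
            self._oldest = now   # start the wait timer on the first item
        self._items.append(item)
        return self.poll(now)

    def poll(self, now: float) -> Optional[List[str]]:
        """Return a batch if the size or wait limit is reached, else None."""
        if not self._items:
            return None
        full = len(self._items) >= self.max_batch_size
        expired = now - self._oldest >= self.max_wait_s
        if full or expired:
            batch, self._items, self._oldest = self._items, [], None
            return batch
        return None
```

Raising `max_wait_s` tends to produce fuller batches at the cost of added queueing delay, so an experiment to tune batching could sweep these two parameters per request class and compare tail latency against accelerator utilization.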
