PracHub

Design a batched inference API

Last updated: Apr 22, 2026

Quick Overview

This question evaluates competency in designing scalable, low-latency ML inference systems with dynamic batching, covering system architecture, request batching and scheduling, model routing/versioning, and operational concerns such as autoscaling, reliability, timeouts, and observability.

Company: Anthropic

Role: Software Engineer

Category: ML System Design

Difficulty: hard

Interview Round: Onsite

Related Interview Questions

  • Design Model Weight Distribution - Anthropic (medium)
  • Design GPU inference request batching - Anthropic
  • How do you handle an LLM agents interview? - Anthropic (hard)
  • Design a prompt playground - Anthropic (medium)
  • Design a model downloader - Anthropic (medium)
Feb 8, 2026, 12:00 AM

Design an online machine learning inference service that supports dynamic batching.

Multiple clients send small synchronous prediction requests to an API. Running each request individually wastes GPU capacity, so the system should combine compatible requests into batches before model execution. At the same time, the service must still meet a latency SLO for online traffic.
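
One way to picture the batching requirement is a queue that flushes when either a size limit or a wait limit is reached. The sketch below is illustrative only: `Request`, `DynamicBatcher`, `max_batch_size`, and `max_wait_ms` are hypothetical names, not part of the question, and the clock is passed in explicitly so the policy can be reasoned about without threads or a real GPU.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Optional


@dataclass
class Request:
    """Hypothetical request record; field names are assumptions."""
    request_id: str
    payload: list        # model input, shape left unspecified
    arrival_ms: float    # arrival time in milliseconds


class DynamicBatcher:
    """Flush when the queue can fill a batch or the oldest request has waited too long."""

    def __init__(self, max_batch_size: int, max_wait_ms: float) -> None:
        self.max_batch_size = max_batch_size
        self.max_wait_ms = max_wait_ms
        self._queue: Deque[Request] = deque()

    def submit(self, request: Request) -> None:
        # Requests are served in arrival order (FIFO).
        self._queue.append(request)

    def poll(self, now_ms: float) -> Optional[List[Request]]:
        """Return the next batch if a flush condition holds, otherwise None."""
        if not self._queue:
            return None
        full = len(self._queue) >= self.max_batch_size
        expired = now_ms - self._queue[0].arrival_ms >= self.max_wait_ms
        if not (full or expired):
            return None
        size = min(self.max_batch_size, len(self._queue))
        return [self._queue.popleft() for _ in range(size)]


batcher = DynamicBatcher(max_batch_size=2, max_wait_ms=10)
batcher.submit(Request("a", [1], arrival_ms=0))
first_poll = batcher.poll(now_ms=5)        # neither condition holds yet
batcher.submit(Request("b", [2], arrival_ms=6))
full_batch = batcher.poll(now_ms=6)        # size limit reached
batcher.submit(Request("c", [3], arrival_ms=20))
early_poll = batcher.poll(now_ms=25)       # still within the wait budget
timed_out_batch = batcher.poll(now_ms=30)  # wait limit reached
```

In this sketch `max_wait_ms` bounds the queueing delay added to any request, so it would have to be chosen well below the latency SLO once model execution time is accounted for. A larger `max_batch_size` tends to raise throughput at the cost of tail latency.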

Discuss:

  • The external API and request/response schema
  • How requests are grouped into compatible batches
  • Queueing and scheduling logic, including max batch size and max wait time
  • Handling variable input sizes or sequence lengths
  • Model routing and versioning
  • Timeouts, cancellations, and partial failures
  • Autoscaling, reliability, and observability
  • Trade-offs between latency, throughput, and cost
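
The points on compatible batches, variable sequence lengths, and model versioning can be sketched together as a grouping key. This is a minimal illustration under stated assumptions, not a reference answer: the bucket boundaries in `BUCKET_BOUNDS`, the `(model, version, tokens)` request tuple, and all function names are hypothetical. Here requests are batched only with others that share a model, a version, and a length bucket, and each batch is padded to its longest member.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical sequence-length buckets; real boundaries would be tuned to traffic.
BUCKET_BOUNDS = (32, 128, 512)

BatchKey = Tuple[str, str, int]
RequestTuple = Tuple[str, str, List[int]]  # (model, version, tokens)


def length_bucket(seq_len: int) -> int:
    """Return the smallest bucket bound that fits seq_len."""
    for bound in BUCKET_BOUNDS:
        if seq_len <= bound:
            return bound
    raise ValueError(f"sequence length {seq_len} exceeds {BUCKET_BOUNDS[-1]}")


def batch_key(model: str, version: str, tokens: List[int]) -> BatchKey:
    """Requests are compatible only if model, version, and length bucket all match."""
    return (model, version, length_bucket(len(tokens)))


def group_requests(requests: List[RequestTuple]) -> Dict[BatchKey, List[List[int]]]:
    """Partition requests into per-key lists, preserving arrival order within a key."""
    groups: Dict[BatchKey, List[List[int]]] = defaultdict(list)
    for model, version, tokens in requests:
        groups[batch_key(model, version, tokens)].append(tokens)
    return dict(groups)


def pad_batch(batch: List[List[int]], pad_id: int = 0) -> List[List[int]]:
    """Right-pad every sequence to the longest one in the batch."""
    width = max(len(tokens) for tokens in batch)
    return [tokens + [pad_id] * (width - len(tokens)) for tokens in batch]


requests = [
    ("clf", "v1", [1, 2, 3]),
    ("clf", "v1", [4]),
    ("clf", "v2", [5, 6]),
    ("clf", "v1", [7] * 40),
]
groups = group_requests(requests)
padded = pad_batch(groups[("clf", "v1", 32)])
```

Bucketing limits the padding wasted when short and long inputs share a batch, but it also splits traffic across more queues, so each queue fills more slowly. That is one concrete form of the latency versus throughput trade-off the question asks about.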
