PracHub

Design a batch inference API

Last updated: Jun 21, 2026

Quick Overview

This question evaluates a candidate's ability to design asynchronous batch inference APIs and systems, including API schema and job lifecycle design, idempotency semantics, queueing and worker scaling, batching and accelerator utilization, rate limiting, observability, and error handling.



Company: Anthropic

Role: Software Engineer

Category: ML System Design

Difficulty: hard

Interview Round: Onsite

Design an inference service API where clients POST a job and later poll for results. Requirements: accept single or batch inputs; return a job ID on submission; provide status endpoints (queued, running, succeeded, failed); no streaming required. Specify request/response schemas, idempotency keys, timeout and retry behavior, and rate limits. Describe the job queue, workers, and storage of intermediate and final results; how you would scale workers, batch efficiently, and utilize accelerators; and how you would implement observability, error handling, and partial failures within a batch.

Sep 6, 2025, 12:00 AM
Design an Asynchronous (POST-and-Poll) Inference Service API

Design an asynchronous inference service for serving model predictions. A client submits a single item or a batch of items in one request, immediately receives a job ID, and later polls for status and results. There is no streaming of results back to the client and no synchronous result on submission — submission only acknowledges the work and hands back a handle to track it.

Walk through the API contract, the backing architecture, and how you would operate the service in production. Assume a typical cloud environment with standard building blocks available (HTTP gateway, object storage, message queues, autoscaling, etc.).

Constraints & Assumptions

These anchor the design; treat any number you'd want firmed up as a clarifying question rather than a hard SLA.

  • Workload: GPU-served models (e.g., language or vision), where one inference is the expensive step (tens of milliseconds to seconds per item) rather than the I/O.
  • Async is intentional: clients tolerate end-to-end latency from seconds to minutes, which is what makes queuing, batching, and autoscaling worthwhile.
  • Multi-tenant: many API keys share the fleet; one tenant must not starve or crash another.
  • Batch sizes: a single item up to roughly 10^4 items per job; larger workloads are chunked or referenced by file.
  • Result durability: results are retained for a bounded TTL (e.g., 7 days), then purged.
  • Standard cloud primitives are available; you do not need to build a queue or object store from scratch.

Clarifying Questions to Ask

A strong candidate scopes the problem before designing. Good questions here include:

  • What is the read:write ratio? (Polling reads vastly outnumber submissions, which shapes caching and rate limits.)
  • Which latency SLA matters: time-to-acknowledge on submit, or end-to-end time-to-result? What batch sizes dominate in practice?
  • Are models single-pass, or are there multi-stage pipelines with intermediate artifacts to persist?
  • What are the payload sizes (per item and per batch), and the retention requirements for inputs and results?
  • What delivery guarantee is required: is at-least-once processing with idempotent results acceptable, or is exactly-once mandatory?
  • What is the peak request rate and the tenant isolation requirement (noisy-neighbor tolerance)?

Address each of the four parts below.

Part 1 — API behavior

  • Accept single or batch inputs through the same submission endpoint.
  • Return a job ID immediately on submission.
  • Expose status endpoints that report one of four states: queued, running, succeeded, failed.
  • Polling only — no streaming response is required.

State precisely what each of the four statuses means, including how they behave for a mixed batch where some items succeed and others fail.

What This Part Should Cover

  • Unified contract: a single submit path for one item or many, returning a job_id immediately (the submit is an acknowledgement, not a result).
  • Precise state definitions: what each of queued | running | succeeded | failed means and the exact transition that makes a job terminal.
  • Partial-failure semantics: a defensible rule for a mixed batch that neither discards good results nor hides failures (e.g., an extra job-summary signal beyond the four states).
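
One way to make the state definitions concrete is to derive the job-level status from per-item counts. The sketch below is a minimal illustration, not a prescribed answer: the rule that a job is "failed" only when no item succeeded, the `started` flag, and the `partial_failure` summary field are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from enum import Enum


class JobStatus(str, Enum):
    QUEUED = "queued"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


@dataclass
class ItemCounts:
    total: int
    succeeded: int = 0
    failed: int = 0
    started: bool = False  # True once any worker has picked up an item


def job_status(c: ItemCounts) -> JobStatus:
    """Derive the job-level status from per-item counts (assumed rule)."""
    done = c.succeeded + c.failed
    if done < c.total:
        return JobStatus.RUNNING if (c.started or done > 0) else JobStatus.QUEUED
    # Terminal: every item has a final outcome.
    # Assumed rule: the job is "failed" only if no item succeeded.
    return JobStatus.SUCCEEDED if c.succeeded > 0 else JobStatus.FAILED


def summary(c: ItemCounts) -> dict:
    """Extra job-summary signal so a mixed batch does not hide failures."""
    return {
        "status": job_status(c).value,
        "counts": {"total": c.total, "succeeded": c.succeeded, "failed": c.failed},
        "partial_failure": c.succeeded > 0 and c.failed > 0,
    }
```

Under this rule a 3-item job with 2 successes and 1 failure reports `succeeded` with `partial_failure` set, so good results are kept and the failure stays visible. A stricter rule (any failure fails the job) is equally defensible if stated up front.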

Part 2 — API design

  • Specify the request/response schemas for submission, status, and results.
  • Define idempotency keys and their semantics (so a retried submission doesn't create duplicate work).
  • Define timeout and retry behavior on both the client and server side.
  • Define rate limits and backpressure behavior.
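
As a rough illustration of the submission and status shapes, the sketch below models them as Python dataclasses and shows one way a single endpoint could accept either one item or a batch. Every field name here (`input`, `inputs`, `counts`, `result_url`) is a hypothetical choice for the example, not part of the question.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SubmitResponse:
    job_id: str
    status: str = "queued"  # submission acknowledges work; it never carries results


@dataclass
class StatusResponse:
    job_id: str
    status: str  # queued | running | succeeded | failed
    counts: dict = field(default_factory=dict)  # e.g. {"total": 3, "succeeded": 2, "failed": 1}
    result_url: Optional[str] = None  # hypothetical pointer to the consolidated result


def normalize_inputs(body: dict) -> list:
    """Accept {"input": x} or {"inputs": [...]} and return a list of items."""
    has_one, has_many = "input" in body, "inputs" in body
    if has_one == has_many:
        raise ValueError("provide exactly one of 'input' or 'inputs'")
    items = body["inputs"] if has_many else [body["input"]]
    if not isinstance(items, list) or not items:
        raise ValueError("'inputs' must be a non-empty list")
    return items
```

Normalizing to a list at the edge means everything downstream (queue, workers, results) only ever deals with batches, where a single item is simply a batch of one.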

What This Part Should Cover

  • Concrete schemas: well-shaped request/response bodies for submit, status, per-item results, and a consolidated result, with the status read kept deliberately small.
  • Exactly-once job creation: idempotency keyed on caller + key, race-safe under concurrent retries, with a defined behavior when the same key arrives with a different payload.
  • Retry contract on both sides: which status codes are retryable, client backoff with the same idempotency key, and server-side per-item / per-job timeouts.
  • Two distinct overload signals: per-key rate limiting vs system-wide backpressure, each with a different response code so the client knows whether to slow itself or simply wait.
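
The idempotency bullet above can be sketched with an in-memory registry. This is a simplified stand-in under stated assumptions: a real service would likely rely on a unique constraint on (caller, key) in the metadata DB rather than a process-local lock, and `JobRegistry`, `IdempotencyConflict`, and the `job_` ID format are hypothetical names.

```python
import hashlib
import json
import threading
import uuid


class IdempotencyConflict(Exception):
    """Same key reused with a different payload (could map to an HTTP 409 or 422)."""


class JobRegistry:
    def __init__(self):
        self._lock = threading.Lock()
        self._by_key = {}  # (caller_id, idempotency_key) -> (payload_hash, job_id)

    def create_job(self, caller_id: str, idempotency_key: str, payload: dict):
        """Return (job_id, created). Retries with the same key get the same job_id."""
        digest = hashlib.sha256(
            json.dumps(payload, sort_keys=True).encode("utf-8")
        ).hexdigest()
        scoped_key = (caller_id, idempotency_key)
        with self._lock:  # makes check-then-insert atomic under concurrent retries
            existing = self._by_key.get(scoped_key)
            if existing is not None:
                stored_digest, job_id = existing
                if stored_digest != digest:
                    raise IdempotencyConflict("key reused with a different payload")
                return job_id, False
            job_id = f"job_{uuid.uuid4().hex}"
            self._by_key[scoped_key] = (digest, job_id)
            return job_id, True
```

Scoping the key by caller prevents two tenants who happen to pick the same key from colliding, and hashing the payload gives a defined behavior when a key is reused with different content.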

Part 3 — Architecture

  • Describe the job queue, the workers, and the storage of inputs, intermediate artifacts, and final results.
  • Explain how you would scale workers, batch efficiently, and utilize accelerators (e.g., GPUs).
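
Efficient batching on a worker usually comes down to a fire trigger: run the accelerator when the batch is full or when the oldest item has waited long enough. The sketch below is a minimal, single-threaded illustration of that trigger; the class name, the default limits, and the explicit `now` argument (used instead of a clock so the logic is easy to test) are assumptions for the example.

```python
class MicroBatcher:
    """Collect items and fire when the batch is full or the oldest item is too old."""

    def __init__(self, max_batch_size: int = 8, max_wait_s: float = 0.02):
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s
        self._pending = []
        self._oldest_ts = None  # arrival time of the first item in the pending batch

    def add(self, item, now: float):
        """Queue an item; return a batch if this arrival triggers a fire, else None."""
        if not self._pending:
            self._oldest_ts = now
        self._pending.append(item)
        return self._maybe_fire(now)

    def poll(self, now: float):
        """Called periodically so a partially filled batch still fires on timeout."""
        return self._maybe_fire(now)

    def _maybe_fire(self, now: float):
        if not self._pending:
            return None
        full = len(self._pending) >= self.max_batch_size
        timed_out = (now - self._oldest_ts) >= self.max_wait_s
        if full or timed_out:
            batch, self._pending = self._pending, []
            self._oldest_ts = None
            return batch
        return None
```

The two limits trade latency for utilization: a larger `max_batch_size` improves accelerator throughput, while `max_wait_s` caps how long a lone item can sit waiting for company.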

What This Part Should Cover

  • Decoupled topology: stateless CPU ingress, a durable queue, and GPU workers that scale independently, with the at-least-once delivery semantics that implies.
  • Storage split: large blobs (inputs, intermediates, results) in object storage; only pointers and small metadata in the DB; a clear home for multi-stage intermediates.
  • Accelerator utilization: dynamic/micro-batching with a fire trigger, shape bucketing to bound padding waste, continuous batching for generative decoding, and warm weights.
  • Scaling signal: the autoscaler watches a leading indicator (queue depth / target wait) rather than only GPU utilization, plus priority lanes and per-model pools.
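
To illustrate scaling on a leading indicator, the sketch below sizes the worker pool from queue depth and a target wait rather than from GPU utilization. The formula (keep up with arrivals and drain the current backlog within the target wait) and all parameter names are assumptions for this example; a production autoscaler would also need smoothing and cooldowns.

```python
import math


def desired_workers(
    queue_depth: int,
    arrival_rate: float,           # items/sec currently entering the queue
    per_worker_throughput: float,  # items/sec one worker sustains with batching
    target_wait_s: float,          # how quickly the backlog should be drained
    min_workers: int = 1,
    max_workers: int = 100,
) -> int:
    """Workers needed to absorb arrivals and clear the backlog within the target wait."""
    if per_worker_throughput <= 0 or target_wait_s <= 0:
        raise ValueError("throughput and target wait must be positive")
    required_rate = arrival_rate + queue_depth / target_wait_s
    needed = math.ceil(required_rate / per_worker_throughput)
    return max(min_workers, min(max_workers, needed))
```

For example, a backlog of 600 items, 20 items/sec arriving, 5 items/sec per worker, and a 60-second target gives a required rate of 30 items/sec, so 6 workers. Queue depth rises before GPU utilization saturates, which is why it reacts earlier.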

Part 4 — Operability

  • Implement observability: metrics, logs, and tracing.
  • Define error handling and a standardized error schema.
  • Handle partial failures within a batch (some items succeed while others fail).

What This Part Should Cover

  • Actionable metrics: latency percentiles plus efficiency signals (batch-size distribution, GPU utilization) that reveal whether batching works and whether GPUs are over- or under-provisioned.
  • One error envelope: a stable machine-readable code and retryable flag everywhere, so clients branch on fields not prose.
  • Transient vs permanent handling: retry-with-backoff for transient errors, immediate fail + DLQ for permanent ones.
  • Partial-failure roll-up in practice: counts the client can act on, a way to list only failed items, and a consolidated artifact containing both successes and failures.
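
A single error envelope with a `retryable` flag can drive both client behavior and the worker's retry-or-dead-letter decision. The sketch below is illustrative only: the error codes, the three-attempt limit, and the per-item result shape are hypothetical, not a standard.

```python
from dataclasses import dataclass

# Hypothetical codes treated as transient; anything else is permanent.
TRANSIENT_CODES = {"worker_timeout", "accelerator_oom", "upstream_unavailable"}


@dataclass(frozen=True)
class ApiError:
    code: str        # stable, machine-readable
    message: str     # human-readable; clients should not parse it
    retryable: bool


def make_error(code: str, message: str) -> ApiError:
    return ApiError(code=code, message=message, retryable=code in TRANSIENT_CODES)


def next_action(err: ApiError, attempt: int, max_attempts: int = 3) -> str:
    """Transient errors retry with backoff; permanent or exhausted ones go to the DLQ."""
    if err.retryable and attempt < max_attempts:
        return "retry"
    return "dead_letter"


def failed_items(results: list) -> list:
    """Filter per-item results so a client can list only the failures."""
    return [r for r in results if r["status"] == "failed"]
```

Because `retryable` is derived from the code in one place, every surface (submit errors, per-item errors, job-level errors) stays consistent about what is worth retrying.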

What a Strong Answer Covers

These cross-cutting dimensions span all four parts; an interviewer weighs them across the whole design (these are dimensions, not the answers):

  • Capacity reasoning: a back-of-envelope estimate (items/sec, GPU-seconds per item, payload sizes) that justifies the architecture rather than reciting components.
  • Correctness under retries end-to-end: exactly-once job creation via idempotency (Part 2) combined with at-least-once processing made safe by idempotent per-item write-back (Part 3); the two halves must compose.
  • One coherent contract: the API shapes (Part 2) match the state and partial-failure semantics (Parts 1 and 4) and the storage/result paths (Part 3) without contradiction.
  • Tradeoffs and isolation: naming the cost of each major decision and when to revisit it, plus multi-tenant fairness and security (TLS, signed URLs, least privilege) that run through every part.
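
A back-of-envelope estimate of the kind described under capacity reasoning can be written as a few lines of arithmetic. All inputs below are assumptions picked for illustration (the question gives only ranges): 1,000 items/sec, 50 ms of GPU time per item after batching, a 70% utilization target, 2 KB per result, and the 7-day TTL mentioned in the constraints.

```python
import math


def gpus_needed(items_per_sec: float, gpu_seconds_per_item: float,
                target_utilization: float = 0.7) -> int:
    """GPU count = offered GPU-seconds per second, divided by usable fraction."""
    offered_load = items_per_sec * gpu_seconds_per_item
    return math.ceil(offered_load / target_utilization)


def result_storage_bytes(items_per_sec: int, bytes_per_result: int,
                         ttl_days: int) -> int:
    """Steady-state result storage held until the TTL purges it."""
    seconds_per_day = 86_400
    return items_per_sec * bytes_per_result * seconds_per_day * ttl_days


# Assumed example workload, not a figure from the question.
baseline_gpus = gpus_needed(items_per_sec=1_000, gpu_seconds_per_item=0.05)
baseline_storage = result_storage_bytes(1_000, 2_000, ttl_days=7)
```

With these assumed inputs the offered load is 50 GPU-seconds per second, which at 70% utilization rounds up to 72 GPUs, and retained results come to about 1.2 TB. The point of the exercise is the shape of the calculation; the numbers should be replaced by whatever the interviewer confirms.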

Follow-up Questions

  • How does the design change at 100x scale (e.g., 100k items/sec)? What breaks first: ingress, the queue, the metadata DB, or GPU supply?
  • A client submits a 10k-item job and polls aggressively. How do you keep status reads cheap and return results efficiently without paging through 10k items?
  • A worker crashes after running inference but before acknowledging the message. Walk through exactly what happens and why no result is double-counted or lost.
  • One model version is repointed mid-flight. How do in-flight jobs and later polls stay consistent about which model actually ran?
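
The crash-before-acknowledge follow-up can be reasoned about with a small simulation. The sketch below assumes results are written with a put-if-absent keyed on (job ID, item index) and that counters are bumped only when the write is new; `ResultStore`, `process`, and the uppercase "inference" are hypothetical stand-ins.

```python
class ResultStore:
    """Per-item results keyed by (job_id, item_index); writes are idempotent."""

    def __init__(self):
        self._results = {}
        self.succeeded = 0  # job-level counter, bumped only on a first write

    def put_if_absent(self, job_id: str, item_index: int, output) -> bool:
        key = (job_id, item_index)
        if key in self._results:
            return False  # redelivered message: result already recorded
        self._results[key] = output
        self.succeeded += 1
        return True

    def get(self, job_id: str, item_index: int):
        return self._results.get((job_id, item_index))


def process(message: dict, store: ResultStore, crash_before_ack: bool = False) -> bool:
    """Run 'inference', write back, then ack. Returns True if the message was acked."""
    output = message["input"].upper()  # stand-in for the real model call
    store.put_if_absent(message["job_id"], message["item_index"], output)
    if crash_before_ack:
        return False  # worker died; the queue redelivers after its visibility timeout
    return True
```

If the first delivery crashes after the write, the redelivered message repeats the inference but the second write is a no-op, so the result is neither lost nor double-counted. The cost is wasted GPU time on the duplicate run, which is the usual price of at-least-once delivery.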
