Design GPU inference request batching

Last updated: Jun 17, 2026

Quick Overview

This question evaluates a candidate's competency in ML System Design, focusing on GPU-based inference batching, scheduling, autoscaling, and the operational concerns of routing, compatibility, and fault-tolerance.

Company: Anthropic

Role: Software Engineer

Category: ML System Design

Interview Round: Onsite

Design a system that serves **online model-inference requests on GPUs**. Requests arrive one at a time from clients, but GPU throughput is far higher when compatible requests are grouped into batches: a larger batch amortizes the fixed per-step cost (kernel launches, reading weights from HBM) across more requests. Every request you add to a batch, however, makes earlier-arriving requests wait, so the system must form the largest *useful* batch it can without blowing any single request's latency budget.

Design a service that:

- accepts low-latency inference requests over an online API,
- batches *compatible* requests together,
- routes work to GPU workers,
- supports multiple models and model versions concurrently,
- balances throughput (cost per request) against latency SLOs,
- handles overload, failures, and observability.

Your design should cover the **API**, the **queueing model**, the **batching strategy and scheduling policy**, the **worker lifecycle**, the **autoscaling signals**, and the **main trade-offs**.

```hint Where to start
Frame the whole design around one tension: a bigger batch improves GPU efficiency but forces earlier requests to wait. Almost every decision (batch size, wait time, bucketing) is a point on that throughput-vs-latency curve. It helps to break end-to-end latency into stages so you can reason about which one the batching layer actually controls, and which the rest of the system has to keep small and predictable.
```

```hint What "compatible" means
Two requests can only share one kernel call if they agree on everything that defines the computation. Enumerate what those attributes are, and watch for the one whose mismatch is a *correctness* bug rather than just an efficiency loss. The subtler attribute is input shape: padding a 16-token request up to a 2,000-token batch-mate means it pays 2,000-token compute. Think about how you'd group by length and what trade-off finer grouping creates.
```

```hint The scheduler's flush rule
A batch can't grow forever, so what makes the scheduler stop waiting and dispatch? List the distinct triggers you'd want; aim for more than the obvious "it's full." For any time-based "linger" limit, ask what number it can take *without* eating the whole SLO: a strong answer ties it to the budget rather than picking a round number.
```

```hint The LLM-specific twist
A static "form one batch, run it to completion, return" rule behaves very differently when each request emits a variable, unknown number of output tokens than when every request is a single fixed-cost forward pass. Reason about what happens to a short reply that shares a static batch with a very long one, and about what frees up (or doesn't) when one sequence finishes mid-batch. That should push you toward a different scheduling granularity, and a different binding resource, for the generation case.
```

```hint Autoscaling pitfall
GPU utilization alone is a trap: you can see low utilization and still miss the SLO when traffic is fragmented across incompatible buckets that each run tiny batches. Think about what *leading* signal best predicts SLO risk.
```
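
The padding cost raised in the hint on compatibility can be made concrete with a small calculation. The sketch below is illustrative only: the bucket boundaries are hypothetical, and it assumes compute scales linearly with padded tokens, which ignores the quadratic attention term.

```python
from collections import defaultdict

def padded_tokens(lengths):
    """Tokens computed when every sequence is padded to the batch maximum."""
    return max(lengths) * len(lengths)

def padding_waste(lengths):
    """Fraction of computed tokens that are padding rather than real input."""
    total = padded_tokens(lengths)
    return (total - sum(lengths)) / total

def bucket_for(length, boundaries=(32, 128, 512, 2048)):
    """Smallest boundary that fits the request (boundaries are hypothetical)."""
    for b in boundaries:
        if length <= b:
            return b
    raise ValueError(f"length {length} exceeds the largest bucket")

def group_by_bucket(lengths, boundaries=(32, 128, 512, 2048)):
    """Group request lengths by the bucket each one falls into."""
    groups = defaultdict(list)
    for n in lengths:
        groups[bucket_for(n, boundaries)].append(n)
    return dict(groups)

lengths = [16, 2000, 20, 1900]
mixed = padding_waste(lengths)                   # one batch, no bucketing
groups = group_by_bucket(lengths)                # {32: [...], 2048: [...]}
bucketed = {b: padding_waste(g) for b, g in groups.items()}
```

In this toy input, about half of the computed tokens in the unbucketed batch are padding, versus 10% and 2.5% in the two bucketed batches. The cost of finer buckets is that each one sees less traffic and therefore fills more slowly.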

Mar 13, 2026, 12:00 AM

Constraints & Assumptions

State your own where the interviewer leaves them open, but a reasonable default scenario:

  • Online, synchronous-ish API with a tail latency SLO — e.g. p95 of a few hundred ms for a fixed-cost model, or p95 time-to-first-token plus a per-token target for autoregressive generation.
  • Heterogeneous workload: multiple distinct models/versions, a mix of input shapes (e.g. text sequence lengths, image sizes), and a request-rate that varies diurnally with spikes.
  • Multi-tenant: several clients share the fleet; no single tenant should be able to starve the others.
  • Inference is read-only / side-effect-free — there is no external state to corrupt, which shapes how you think about retries and idempotency.
  • GPU capacity is the scarce, expensive resource; GPU pods are slow (tens of seconds) to spin up.
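
One way to approach the multi-tenant constraint is to keep a queue per tenant and fill each batch round-robin, so a burst from one tenant cannot occupy every slot. The class below is a minimal sketch under that assumption; `FairQueue` and its methods are hypothetical names, and a real design would likely add weights or quotas.

```python
from collections import OrderedDict, deque

class FairQueue:
    """Per-tenant FIFO queues drained round-robin (hypothetical helper)."""

    def __init__(self):
        self._queues = OrderedDict()  # tenant -> deque of pending requests

    def push(self, tenant, request):
        self._queues.setdefault(tenant, deque()).append(request)

    def pop_batch(self, max_size):
        """Take up to max_size requests, at most one per tenant per round."""
        batch = []
        while len(batch) < max_size and self._queues:
            for tenant in list(self._queues):
                q = self._queues[tenant]
                batch.append(q.popleft())
                if q:
                    # Served tenants go to the back so the next call
                    # starts with whoever has waited longest for a turn.
                    self._queues.move_to_end(tenant)
                else:
                    del self._queues[tenant]
                if len(batch) == max_size:
                    break
        return batch

fq = FairQueue()
for i in range(1, 6):
    fq.push("A", f"a{i}")   # tenant A sends a burst
fq.push("B", "b1")          # tenant B sends a single request
first = fq.pop_batch(4)
second = fq.pop_batch(4)
```

Here tenant B's single request is included in the first batch even though tenant A enqueued five requests ahead of it.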

Clarifying Questions to Ask

  • Is the workload fixed-cost (classifiers, embeddings, rerankers, a single forward pass) or autoregressive generation (variable, unknown output length)? They need fundamentally different schedulers.
  • What is the exact latency SLO, and is it on end-to-end latency, time-to-first-token, or per-token throughput?
  • What is the expected QPS, the request-size distribution (sequence lengths / image sizes), and how many distinct models and versions must be served at once?
  • Is the system multi-tenant with fairness/quota requirements, or single-tenant?
  • How strict is the durability requirement — is "a crashed in-flight request just times out and the client retries" acceptable, or must no request be lost?
  • Is streaming (token-by-token) output required, or only a single final response?

What a Strong Answer Covers

  • A clear statement of the central tension (batch size vs. tail latency) and a latency decomposition that isolates what the batching layer controls.
  • A precise definition of the batch key / compatibility, including which request attributes must agree, the distinction between a mismatch that wastes compute and one that is a correctness bug, and shape/length bucketing.
  • A concrete batching policy: a multi-condition flush rule and a principled way to derive the linger time from the SLO rather than guessing.
  • Recognition that autoregressive generation needs a different scheduler than fixed-cost models, with a justified account of what changes about scheduling granularity and which resource becomes the binding admission constraint.
  • An end-to-end request flow, a worker lifecycle (warm weights, readiness/drain, version rollout, failure isolation), and overload handling via admission control / backpressure / graceful degradation.
  • Autoscaling on signals that actually predict SLO risk — reasoning about why raw GPU utilization can mislead — and observability that can distinguish "slow because we wait to batch" from "slow because the model is slow."
  • An explicit list of trade-offs and a sensible "what I'd ship first" that starts simple and specializes only when justified.
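
The batch key and multi-condition flush rule in the list above could be sketched as follows. This is one possible shape, not a reference design: `BatchKey`, `PendingBatch`, `max_linger`, and `should_flush` are hypothetical names, the three triggers are an assumed set, and the 0.5 safety factor is an arbitrary illustration of tying linger time to the SLO slack.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class BatchKey:
    """Attributes assumed to require agreement before requests share a batch."""
    model: str
    version: str
    length_bucket: int

@dataclass
class PendingBatch:
    key: BatchKey
    opened_at: float                               # arrival of oldest request (s)
    deadlines: list = field(default_factory=list)  # absolute deadline per request

def max_linger(slo_s, compute_p95_s, overhead_s, safety=0.5):
    """Linger budget: a fraction of the slack left after compute and overheads."""
    slack = slo_s - compute_p95_s - overhead_s
    return max(0.0, slack * safety)

def should_flush(batch, now, max_batch, linger_s, compute_p95_s):
    """Return the name of the first trigger that fires, or None to keep waiting."""
    if not batch.deadlines:
        return None
    if len(batch.deadlines) >= max_batch:
        return "full"
    if now - batch.opened_at >= linger_s:
        return "linger"
    # Dispatch early if the tightest deadline would otherwise be missed.
    if min(batch.deadlines) - now <= compute_p95_s:
        return "deadline"
    return None

key = BatchKey("encoder", "v3", 128)
linger = max_linger(slo_s=0.300, compute_p95_s=0.120, overhead_s=0.040)
batch = PendingBatch(key, opened_at=0.0, deadlines=[0.300])
early = should_flush(batch, now=0.01, max_batch=8, linger_s=linger, compute_p95_s=0.120)
late = should_flush(batch, now=0.08, max_batch=8, linger_s=linger, compute_p95_s=0.120)
```

With a 300 ms SLO, 120 ms of p95 compute, and 40 ms of fixed overhead, the slack is 140 ms and the linger budget comes out to about 70 ms, so the batch keeps waiting at 10 ms and flushes on the linger trigger at 80 ms.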

Follow-up Questions

  • How does your scheduler change for autoregressive LLM generation, and what becomes the binding capacity constraint once you adopt continuous batching?
  • A single tenant suddenly sends a burst of very long-sequence requests that fill a length bucket no one else uses. Walk through what happens to other tenants' latency and how your design prevents starvation.
  • A worker crashes mid-batch. Which requests are affected, what do you retry, and why is retrying safe (or not) here?
  • The dashboard shows low GPU utilization but rising p99 latency. What is the likely cause, and what would you tune?
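
For the first follow-up, one common answer is that KV-cache memory, rather than a fixed batch size, limits how many sequences can be admitted. The sketch below assumes a standard transformer KV cache (one key and one value vector per layer, per KV head, per token) and a deliberately conservative policy that reserves the worst case up front. `KVAdmission` and the model dimensions are hypothetical; production engines often allocate cache incrementally instead.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies: K and V for every layer and head."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

class KVAdmission:
    """Admit a sequence only if its worst-case KV footprint fits (hypothetical)."""

    def __init__(self, capacity_tokens):
        self.capacity_tokens = capacity_tokens
        self.reserved = {}  # seq_id -> tokens reserved for that sequence

    def free_tokens(self):
        return self.capacity_tokens - sum(self.reserved.values())

    def try_admit(self, seq_id, prompt_len, max_new_tokens):
        need = prompt_len + max_new_tokens  # worst case: generation runs to the cap
        if need > self.free_tokens():
            return False
        self.reserved[seq_id] = need
        return True

    def finish(self, seq_id):
        """Release a finished sequence's reservation so a waiting one can join."""
        self.reserved.pop(seq_id, None)

# Illustrative dimensions: 32 layers, 8 KV heads, head_dim 128, fp16.
per_token = kv_bytes_per_token(32, 8, 128)        # 131072 bytes = 128 KiB
capacity = (40 * 2**30) // per_token              # tokens that fit in 40 GiB

adm = KVAdmission(capacity_tokens=1000)
a_ok = adm.try_admit("a", prompt_len=200, max_new_tokens=300)
b_blocked = adm.try_admit("b", prompt_len=400, max_new_tokens=200)
adm.finish("a")                                   # "a" completes mid-batch
b_ok = adm.try_admit("b", prompt_len=400, max_new_tokens=200)
```

Sequence "b" is refused while "a" holds its reservation and admitted as soon as "a" finishes, which is the iteration-level behaviour that a run-to-completion static batch cannot provide.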
