Design GPU inference request batching
Company: Anthropic
Role: Software Engineer
Category: ML System Design
Interview Round: Onsite
Design a system that serves **online model-inference requests on GPUs**. Requests arrive one at a time from clients, but GPU throughput is far higher when compatible requests are grouped into batches: a larger batch amortizes the fixed per-step cost (kernel launches, reading weights from HBM) across more requests. Every request you add to a batch, however, makes earlier-arriving requests wait — so the system must form the largest *useful* batch it can without blowing any single request's latency budget.
Design a service that:
- accepts low-latency inference requests over an online API,
- batches *compatible* requests together,
- routes work to GPU workers,
- supports multiple models and model versions concurrently,
- balances throughput (cost per request) against latency SLOs,
- handles overload, failures, and observability.
Your design should cover the **API**, the **queueing model**, the **batching strategy and scheduling policy**, the **worker lifecycle**, the **autoscaling signals**, and the **main trade-offs**.
```hint Where to start
Frame the whole design around one tension: a bigger batch improves GPU efficiency but forces earlier requests to wait. Almost every decision (batch size, wait time, bucketing) is a point on that throughput-vs-latency curve. It helps to break end-to-end latency into stages so you can reason about which one the batching layer actually controls — and which the rest of the system has to keep small and predictable.
```
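One way to make that latency decomposition concrete is a small budget calculation. The sketch below is illustrative only: the stage names (`network_ms`, `routing_ms`, `inference_ms`, `postprocess_ms`), the 25% reserve, and every number are assumptions rather than values from the prompt. It shows how the time available for waiting on a batch falls out of the SLO once the other stages are subtracted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyBudget:
    """Hypothetical per-stage tail-latency estimates, all in milliseconds."""
    slo_ms: float            # end-to-end tail target (e.g. p95)
    network_ms: float        # client <-> gateway
    routing_ms: float        # gateway -> worker dispatch overhead
    inference_ms: float      # GPU time for one batch at the planned max batch size
    postprocess_ms: float    # serialization and response handling
    reserve_fraction: float = 0.25  # share of the SLO held back for jitter (assumed)

    def batching_budget_ms(self) -> float:
        """Time a request may spend waiting for batch-mates before the SLO is at risk."""
        fixed = (self.network_ms + self.routing_ms
                 + self.inference_ms + self.postprocess_ms)
        reserve = self.slo_ms * self.reserve_fraction
        return max(0.0, self.slo_ms - fixed - reserve)


budget = LatencyBudget(slo_ms=300, network_ms=20, routing_ms=5,
                       inference_ms=120, postprocess_ms=5)
print(budget.batching_budget_ms())  # 300 - 150 - 75 = 75.0
```

Under these assumed numbers the batching layer controls at most 75 ms of a 300 ms budget. Everything else has to stay small and predictable for that number to hold.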
```hint What "compatible" means
Two requests can only share one kernel call if they agree on everything that defines the computation — enumerate what those attributes are, and watch for the one whose mismatch is a *correctness* bug rather than just an efficiency loss. The subtler attribute is input shape: padding a 16-token request up to a 2,000-token batch-mate means it pays 2,000-token compute. Think about how you'd group by length and what trade-off finer grouping creates.
```
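The grouping idea in the hint above can be sketched as a batch key plus length buckets. This is a minimal illustration, not a recommended configuration: the key fields (`model`, `version`, `dtype`), the bucket edges in `BUCKET_EDGES`, and the helper names are all hypothetical, and a real system may need more attributes in the key.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import Sequence

# Assumed padded lengths; finer edges waste less padding but fragment traffic more.
BUCKET_EDGES = (32, 128, 512, 2048)


@dataclass(frozen=True)
class BatchKey:
    """Requests may share a batch only if every field matches (fields are illustrative)."""
    model: str
    version: str
    dtype: str
    padded_len: int


def bucket_for(num_tokens: int) -> int:
    """Return the smallest bucket edge that fits the request."""
    i = bisect_left(BUCKET_EDGES, num_tokens)
    if i == len(BUCKET_EDGES):
        raise ValueError(f"{num_tokens} tokens exceeds the largest bucket")
    return BUCKET_EDGES[i]


def batch_key(model: str, version: str, dtype: str, num_tokens: int) -> BatchKey:
    return BatchKey(model, version, dtype, bucket_for(num_tokens))


def padding_waste(lengths: Sequence[int], padded_len: int) -> float:
    """Fraction of computed token positions that are padding."""
    return 1 - sum(lengths) / (padded_len * len(lengths))


print(batch_key("ranker", "v3", "fp16", 16).padded_len)    # 32
print(batch_key("ranker", "v3", "fp16", 2000).padded_len)  # 2048
print(padding_waste([16, 2000], 2048))                     # 0.5078125
```

With these edges, the 16-token and 2,000-token requests land in different buckets. Forcing them into one 2,048-wide batch would spend about half the compute on padding.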
```hint The scheduler's flush rule
A batch can't grow forever — so what makes the scheduler stop waiting and dispatch? List the distinct triggers you'd want; aim for more than the obvious "it's full." For any time-based "linger" limit, ask what number it can take *without* eating the whole SLO: a strong answer ties it to the budget rather than picking a round number.
```
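A multi-condition flush rule might look like the sketch below. It assumes three triggers (batch full, linger expired, tightest deadline at risk); the class and field names are hypothetical, and other triggers are possible. `max_linger_ms` is meant to be derived from the SLO budget rather than picked as a round number.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Pending:
    arrival_ms: float    # when the request entered this queue
    deadline_ms: float   # absolute time by which the response must be sent


@dataclass
class BatchQueue:
    """One queue per batch key; the scheduler polls flush_reason() on a timer or on arrival."""
    max_batch_size: int
    max_linger_ms: float      # upper bound on how long the oldest request may wait
    est_inference_ms: float   # estimated GPU time for a batch from this queue
    items: List[Pending] = field(default_factory=list)

    def flush_reason(self, now_ms: float) -> Optional[str]:
        """Return why the batch should be dispatched now, or None to keep waiting."""
        if not self.items:
            return None
        if len(self.items) >= self.max_batch_size:
            return "full"
        if now_ms - self.items[0].arrival_ms >= self.max_linger_ms:
            return "linger"
        tightest = min(p.deadline_ms for p in self.items)
        if now_ms + self.est_inference_ms >= tightest:
            return "deadline"
        return None


q = BatchQueue(max_batch_size=4, max_linger_ms=10, est_inference_ms=50)
q.items.append(Pending(arrival_ms=0, deadline_ms=200))
print(q.flush_reason(now_ms=5))    # None
print(q.flush_reason(now_ms=10))   # linger
```

The deadline check is what keeps a generous linger setting from silently consuming a request whose remaining budget is already small.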
```hint The LLM-specific twist
A static "form one batch, run it to completion, return" rule behaves very differently when each request emits a variable, unknown number of output tokens than when every request is a single fixed-cost forward pass. Reason about what happens to a short reply that shares a static batch with a very long one, and about what frees up (or doesn't) when one sequence finishes mid-batch. That should push you toward a different scheduling granularity — and a different binding resource — for the generation case.
```
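The difference between the two scheduling granularities can be seen in a toy simulation. The sketch below counts decode steps only and makes strong simplifying assumptions: one token per step per sequence, output lengths known in advance (they are not in practice), and a fixed number of `slots` standing in for whatever resource actually limits admission. The function names are hypothetical.

```python
from collections import deque
from typing import Dict, List


def static_batching(output_lens: List[int], slots: int) -> Dict[int, int]:
    """Map request index -> decode step at which its response is returned."""
    done: Dict[int, int] = {}
    step = 0
    for start in range(0, len(output_lens), slots):
        batch = output_lens[start:start + slots]
        step += max(batch)  # the whole batch runs until its longest sequence ends
        for i in range(start, start + len(batch)):
            done[i] = step
    return done


def continuous_batching(output_lens: List[int], slots: int) -> Dict[int, int]:
    """Same mapping, but scheduling decisions are made at every decode step."""
    waiting = deque(enumerate(output_lens))
    running: Dict[int, int] = {}  # request index -> tokens still to generate
    done: Dict[int, int] = {}
    step = 0
    while waiting or running:
        # Admit waiting requests into any slot freed by a finished sequence.
        while waiting and len(running) < slots:
            idx, n = waiting.popleft()
            running[idx] = n
        step += 1
        for idx in list(running):
            running[idx] -= 1
            if running[idx] <= 0:
                done[idx] = step
                del running[idx]
    return done


lens = [2, 100, 3, 4]
print(static_batching(lens, slots=2))                      # {0: 100, 1: 100, 2: 104, 3: 104}
print(dict(sorted(continuous_batching(lens, slots=2).items())))  # {0: 2, 1: 100, 2: 5, 3: 9}
```

In the static case the 2-token reply is held for 100 steps by its batch-mate, and the last two requests cannot start until step 100. With per-step scheduling the short requests finish at steps 2, 5, and 9 while the long one is still running.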
```hint Autoscaling pitfall
GPU utilization alone is a trap: you can see low utilization and still miss the SLO when traffic is fragmented across incompatible buckets that each run tiny batches. Think about what *leading* signal best predicts SLO risk.
```
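One candidate leading signal is the tail of queue wait time relative to the wait budget, measured per batch key. The sketch below is an assumption-laden illustration: `should_scale_up`, the 0.8 threshold, the nearest-rank percentile, and the sample numbers are all hypothetical, and a production autoscaler would also need smoothing and cooldowns.

```python
from typing import Sequence


def percentile(values: Sequence[float], q: int) -> float:
    """Nearest-rank percentile for an integer q in 1..100."""
    ordered = sorted(values)
    rank = max(1, (q * len(ordered) + 99) // 100)  # ceil(q * n / 100) in integer math
    return ordered[rank - 1]


def should_scale_up(queue_waits_ms: Sequence[float],
                    wait_budget_ms: float,
                    threshold: float = 0.8) -> bool:
    """Scale when p95 queue wait approaches the budget, regardless of GPU utilization."""
    if not queue_waits_ms:
        return False
    return percentile(queue_waits_ms, 95) >= threshold * wait_budget_ms


# Most requests wait briefly, but the tail is close to a 90 ms wait budget.
waits = [5.0] * 18 + [80.0, 90.0]
print(percentile(waits, 95))                       # 80.0
print(should_scale_up(waits, wait_budget_ms=90))   # True
```

The signal fires here even if average GPU utilization looks low, which is the fragmented-bucket situation the hint describes.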
### Constraints & Assumptions
State your own where the interviewer leaves them open; a reasonable default scenario is:
- **Online, mostly synchronous API** with a tail-latency SLO, e.g. p95 of a few hundred ms for a fixed-cost model, or p95 *time-to-first-token* plus a per-token target for autoregressive generation.
- **Heterogeneous workload:** multiple distinct models/versions, a mix of input shapes (e.g. text sequence lengths, image sizes), and a request rate that varies diurnally with spikes.
- **Multi-tenant:** several clients share the fleet; no single tenant should be able to starve the others.
- **Inference is read-only / side-effect-free** — there is no external state to corrupt, which shapes how you think about retries and idempotency.
- GPU capacity is the scarce, expensive resource; GPU pods are slow (tens of seconds) to spin up.
### Clarifying Questions to Ask
- Is the workload **fixed-cost** (classifiers, embeddings, rerankers, a single forward pass) or **autoregressive generation** (variable, unknown output length)? The two cases need fundamentally different schedulers.
- What is the exact latency SLO, and is it defined on end-to-end latency, time-to-first-token, or per-token latency?
- What is the expected QPS, the request-size distribution (sequence lengths / image sizes), and how many distinct models and versions must be served at once?
- Is the system multi-tenant with fairness/quota requirements, or single-tenant?
- How strict is the durability requirement — is "a crashed in-flight request just times out and the client retries" acceptable, or must no request be lost?
- Is streaming (token-by-token) output required, or only a single final response?
### What a Strong Answer Covers
- A clear statement of the **central tension** (batch size vs. tail latency) and a latency decomposition that isolates what the batching layer controls.
- A precise definition of the **batch key / compatibility**, including which request attributes must agree, the distinction between a mismatch that wastes compute and one that is a correctness bug, and shape/length bucketing.
- A concrete **batching policy**: a multi-condition flush rule and a principled way to derive the linger time from the SLO rather than guessing.
- Recognition that **autoregressive generation needs a different scheduler** than fixed-cost models, with a justified account of what changes about scheduling granularity and which resource becomes the binding admission constraint.
- An end-to-end **request flow**, a **worker lifecycle** (warm weights, readiness/drain, version rollout, failure isolation), and **overload handling** via admission control / backpressure / graceful degradation.
- **Autoscaling on signals that actually predict SLO risk** — reasoning about why raw GPU utilization can mislead — and **observability** that can distinguish "slow because we wait to batch" from "slow because the model is slow."
- An explicit list of **trade-offs** and a sensible **"what I'd ship first"** that starts simple and specializes only when justified.
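The overload-handling and multi-tenancy points above can be illustrated with a small admission check at the front door. This is a sketch under stated assumptions, not a prescribed design: `AdmissionController`, its fields, the per-tenant cap, and the string decisions are hypothetical, and the wait estimate is a simple depth-over-drain-rate approximation.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class AdmissionController:
    service_rate_per_s: float   # requests/s the fleet is currently draining (measured)
    est_inference_ms: float     # estimated GPU time once a request is dispatched
    per_tenant_cap: int         # max queued requests per tenant (fairness guard)
    queued: Counter = field(default_factory=Counter)

    def decide(self, tenant: str, remaining_budget_ms: float) -> str:
        """Reject early when the request cannot plausibly meet its deadline."""
        if self.queued[tenant] >= self.per_tenant_cap:
            return "reject_tenant_quota"
        depth = sum(self.queued.values())
        est_wait_ms = depth * 1000 / self.service_rate_per_s
        if est_wait_ms + self.est_inference_ms > remaining_budget_ms:
            return "reject_overload"
        self.queued[tenant] += 1
        return "admit"

    def on_complete(self, tenant: str) -> None:
        """Call when a request leaves the queue (dispatched, finished, or cancelled)."""
        if self.queued[tenant] > 0:
            self.queued[tenant] -= 1


ac = AdmissionController(service_rate_per_s=100, est_inference_ms=120, per_tenant_cap=2)
print(ac.decide("tenant-a", remaining_budget_ms=300))  # admit
print(ac.decide("tenant-a", remaining_budget_ms=300))  # admit
print(ac.decide("tenant-a", remaining_budget_ms=300))  # reject_tenant_quota
```

Rejecting at admission keeps a request that is already doomed from occupying a batch slot, and the per-tenant cap bounds how much of the queue one client can hold.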
### Follow-up Questions
- How does your scheduler change for **autoregressive LLM generation**, and what becomes the binding capacity constraint once you adopt continuous batching?
- A single tenant suddenly sends a burst of very long-sequence requests that fill a length bucket no one else uses. Walk through what happens to other tenants' latency and how your design prevents starvation.
- A worker crashes mid-batch. Which requests are affected, what do you retry, and why is retrying safe (or not) here?
- The dashboard shows **low GPU utilization but rising p99 latency**. What is the likely cause, and what would you tune?
Quick Answer: This question evaluates a candidate's competency in ML System Design, focusing on GPU-based inference batching, scheduling, autoscaling, and the operational concerns of routing, compatibility, and fault-tolerance.