How do I approach System Design interview questions?

System Design questions require understanding of core concepts and practice. PracHub provides solutions with explanations to help you master system design interviews.

What difficulty level is this interview question?

This is a medium difficulty System Design question, commonly asked during Onsite rounds at Anthropic.

What role is this question designed for?

This question is commonly asked for Software Engineer candidates at Anthropic during technical interviews.

Design an LLM Request Batching System | Anthropic Interview Question

Q: Design an LLM Request Batching System

This question evaluates a candidate's ability to design a concurrent batching layer that groups individual requests into shared GPU inference calls under latency and size constraints. It tests system design skills around concurrency control, throughput-latency trade-offs, and scaling across multiple backend workers. Such questions are common in interviews to assess practical distributed-systems and multithreading design ability.

Design an LLM Request Batching System

A model-inference service sits in front of a fleet of GPUs. Each individual user request carries one input string and expects one output string, but the GPUs are far more efficient when many inputs are processed together in a single forward pass. A given GPU can only process one batch at a time. Your job is to design the component that sits between incoming requests and the GPU backend: it collects concurrent individual requests, groups them into batches, dispatches each batch to a GPU, and routes each result back to the caller that submitted it.

The public entry point is a single blocking call used by request-handling threads:

List<String> batchString(List<String> input)   // returns the model outputs, in the same order as input

In practice each caller submits its own request and blocks until its own result is ready, while the system internally merges many such calls into shared batches.

Constraints & Assumptions

A single GPU processes exactly one batch at a time; a batch is a list of inputs and returns a list of outputs of equal length, positionally aligned.
There is a maximum batch size B (GPU memory bound) and the system should not let a request wait longer than a latency budget T (e.g. tens of milliseconds) before being dispatched.
Hundreds to thousands of caller threads may invoke the entry point concurrently.
The backend exposes a function like List<String> runOnGpu(List<String> batch) that blocks until the batch completes. There are G GPUs available (start with G = 1 , then generalize).
Treat the model call itself as a black box; you are designing the batching, concurrency, and dispatch around it.

Clarifying Questions to Ask

What are the target p50/p99 latencies and the expected request rate, so we can pick B and T ?
Is per-request ordering of outputs required to match inputs exactly (yes — positional), and can different callers' requests be freely interleaved within a batch?
Are requests uniform in cost, or do input lengths vary enough that we should batch by token budget rather than by count?
What should happen to in-flight requests if a GPU call fails or times out — retry, fail the request, or re-queue?
Is there a backpressure/admission story when the system is overloaded (reject, queue with a bound, shed)?

Part 1 — Batch formation and the entry-point contract

Design the batchString call and the aggregation policy. Specify the in-memory buffer that accumulates pending requests, and the two triggers that close a batch and send it: reaching the maximum batch size B, or a pending request reaching the timeout T. Explain how a caller blocks until exactly its outputs are returned and how outputs are demultiplexed back to the right caller in the right order.

What This Part Should Cover Premium

Part 2 — Concurrency and thread-safety

Many threads call batchString simultaneously while a dispatcher drains the buffer. Make the shared state correct under concurrency. Identify the race conditions and say exactly what is protected by what.

What This Part Should Cover Premium

Part 3 — Dispatch architecture and scaling to many GPUs

Generalize from one GPU to G GPUs. Compare a gateway-push design (a coordinator assigns batches to GPUs) with a worker-pull design (idle GPU workers pull the next batch from a shared queue). Recommend one and justify it, then describe how the system scales and stays available.

What This Part Should Cover Premium

What a Strong Answer Covers Premium

Follow-up Questions

If inputs vary widely in length, how would you batch by total token budget instead of by request count, and how does that change the buffer and the flush triggers?
A single batch contains one pathological input that makes runOnGpu 10x slower. How do you keep that from inflating the latency of the other requests in the same batch next time?
How would you add fairness or priority (e.g. paid tier vs. free tier) without starving low-priority requests?
How do you observe this system in production — what metrics (batch fill ratio, queue depth, p99 wait-before-dispatch) would you alert on?

Design an LLM Request Batching System

The public entry point is a single blocking call used by request-handling threads:

List<String> batchString(List<String> input)   // returns the model outputs, in the same order as input

In practice each caller submits its own request and blocks until its own result is ready, while the system internally merges many such calls into shared batches.

Constraints & Assumptions

A single GPU processes exactly one batch at a time; a batch is a list of inputs and returns a list of outputs of equal length, positionally aligned.
There is a maximum batch size B (GPU memory bound) and the system should not let a request wait longer than a latency budget T (e.g. tens of milliseconds) before being dispatched.
Hundreds to thousands of caller threads may invoke the entry point concurrently.
The backend exposes a function like List<String> runOnGpu(List<String> batch) that blocks until the batch completes. There are G GPUs available (start with G = 1 , then generalize).
Treat the model call itself as a black box; you are designing the batching, concurrency, and dispatch around it.

Clarifying Questions to Ask

What are the target p50/p99 latencies and the expected request rate, so we can pick B and T ?
Is per-request ordering of outputs required to match inputs exactly (yes — positional), and can different callers' requests be freely interleaved within a batch?
Are requests uniform in cost, or do input lengths vary enough that we should batch by token budget rather than by count?
What should happen to in-flight requests if a GPU call fails or times out — retry, fail the request, or re-queue?
Is there a backpressure/admission story when the system is overloaded (reject, queue with a bound, shed)?

Part 1 — Batch formation and the entry-point contract

What This Part Should Cover Premium

Part 2 — Concurrency and thread-safety

What This Part Should Cover Premium

Part 3 — Dispatch architecture and scaling to many GPUs

What This Part Should Cover Premium

What a Strong Answer Covers Premium

Follow-up Questions

If inputs vary widely in length, how would you batch by total token budget instead of by request count, and how does that change the buffer and the flush triggers?
A single batch contains one pathological input that makes runOnGpu 10x slower. How do you keep that from inflating the latency of the other requests in the same batch next time?
How would you add fairness or priority (e.g. paid tier vs. free tier) without starving low-priority requests?
How do you observe this system in production — what metrics (batch fill ratio, queue depth, p99 wait-before-dispatch) would you alert on?

Design an LLM Request Batching System

Quick Overview

Design an LLM Request Batching System

Design an LLM Request Batching System

Constraints & Assumptions

Clarifying Questions to Ask

Part 1 — Batch formation and the entry-point contract

What This Part Should Cover Premium

Part 2 — Concurrency and thread-safety

What This Part Should Cover Premium

Part 3 — Dispatch architecture and scaling to many GPUs

What This Part Should Cover Premium

What a Strong Answer Covers Premium

Follow-up Questions

Submit Your Answer to Earn 20XP

Design an LLM Request Batching System

Quick Overview

Design an LLM Request Batching System

Design an LLM Request Batching System

Constraints & Assumptions

Clarifying Questions to Ask

Part 1 — Batch formation and the entry-point contract

What This Part Should Cover Premium

Part 2 — Concurrency and thread-safety

What This Part Should Cover Premium

Part 3 — Dispatch architecture and scaling to many GPUs

What This Part Should Cover Premium

What a Strong Answer Covers Premium

Follow-up Questions

Submit Your Answer to Earn 20XP