This question evaluates a candidate's ability to design a concurrent batching layer that groups individual requests into shared GPU inference calls under latency and size constraints. It tests system design skills around concurrency control, throughput-latency trade-offs, and scaling across multiple backend workers. Such questions are common in interviews to assess practical distributed-systems and multithreading design ability.
# Design an LLM Request Batching System
A model-inference service sits in front of a fleet of GPUs. Each individual user request carries one input string and expects one output string, but the GPUs are far more efficient when many inputs are processed together in a single forward pass. **A given GPU can only process one batch at a time.** Your job is to design the component that sits between incoming requests and the GPU backend: it collects concurrent individual requests, groups them into batches, dispatches each batch to a GPU, and routes each result back to the caller that submitted it.
The public entry point is a single blocking call used by request-handling threads:
```
List<String> batchString(List<String> input) // returns the model outputs, in the same order as input
```
In practice each caller submits its own request and blocks until its own result is ready, while the system internally merges many such calls into shared batches.
### Constraints & Assumptions
- A single GPU processes exactly one batch at a time; a batch is a list of inputs and returns a list of outputs of equal length, positionally aligned.
- There is a maximum batch size `B` (GPU memory bound) and the system should not let a request wait longer than a latency budget `T` (e.g. tens of milliseconds) before being dispatched.
- Hundreds to thousands of caller threads may invoke the entry point concurrently.
- The backend exposes a function like `List<String> runOnGpu(List<String> batch)` that blocks until the batch completes. There are `G` GPUs available (start with `G = 1`, then generalize).
- Treat the model call itself as a black box; you are designing the batching, concurrency, and dispatch around it.
### Clarifying Questions to Ask
- What are the target p50/p99 latencies and the expected request rate, so we can pick `B` and `T`?
- Is per-request ordering of outputs required to match inputs exactly (yes — positional), and can different callers' requests be freely interleaved within a batch?
- Are requests uniform in cost, or do input lengths vary enough that we should batch by token budget rather than by count?
- What should happen to in-flight requests if a GPU call fails or times out — retry, fail the request, or re-queue?
- Is there a backpressure/admission story when the system is overloaded (reject, queue with a bound, shed)?
### Part 1 — Batch formation and the entry-point contract
Design the `batchString` call and the aggregation policy. Specify the in-memory buffer that accumulates pending requests, and the two triggers that close a batch and send it: reaching the maximum batch size `B`, or a pending request reaching the timeout `T`. Explain how a caller blocks until exactly its outputs are returned and how outputs are demultiplexed back to the right caller in the right order.
```hint Where to start
Model each call as enqueuing a small "request" object (its inputs plus a completion handle such as a future/promise/condition variable) into a shared buffer, then blocking on that handle until a dispatcher fills it in.
```
```hint Two flush triggers
A batch flushes when it hits size `B` (capacity trigger) or when the oldest pending request has waited `T` (a timer/deadline trigger). Track per-request slot indices so you can scatter the returned list back to each caller positionally.
```
#### What This Part Should Cover
```premium-lock What This Part Should Cover
```
### Part 2 — Concurrency and thread-safety
Many threads call `batchString` simultaneously while a dispatcher drains the buffer. Make the shared state correct under concurrency. Identify the race conditions and say exactly what is protected by what.
```hint Guard the buffer
The pending buffer and the batch-currently-being-assembled are the shared mutable state; protect them with a mutex held only briefly (enqueue + maybe-close-batch), never across the long GPU call.
```
```hint Don't hold the lock during inference
Swap the closed batch out under the lock, release the lock, then run `runOnGpu`. Otherwise one in-flight batch serializes all enqueues and kills throughput.
```
#### What This Part Should Cover
```premium-lock What This Part Should Cover
```
### Part 3 — Dispatch architecture and scaling to many GPUs
Generalize from one GPU to `G` GPUs. Compare a **gateway-push** design (a coordinator assigns batches to GPUs) with a **worker-pull** design (idle GPU workers pull the next batch from a shared queue). Recommend one and justify it, then describe how the system scales and stays available.
```hint Pull beats push here
With one-batch-at-a-time GPUs, let each worker pull when it becomes free. Pull naturally load-balances to whoever is idle and avoids the coordinator having to track each GPU's busy/free state and head-of-line blocking.
```
#### What This Part Should Cover
```premium-lock What This Part Should Cover
```
### What a Strong Answer Covers
```premium-lock What a Strong Answer Covers
```
### Follow-up Questions
- If inputs vary widely in length, how would you batch by total token budget instead of by request count, and how does that change the buffer and the flush triggers?
- A single batch contains one pathological input that makes `runOnGpu` 10x slower. How do you keep that from inflating the latency of the other requests in the same batch next time?
- How would you add fairness or priority (e.g. paid tier vs. free tier) without starving low-priority requests?
- How do you observe this system in production — what metrics (batch fill ratio, queue depth, p99 wait-before-dispatch) would you alert on?
Quick Answer: This question evaluates a candidate's ability to design a concurrent batching layer that groups individual requests into shared GPU inference calls under latency and size constraints. It tests system design skills around concurrency control, throughput-latency trade-offs, and scaling across multiple backend workers. Such questions are common in interviews to assess practical distributed-systems and multithreading design ability.
A model-inference service sits in front of a fleet of GPUs. Each individual user request carries one input string and expects one output string, but the GPUs are far more efficient when many inputs are processed together in a single forward pass. A given GPU can only process one batch at a time. Your job is to design the component that sits between incoming requests and the GPU backend: it collects concurrent individual requests, groups them into batches, dispatches each batch to a GPU, and routes each result back to the caller that submitted it.
The public entry point is a single blocking call used by request-handling threads:
List<String> batchString(List<String> input) // returns the model outputs, in the same order as input
In practice each caller submits its own request and blocks until its own result is ready, while the system internally merges many such calls into shared batches.
Constraints & Assumptions
A single GPU processes exactly one batch at a time; a batch is a list of inputs and returns a list of outputs of equal length, positionally aligned.
There is a maximum batch size
B
(GPU memory bound) and the system should not let a request wait longer than a latency budget
T
(e.g. tens of milliseconds) before being dispatched.
Hundreds to thousands of caller threads may invoke the entry point concurrently.
The backend exposes a function like
List<String> runOnGpu(List<String> batch)
that blocks until the batch completes. There are
G
GPUs available (start with
G = 1
, then generalize).
Treat the model call itself as a black box; you are designing the batching, concurrency, and dispatch around it.
Clarifying Questions to Ask
What are the target p50/p99 latencies and the expected request rate, so we can pick
B
and
T
?
Is per-request ordering of outputs required to match inputs exactly (yes — positional), and can different callers' requests be freely interleaved within a batch?
Are requests uniform in cost, or do input lengths vary enough that we should batch by token budget rather than by count?
What should happen to in-flight requests if a GPU call fails or times out — retry, fail the request, or re-queue?
Is there a backpressure/admission story when the system is overloaded (reject, queue with a bound, shed)?
Part 1 — Batch formation and the entry-point contract
Design the batchString call and the aggregation policy. Specify the in-memory buffer that accumulates pending requests, and the two triggers that close a batch and send it: reaching the maximum batch size B, or a pending request reaching the timeout T. Explain how a caller blocks until exactly its outputs are returned and how outputs are demultiplexed back to the right caller in the right order.
What This Part Should Cover Premium
Part 2 — Concurrency and thread-safety
Many threads call batchString simultaneously while a dispatcher drains the buffer. Make the shared state correct under concurrency. Identify the race conditions and say exactly what is protected by what.
What This Part Should Cover Premium
Part 3 — Dispatch architecture and scaling to many GPUs
Generalize from one GPU to G GPUs. Compare a gateway-push design (a coordinator assigns batches to GPUs) with a worker-pull design (idle GPU workers pull the next batch from a shared queue). Recommend one and justify it, then describe how the system scales and stays available.
What This Part Should Cover Premium
What a Strong Answer Covers Premium
Follow-up Questions
If inputs vary widely in length, how would you batch by total token budget instead of by request count, and how does that change the buffer and the flush triggers?
A single batch contains one pathological input that makes
runOnGpu
10x slower. How do you keep that from inflating the latency of the other requests in the same batch next time?
How would you add fairness or priority (e.g. paid tier vs. free tier) without starving low-priority requests?
How do you observe this system in production — what metrics (batch fill ratio, queue depth, p99 wait-before-dispatch) would you alert on?