ML Inference APIs And GPU Batching
Asked of: Software Engineer

What's being tested
These interviews test whether you can design a GPU-backed inference service that meets real latency, throughput, reliability, and cost constraints under multi-tenant load. The interviewer is probing for distributed systems judgment: queueing, batching, routing, autoscaling, failure isolation, observability, and API semantics. For Anthropic, this matters because inference infrastructure sits directly on the product path: poor batching wastes expensive accelerators, poor isolation hurts customers, and poor overload behavior can take down shared capacity. A strong Software Engineer answer should stay at the serving-platform layer, not drift into model architecture or training methodology.
Core knowledge
- Latency SLOs must be decomposed across the request path: client edge, auth/rate limit, routing, queue wait, GPU execution, post-processing, and streaming. Track `p50`, `p95`, `p99`, timeout rate, and queue wait separately; an aggregate `p99` hides whether the bottleneck is scheduling or compute.
- Dynamic batching groups requests arriving within a short window to improve GPU utilization. The key knobs are max batch size, max batch delay, token budget, and compatible model/version. Larger batches increase throughput but add queueing delay, so the policy must be tied to an SLO like “`p95` first-token latency < 500 ms.”
- Queueing theory gives the basic danger signal: as utilization approaches 1, queueing latency grows nonlinearly. In practice, keep serving pools below roughly 60–80% sustained utilization if `p99` latency matters, because burstiness and long prompts create tail amplification.
- Continuous batching for autoregressive LLM serving differs from simple request batching. New requests can join while existing requests are decoding, and finished sequences leave the batch. Systems like `vLLM` use paged attention to reduce KV-cache fragmentation and improve memory utilization during long-running generation.
- Prefill vs decode are different workload phases. Prefill processes the input prompt in parallel and is compute-heavy; decode generates one or a few tokens per step and is often memory-bandwidth or scheduling-sensitive. A good design may route, batch, and measure these phases separately.
- Admission control protects the service before it collapses. Use per-tenant rate limits, max in-flight requests, max prompt length, max output tokens, and queue deadlines. If estimated work exceeds capacity, return `429` or `503` early rather than accepting work that will time out in the queue.
- Routing should consider model ID, model version, tenant tier, region, hardware type, current queue depth, GPU memory availability, and request shape. A basic design uses a control plane for fleet state and a data-plane router using least-loaded or weighted routing with health checks.
- Multi-tenancy isolation requires fairness, not just authentication. Common approaches include per-tenant queues, weighted fair queueing, token-bucket rate limits, reserved capacity for high-priority tenants, and noisy-neighbor detection. Without this, one customer with long prompts can consume KV cache and degrade everyone else’s `p99`.
- GPU memory management is often the hard limit. Model weights, activation buffers, and KV cache compete for memory. For LLMs, KV cache grows roughly with batch size × sequence length × layers × hidden dimension, so “batch more” can trigger out-of-memory failures unless capped by a token budget.
- Failure handling should distinguish retryable and non-retryable failures. Router or worker crashes can be retried if the request is idempotent; partial streamed responses usually cannot be transparently retried without client-visible semantics. Use deadlines, cancellation propagation, circuit breakers, and draining for deploys.
- Autoscaling should use workload-aware signals, not just CPU. Better signals include queue depth by model, queue age, GPU utilization, tokens/sec, batch fullness, KV-cache pressure, and SLO burn rate. Scale-up is slow for GPU nodes, so keep warm capacity or predictive buffers for known traffic spikes.
- Observability needs cardinality discipline and request-shape breakdowns. Emit metrics for `time_to_first_token`, `tokens_per_second`, `queue_wait_ms`, `batch_size`, `prompt_tokens`, `completion_tokens`, OOM count, retry count, and per-tenant throttling. Logs and traces should include request IDs but avoid storing sensitive prompt text by default.
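The GPU memory point above is easier to reason about with numbers. The following is a back-of-the-envelope sketch, assuming standard multi-head attention where keys and values are each stored per layer at the full hidden dimension (models using grouped-query attention or quantized caches would need less). The function names and the example model shape (32 layers, hidden dimension 4096, 2-byte values) are hypothetical and chosen only for illustration.

```python
def kv_cache_bytes(batch_size, seq_len, num_layers, hidden_dim, bytes_per_value=2):
    """Approximate KV-cache size in bytes.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * batch_size * seq_len * num_layers * hidden_dim * bytes_per_value


def max_tokens_in_flight(free_bytes, num_layers, hidden_dim, bytes_per_value=2):
    """Token budget: total tokens across all sequences that fit in free_bytes."""
    per_token = 2 * num_layers * hidden_dim * bytes_per_value
    return free_bytes // per_token


GIB = 2**30

# Hypothetical model shape: 32 layers, hidden dimension 4096, fp16 cache.
print(kv_cache_bytes(8, 2048, 32, 4096) / GIB)       # 8.0 (GiB)
print(max_tokens_in_flight(20 * GIB, 32, 4096))       # 40960 tokens
```

Under these assumptions each token costs 0.5 MiB of cache, so a scheduler that caps batches by total tokens rather than by request count can bound memory use directly.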
Worked example
For “Design GPU inference request batching,” start by clarifying the workload: “Are these LLM text generation requests, embeddings, or classification? Do we optimize for time-to-first-token, total completion latency, throughput, or cost? Are requests streamed, and do tenants have different SLOs?” Then declare assumptions: a shared fleet serves multiple model versions, requests have variable prompt and output lengths, and the service has a strict p95 latency target.
Organize the answer around four pillars: request ingress and validation, batching scheduler, GPU worker execution, and observability/autoscaling. At ingress, describe auth, tenant rate limits, request deadlines, token limits, and routing by model/version. In the scheduler, explain per-model queues, compatibility constraints, max batch delay, max batch size, and token-budget-based batching rather than only count-based batching. In the worker, mention loading model weights, managing KV cache, streaming partial outputs, handling cancellation, and returning structured errors.
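As a rough illustration of the scheduler described above, the sketch below forms one batch from a single per-model FIFO queue using a token budget, a max batch size, a max batch delay, and per-request deadlines. The `Request` fields, the `form_batch` name, and the worst-case token cost (`prompt_tokens + max_output_tokens`) are assumptions made for illustration, not a description of any particular serving system. It also assumes ingress has already rejected any request whose cost exceeds the token budget on its own.

```python
from dataclasses import dataclass


@dataclass
class Request:
    request_id: str
    prompt_tokens: int
    max_output_tokens: int
    arrival_ms: float
    deadline_ms: float

    @property
    def token_cost(self) -> int:
        # Worst-case number of tokens this request may hold in the KV cache.
        return self.prompt_tokens + self.max_output_tokens


def form_batch(queue, now_ms, max_batch_size, token_budget, max_delay_ms):
    """Return (batch, expired, remaining) for one per-model FIFO queue."""
    live, expired = [], []
    for req in queue:
        # Drop requests whose deadline has passed instead of running them.
        (expired if req.deadline_ms <= now_ms else live).append(req)
    if not live:
        return [], expired, []

    # Take requests in arrival order until a cap is hit (strict FIFO).
    batch, used = [], 0
    for req in live:
        if len(batch) == max_batch_size or used + req.token_cost > token_budget:
            break
        batch.append(req)
        used += req.token_cost
    remaining = live[len(batch):]

    # Dispatch when a cap was hit or the oldest request has waited long enough.
    hit_cap = len(batch) == max_batch_size or bool(remaining)
    oldest_wait_ms = now_ms - live[0].arrival_ms
    if not hit_cap and oldest_wait_ms < max_delay_ms:
        return [], expired, live  # keep waiting for a fuller batch
    return batch, expired, remaining
```

Strict FIFO is the simplest choice here; a real scheduler might skip past a large request to fill the batch with smaller ones, at the cost of having to guard against starving the large request.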
A strong tradeoff to flag is throughput versus tail latency: waiting 20 ms to build a fuller batch may improve GPU utilization materially, but doing so for an interactive tenant can violate first-token SLOs. You can propose separate classes, such as “interactive” with small max delay and “batch/offline” with larger delay and lower priority. Close by saying that, with more time, you would detail load testing methodology, failure injection, and cost controls such as model placement and warm-pool sizing.
A second angle
For “Design a prompt processing backend,” the same serving concepts apply, but the emphasis shifts toward asynchronous job orchestration and durable state. Instead of optimizing only for interactive latency, you may need an API that accepts a job, returns a job ID, supports idempotent submission, and lets clients poll or receive callbacks. Batching still matters at the GPU layer, but the frontend design also needs job state transitions such as queued, running, succeeded, failed, and cancelled. Retries and dead-letter handling become more central because a background job can survive client disconnects, unlike a purely synchronous inference call.
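The job lifecycle and idempotent submission described above can be sketched as a small state machine. This is a minimal in-memory illustration under stated assumptions: `JobStore`, the `(tenant_id, idempotency_key)` lookup, and the exact set of allowed transitions are hypothetical, and a real backend would keep this state in a durable store.

```python
import uuid
from enum import Enum


class JobState(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    CANCELLED = "cancelled"


# Allowed transitions; terminal states have no outgoing edges.
ALLOWED = {
    JobState.QUEUED: {JobState.RUNNING, JobState.CANCELLED},
    JobState.RUNNING: {JobState.SUCCEEDED, JobState.FAILED, JobState.CANCELLED},
    JobState.SUCCEEDED: set(),
    JobState.FAILED: set(),
    JobState.CANCELLED: set(),
}


class JobStore:
    """In-memory stand-in for a durable job table (illustrative only)."""

    def __init__(self):
        self._jobs = {}    # job_id -> JobState
        self._by_key = {}  # (tenant_id, idempotency_key) -> job_id

    def submit(self, tenant_id, idempotency_key):
        """Create a job, or return the existing job ID for a repeated key."""
        key = (tenant_id, idempotency_key)
        if key in self._by_key:
            return self._by_key[key]
        job_id = str(uuid.uuid4())
        self._jobs[job_id] = JobState.QUEUED
        self._by_key[key] = job_id
        return job_id

    def transition(self, job_id, new_state):
        """Move a job to new_state, rejecting transitions not in ALLOWED."""
        current = self._jobs[job_id]
        if new_state not in ALLOWED[current]:
            raise ValueError(f"illegal transition {current.value} -> {new_state.value}")
        self._jobs[job_id] = new_state

    def state(self, job_id):
        return self._jobs[job_id]
```

In this sketch `failed` is terminal, so a retry would be modeled as a new attempt rather than a transition back to `queued`; either choice is defensible as long as it is stated.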
Common pitfalls
Pitfall: Designing batching as “collect N requests, run them, repeat.”
That answer misses variable sequence lengths, deadlines, tenant priority, and GPU memory limits. A better answer says batches are formed by compatibility and token budget, constrained by max wait time and per-request deadlines.
Pitfall: Talking only about GPU utilization and ignoring user-visible latency.
High utilization is not the product goal; it is a cost-efficiency goal under an SLO. Interviewers expect you to reason about p95/p99, queue wait, time-to-first-token, overload behavior, and what happens when the system is near saturation.
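To make the near-saturation behavior concrete, the sketch below uses the textbook M/M/1 queue, where mean time waiting in queue equals service time × ρ / (1 − ρ) for utilization ρ. This is a deliberate simplification: real GPU serving is batched and bursty, so treat the numbers as a lower bound on how quickly latency degrades, not a prediction. The function name and the 50 ms service time are hypothetical.

```python
def mm1_mean_queue_wait_ms(utilization, service_time_ms):
    """Mean time spent waiting in queue for an M/M/1 system."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_ms * utilization / (1 - utilization)


# Hypothetical 50 ms mean service time per request.
for rho in (0.5, 0.8, 0.9, 0.95):
    wait = mm1_mean_queue_wait_ms(rho, 50)
    print(f"utilization={rho:.2f} mean queue wait ~ {wait:.0f} ms")
```

Mean queue wait is 1×, 4×, 9×, and 19× the service time at 50%, 80%, 90%, and 95% utilization respectively, which is why pushing utilization from 80% to 95% costs far more latency than pushing it from 50% to 65%.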
Pitfall: Hand-waving multi-tenancy as “add rate limiting.”
Rate limits are necessary but insufficient. You should also discuss per-tenant queues, weighted fairness, reserved capacity, priority classes, request size limits, and metrics that prove one tenant cannot degrade another tenant’s latency.
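One way to illustrate per-tenant queues with weighted fairness is deficit round robin, with request cost measured in tokens so that a tenant sending long prompts cannot crowd out others. This is a sketch of one possible policy, not the only or recommended one; the `WeightedFairScheduler` class, the tenant names, and the quantum size are hypothetical.

```python
from collections import deque


class WeightedFairScheduler:
    """Deficit round robin over per-tenant queues; cost is measured in tokens."""

    def __init__(self, weights, quantum_tokens=1000):
        self.weights = dict(weights)          # tenant -> relative share
        self.quantum = quantum_tokens         # tokens of credit per weight unit per round
        self.queues = {t: deque() for t in self.weights}
        self.deficit = {t: 0 for t in self.weights}

    def enqueue(self, tenant, request_id, token_cost):
        self.queues[tenant].append((request_id, token_cost))

    def next_round(self):
        """Run one scheduling round and return request IDs in dispatch order."""
        dispatched = []
        for tenant, queue in self.queues.items():
            if not queue:
                self.deficit[tenant] = 0      # idle tenants do not bank credit
                continue
            self.deficit[tenant] += self.quantum * self.weights[tenant]
            # Dispatch from the head while the tenant has enough credit.
            while queue and queue[0][1] <= self.deficit[tenant]:
                request_id, cost = queue.popleft()
                self.deficit[tenant] -= cost
                dispatched.append(request_id)
            if not queue:
                self.deficit[tenant] = 0
        return dispatched


sched = WeightedFairScheduler({"big": 1, "small": 1})
for i in range(3):
    sched.enqueue("big", f"b{i}", 2000)       # long prompts
for i in range(2):
    sched.enqueue("small", f"s{i}", 500)      # short prompts
print(sched.next_round())  # ['s0', 's1']
print(sched.next_round())  # ['b0']
```

In the usage example, the tenant with 2000-token requests needs two rounds of credit per dispatch, so the tenant with short requests is served first instead of waiting behind the long prompts.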
Connections
The interviewer may pivot from inference APIs into load balancing, distributed rate limiting, autoscaling, streaming API design, or idempotent job processing. They may also probe adjacent ML-serving concepts such as model rollout, canarying, shadow traffic, and per-version observability, but a Software Engineer should frame these as platform reliability and deployment concerns.
Further reading
- The Tail at Scale: foundational paper on why `p99` latency dominates large-scale user-facing systems.
- vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention: useful for understanding continuous batching and KV-cache memory pressure in LLM serving.
- SRE Book: Handling Overload: practical patterns for admission control, load shedding, and graceful degradation.
Practice questions
- Design Model Weight Distribution (Anthropic · Software Engineer · Onsite · medium)
- Design GPU inference request batching (Anthropic · Software Engineer · Onsite · none)
- Design a batch inference API (Anthropic · Software Engineer · Onsite · hard)
- Design an LLM-based binary classifier (Anthropic · Software Engineer · Technical Screen · medium)
- Review an inference API design for scale (Anthropic · Software Engineer · Onsite · hard)
- Design a low-latency ML inference API (Anthropic · Software Engineer · Onsite · hard)
- Design a GPU inference API (Anthropic · Software Engineer · Onsite · hard)
- Design a prompt processing backend (Anthropic · Software Engineer · Onsite · hard)
Related concepts
- ML System Design, Recommenders, Forecasting And Allocation (Machine Learning)
- Production ML Pipelines And System Design (ML System Design)
- ML Frameworks, Model Compilation, And Parallelism (ML System Design)
- LLM Inference Serving, Batching, And KV Cache
- Machine Learning System Design For Real-Time Decisions (Machine Learning)
- Supervised ML Workflows, Interpretability And Deployment (Machine Learning)