ML Inference APIs And GPU Batching

What's being tested

These interviews test whether you can design a GPU-backed inference service that meets real latency, throughput, reliability, and cost constraints under multi-tenant load. The interviewer is probing for distributed systems judgment: queueing, batching, routing, autoscaling, failure isolation, observability, and API semantics. For Anthropic, this matters because inference infrastructure sits directly on the product path: poor batching wastes expensive accelerators, poor isolation hurts customers, and poor overload behavior can take down shared capacity. A strong Software Engineer answer should stay at the serving-platform layer, not drift into model architecture or training methodology.

Core knowledge

Latency SLOs must be decomposed across the request path: client edge, auth/rate limit, routing, queue wait, GPU execution, post-processing, and streaming. Track p50, p95, p99, timeout rate, and queue wait separately; an aggregate p99 hides whether the bottleneck is scheduling or compute.
Dynamic batching groups requests arriving within a short window to improve GPU utilization. The key knobs are max batch size, max batch delay, and a per-batch token budget, and only requests for the same model and version can share a batch. Larger batches increase throughput but add queueing delay, so the policy must be tied to an SLO like “p95 first-token latency < 500 ms.”
Queueing theory gives the basic danger signal: as utilization $\rho = \lambda / \mu$ approaches 1, queueing latency grows nonlinearly. In practice, keep serving pools below roughly 60–80% sustained utilization if p99 latency matters, because burstiness and long prompts create tail amplification.
Continuous batching for autoregressive LLM serving differs from simple request batching: new requests can join while existing requests are still decoding, and finished sequences leave the batch immediately instead of waiting for the slowest one. Systems like `vLLM` use PagedAttention to reduce KV-cache fragmentation and improve memory utilization during long-running generation.
Prefill vs decode are different workload phases. Prefill processes the input prompt in parallel and is compute-heavy; decode generates one or a few tokens per step and is often memory-bandwidth-bound and sensitive to scheduling. A good design may route, batch, and measure these phases separately.
Admission control protects the service before it collapses. Use per-tenant rate limits, max in-flight requests, max prompt length, max output tokens, and queue deadlines. If estimated work exceeds capacity, reject early rather than accepting work that will time out in the queue: 429 when a tenant exceeds its own limit, 503 when the service as a whole is out of capacity.
Routing should consider model ID, model version, tenant tier, region, hardware type, current queue depth, GPU memory availability, and request shape. A basic design uses a control plane for fleet state and a data-plane router using least-loaded or weighted routing with health checks.
Multi-tenancy isolation requires fairness, not just authentication. Common approaches include per-tenant queues, weighted fair queueing, token-bucket rate limits, reserved capacity for high-priority tenants, and noisy-neighbor detection. Without this, one customer with long prompts can consume KV cache and degrade everyone else’s p99.
GPU memory management is often the hard limit. Model weights, activation buffers, and KV cache compete for memory. For LLMs, KV cache grows roughly with batch size × sequence length × layers × hidden dimension, so “batch more” can trigger out-of-memory failures unless capped by a token budget.
Failure handling should distinguish retryable and non-retryable failures. A request that fails because a router or worker crashed can be retried if it is idempotent and no output has reached the client; a partially streamed response usually cannot be retried transparently, so the failure has to be surfaced to the client. Use deadlines, cancellation propagation, circuit breakers, and connection draining during deploys.
Autoscaling should use workload-aware signals, not just CPU. Better signals include queue depth by model, queue age, GPU utilization, tokens/sec, batch fullness, KV-cache pressure, and SLO burn rate. Scale-up is slow for GPU nodes, so keep warm capacity or predictive buffers for known traffic spikes.
Observability needs cardinality discipline and request-shape breakdowns. Emit metrics for time_to_first_token, tokens_per_second, queue_wait_ms, batch_size, prompt_tokens, completion_tokens, OOM count, retry count, and per-tenant throttling. Logs and traces should include request IDs but avoid storing sensitive prompt text by default.
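
The KV-cache growth mentioned under GPU memory management can be made concrete with a back-of-the-envelope estimate. The sketch below assumes standard attention where each token stores one key and one value vector per layer; the function names and the example model shape (32 layers, 32 KV heads, head dimension 128, 2-byte elements) are hypothetical and chosen only for easy arithmetic. For plain multi-head attention, `num_kv_heads * head_dim` equals the hidden dimension.

```python
def kv_cache_bytes(batch_size, seq_len, num_layers, num_kv_heads, head_dim,
                   bytes_per_elem=2):
    """Approximate KV-cache size in bytes.

    The leading 2 accounts for storing both a key and a value tensor
    per layer for every token in every sequence.
    """
    return (2 * batch_size * seq_len * num_layers
            * num_kv_heads * head_dim * bytes_per_elem)


def max_tokens_in_flight(free_bytes, num_layers, num_kv_heads, head_dim,
                         bytes_per_elem=2):
    """Upper bound on total cached tokens that fit in free_bytes.

    This is the number a scheduler's token budget must stay under.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return free_bytes // per_token


GIB = 2**30
# Hypothetical model shape; 8 sequences of 4096 tokens each.
print(kv_cache_bytes(8, 4096, 32, 32, 128) / GIB)        # 16.0
print(max_tokens_in_flight(20 * GIB, 32, 32, 128))       # 40960
```

Under these assumptions each token costs 0.5 MiB of cache, so a batch of 8 sequences at 4096 tokens needs about 16 GiB on top of the model weights. This is why the cap is usually expressed in tokens rather than in requests.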

Worked example

For “Design GPU inference request batching,” start by clarifying the workload: “Are these LLM text generation requests, embeddings, or classification? Do we optimize for time-to-first-token, total completion latency, throughput, or cost? Are requests streamed, and do tenants have different SLOs?” Then declare assumptions: a shared fleet serves multiple model versions, requests have variable prompt and output lengths, and the service has a strict p95 latency target.

Organize the answer around four pillars: request ingress and validation, batching scheduler, GPU worker execution, and observability/autoscaling. At ingress, describe auth, tenant rate limits, request deadlines, token limits, and routing by model/version. In the scheduler, explain per-model queues, compatibility constraints, max batch delay, max batch size, and token-budget-based batching rather than only count-based batching. In the worker, mention loading model weights, managing KV cache, streaming partial outputs, handling cancellation, and returning structured errors.
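
The ingress checks described above can be sketched as a small admission function. This is a minimal illustration, not a production design: the `TokenBucket` class, the `admit` function, the 8192-token prompt cap, and the choice of 400 for an oversized prompt are all assumptions made for the example. Time is passed in explicitly so the behavior is deterministic.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    rate: float            # tokens refilled per second
    capacity: float        # maximum burst size
    tokens: float = 0.0    # current balance
    last: float = 0.0      # time of the last refill

    def try_take(self, cost: float, now: float) -> bool:
        """Refill based on elapsed time, then spend cost if affordable."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if cost <= self.tokens:
            self.tokens -= cost
            return True
        return False


def admit(bucket, prompt_tokens, max_output_tokens, est_queue_wait_s,
          deadline_s, now, max_prompt_tokens=8192):
    """Return an HTTP-style status code for an incoming request."""
    if prompt_tokens > max_prompt_tokens:
        return 400   # request shape is invalid regardless of load
    if est_queue_wait_s >= deadline_s:
        return 503   # would time out in the queue, so shed it now
    # Charge the tenant for worst-case work: prompt plus reserved output.
    if not bucket.try_take(prompt_tokens + max_output_tokens, now):
        return 429   # this tenant is over its own budget
    return 200


bucket = TokenBucket(rate=100, capacity=1000, tokens=1000)
print(admit(bucket, 500, 300, est_queue_wait_s=0.1, deadline_s=2.0, now=0.0))
print(admit(bucket, 500, 300, est_queue_wait_s=0.1, deadline_s=2.0, now=1.0))
```

The capacity check runs before the bucket is charged, so a tenant is not billed against its rate limit for work the service refused.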

A strong tradeoff to flag is throughput versus tail latency: waiting 20 ms to build a fuller batch may improve GPU utilization materially, but doing so for an interactive tenant can violate first-token SLOs. You can propose separate classes, such as “interactive” with small max delay and “batch/offline” with larger delay and lower priority. Close by saying that, with more time, you would detail load testing methodology, failure injection, and cost controls such as model placement and warm-pool sizing.

A second angle

For “Design a prompt processing backend,” the same serving concepts apply, but the emphasis shifts toward asynchronous job orchestration and durable state. Instead of optimizing only for interactive latency, you may need an API that accepts a job, returns a job ID, supports idempotent submission, and lets clients poll or receive callbacks. Batching still matters at the GPU layer, but the frontend design also needs job state transitions such as queued, running, succeeded, failed, and cancelled. Retries and dead-letter handling become more central because a background job can survive client disconnects, unlike a purely synchronous inference call.
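
The job lifecycle above can be sketched as a small state machine with idempotent submission. The state names come from the paragraph; the allowed transitions, the `JobStore` class, and the idempotency-key parameter are assumptions for illustration, and the in-memory dictionaries stand in for what would need to be a durable store.

```python
import uuid

# Assumed transition table: terminal states have no outgoing edges.
ALLOWED = {
    "queued": {"running", "cancelled"},
    "running": {"succeeded", "failed", "cancelled"},
    "succeeded": set(),
    "failed": set(),
    "cancelled": set(),
}


class JobStore:
    def __init__(self):
        self._jobs = {}     # job_id -> {"state": ..., "payload": ...}
        self._by_key = {}   # idempotency key -> job_id

    def submit(self, idempotency_key, payload):
        """Create a job, or return the existing job ID for a repeated key."""
        if idempotency_key in self._by_key:
            return self._by_key[idempotency_key]
        job_id = str(uuid.uuid4())
        self._jobs[job_id] = {"state": "queued", "payload": payload}
        self._by_key[idempotency_key] = job_id
        return job_id

    def transition(self, job_id, new_state):
        """Move a job to new_state, rejecting illegal transitions."""
        current = self._jobs[job_id]["state"]
        if new_state not in ALLOWED[current]:
            raise ValueError(f"illegal transition {current} -> {new_state}")
        self._jobs[job_id]["state"] = new_state

    def state(self, job_id):
        return self._jobs[job_id]["state"]


store = JobStore()
first = store.submit("client-key-1", {"prompt": "..."})
again = store.submit("client-key-1", {"prompt": "..."})  # client retry
print(first == again)   # True: the retry did not create a second job
```

A design that retries failed jobs would add an explicit edge such as failed to queued with an attempt counter, and route jobs that exhaust their attempts to a dead-letter queue.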

Common pitfalls

Pitfall: Designing batching as “collect N requests, run them, repeat.”

That answer misses variable sequence lengths, deadlines, tenant priority, and GPU memory limits. A better answer says batches are formed by compatibility and token budget, constrained by max wait time and per-request deadlines.
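
A minimal sketch of that better answer follows. It is one possible policy, not a reference scheduler: the `Request` fields and the `form_batch` function are hypothetical, the queue is assumed to be in arrival order, and admission control is assumed to have already rejected any single request larger than the token budget.

```python
from dataclasses import dataclass


@dataclass
class Request:
    req_id: str
    model: str          # model/version key; only identical keys share a batch
    tokens: int         # prompt tokens plus reserved output tokens
    enqueued_at: float  # arrival time in seconds
    deadline: float     # absolute time after which the result is useless


def form_batch(queue, model, now, max_batch_size, token_budget, max_wait_s):
    """Return (batch, expired). An empty batch means keep waiting."""
    candidates = [r for r in queue if r.model == model]
    expired = [r for r in candidates if r.deadline <= now]
    live = [r for r in candidates if r.deadline > now]

    # Strict FIFO fill: stop at the first request that does not fit,
    # so a large request is never starved by smaller later arrivals.
    batch, used = [], 0
    for r in live:
        if len(batch) == max_batch_size or used + r.tokens > token_budget:
            break
        batch.append(r)
        used += r.tokens

    if not batch:
        return [], expired
    is_full = len(batch) < len(live)  # something was left behind
    waited_long_enough = now - batch[0].enqueued_at >= max_wait_s
    if is_full or waited_long_enough:
        return batch, expired
    return [], expired


queue = [
    Request("a", "m1", 300, 0.000, 10.0),
    Request("b", "m2", 100, 0.000, 10.0),   # different model, not compatible
    Request("c", "m1", 500, 0.001, 10.0),
    Request("d", "m1", 400, 0.002, 10.0),   # would exceed the token budget
    Request("e", "m1", 50, 0.000, 0.004),   # deadline already passed
]
batch, expired = form_batch(queue, "m1", now=0.005, max_batch_size=8,
                            token_budget=1000, max_wait_s=0.02)
print([r.req_id for r in batch], [r.req_id for r in expired])
```

Here the batch dispatches with requests `a` and `c` because adding `d` would exceed the 1000-token budget, even though the count cap of 8 is nowhere near reached. Expired requests are returned separately so the caller can fail them fast instead of spending GPU time on them.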

Pitfall: Talking only about GPU utilization and ignoring user-visible latency.

High utilization is not the product goal; it is a cost-efficiency goal under an SLO. Interviewers expect you to reason about p95/p99, queue wait, time-to-first-token, overload behavior, and what happens when the system is near saturation.
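
To show why saturation matters, the textbook M/M/1 queue gives a mean queue wait of $W_q = \rho / (\mu - \lambda)$ with $\rho = \lambda / \mu$. A GPU serving pool is not an M/M/1 queue (batching, multiple workers, and heavy-tailed request sizes all break the model), so treat the sketch below as an illustration of the shape of the curve rather than a capacity formula. The function name and the 100 requests/second service rate are assumptions.

```python
def mm1_mean_queue_wait(arrival_rate, service_rate):
    """Mean time spent waiting in queue for an M/M/1 system, in seconds."""
    if arrival_rate >= service_rate:
        raise ValueError("unstable: arrival rate must be below service rate")
    rho = arrival_rate / service_rate
    return rho / (service_rate - arrival_rate)


service_rate = 100.0  # hypothetical: one server completes 100 requests/second
for arrival_rate in (50.0, 80.0, 95.0, 99.0):
    wait_ms = 1000 * mm1_mean_queue_wait(arrival_rate, service_rate)
    print(f"utilization {arrival_rate / service_rate:.2f}: "
          f"mean queue wait {wait_ms:.0f} ms")
```

Under these assumptions the mean wait goes from about 10 ms at 50% utilization to about 40 ms at 80%, 190 ms at 95%, and 990 ms at 99%. The last few points of utilization cost far more latency than the first fifty, which is the tradeoff an interviewer wants you to name.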

Pitfall: Hand-waving multi-tenancy as “add rate limiting.”

Rate limits are necessary but insufficient. You should also discuss per-tenant queues, weighted fairness, reserved capacity, priority classes, request size limits, and metrics that prove one tenant cannot degrade another tenant’s latency.
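
One way to make weighted fairness concrete is a virtual-finish-time scheduler in the spirit of weighted fair queueing. The sketch below is a simplified, self-clocked variant: the `WeightedFairQueue` class, the tenant weights, and the use of token count as request cost are assumptions for illustration.

```python
import heapq
from collections import defaultdict


class WeightedFairQueue:
    def __init__(self, weights):
        self.weights = weights                  # tenant -> relative share
        self.last_finish = defaultdict(float)   # tenant -> last finish tag
        self.virtual_time = 0.0
        self._heap = []
        self._seq = 0                           # tie-breaker: arrival order

    def enqueue(self, tenant, req_id, cost):
        """Tag the request with a virtual finish time and queue it."""
        # A backlogged tenant starts after its own previous request;
        # an idle tenant starts at the current virtual time.
        start = max(self.virtual_time, self.last_finish[tenant])
        finish = start + cost / self.weights[tenant]
        self.last_finish[tenant] = finish
        heapq.heappush(self._heap, (finish, self._seq, tenant, req_id))
        self._seq += 1

    def dequeue(self):
        """Serve the request with the smallest virtual finish time."""
        finish, _, tenant, req_id = heapq.heappop(self._heap)
        self.virtual_time = finish
        return tenant, req_id


wfq = WeightedFairQueue({"a": 1.0, "b": 1.0})
for i in (1, 2, 3):
    wfq.enqueue("a", f"a{i}", cost=100)   # tenant a floods the queue
wfq.enqueue("b", "b1", cost=100)          # tenant b arrives last
print([wfq.dequeue() for _ in range(4)])
```

Tenant `b` is served second even though it arrived fourth, because tenant `a`'s backlog pushes its own later requests to higher finish tags. Raising a tenant's weight shrinks its tags, which is one way to express reserved capacity or a priority tier.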

Connections

The interviewer may pivot from inference APIs into load balancing, distributed rate limiting, autoscaling, streaming API design, or idempotent job processing. They may also probe adjacent ML-serving concepts such as model rollout, canarying, shadow traffic, and per-version observability, but a Software Engineer should frame these as platform reliability and deployment concerns.
