An Anthropic ML system design onsite question: design a scalable, GPU-backed inference API that serves multiple ML models and LLMs to product services with low-latency SLOs. It evaluates API and request-lifecycle design, independent CPU/GPU scaling, dynamic/continuous batching, KV-cache memory management, model versioning, autoscaling on heterogeneous GPUs, and the diagnostic skill to fix a CPU-idle/GPU-saturated bottleneck with metrics rather than guesswork.
##### Question
Design a scalable, GPU-backed inference API for serving multiple ML models (including large autoregressive models such as LLMs) to product services. The system must support low-latency online inference with clear SLOs, scale from a small deployment to high traffic, and serve multiple model versions and tenants. Walk through the architecture end to end and reason about bottlenecks with metrics rather than scaling every component blindly.
Discuss:
1. **Public API shape and request lifecycle.** What does the synchronous prediction endpoint look like (request/response fields, idempotency, tenant identity, model version selection)? When do you also need an async / job-based API and streaming responses?
2. **Core architecture and data flow.** API gateway, frontend/CPU validation and preprocessing, scheduler/queue, dynamic batching layer, GPU inference workers, model registry/artifact store, and control plane. Describe the request flow through these components.
3. **Independent scaling of CPU and GPU components.** Which signals drive CPU autoscaling vs GPU-pool autoscaling, and why are they decoupled?
4. **Diagnostic scenario:** what would you do if CPU utilization is low but the GPUs are saturated? Walk through how you confirm the bottleneck and the ordered set of actions you'd take.
5. **Dynamic / continuous batching and SLO-aware scheduling.** How do you form batches under latency deadlines, ensure per-tenant fairness, and apply queueing and backpressure?
6. **GPU memory management.** Weights residency, KV/paged-attention cache sizing, quantization, tensor parallelism, warm pools and eviction, and per-tenant isolation (e.g. MIG).
7. **Model versioning, A/B routing, canarying, and rollbacks.** How does the registry and router support traffic splits and safe rollout/rollback?
8. **Autoscaling across heterogeneous GPU nodes** (different GPU types, throughput curves, bin-packing, prewarming, spot/preemptible handling).
9. **Model loading and warmup**, including lazy adapter (LoRA) loading.
10. **Reliability, observability, capacity planning, rollout strategy, cost controls, and security** (retry semantics, latency breakdown metrics, per-tenant quotas/billing, supply-chain and tenant isolation).
Design a scalable, GPU-backed inference API that serves multiple ML models — including large autoregressive models such as LLMs — to internal product services. The system must support low-latency online inference against explicit SLOs, scale from a small deployment to high traffic, and serve multiple model versions and tenants concurrently.
The central skill being tested is reasoning about bottlenecks with metrics rather than scaling every component blindly — in particular, recognizing that the GPU is the scarce, expensive, slow-to-provision resource and that the CPU path and GPU path scale and fail independently. Walk through the architecture end to end and justify each scaling and remediation decision from a specific signal.
Constraints & Assumptions
Workload mix: unary requests (classifiers/encoders) and long-running autoregressive generation (LLMs) coexist on the same platform.
Tenancy: multiple tenants share the fleet; you must enforce per-tenant quotas, fairness, isolation, and billing.
Versioning: several versions of each model are live at once for A/B testing, canarying, and instant rollback.
Resource asymmetry: CPU capacity is cheap and provisions in seconds; GPU capacity is expensive and provisions in minutes (cold start dominated by weight load + kernel warmup).
SLOs are first-class: assume per-route latency SLOs exist (e.g. a streaming TTFT target and a p95/p99 end-to-end target). State the exact targets you choose; scaling and admission decisions must reference them.
Assume an industry-standard inference runtime is available (vLLM / TensorRT-LLM / Triton or equivalent) — you do not need to implement attention kernels, but you should reason about what they buy you.
Clarifying Questions to Ask
What is the traffic profile — steady, diurnal, or spiky — and what is the ratio of unary to streaming/autoregressive requests?
What are the concrete latency SLOs per response mode (TTFT for streaming, p95/p99 end-to-end for unary), and what availability / error-budget target applies?
What is the model portfolio: how many distinct models, typical parameter counts, how many concurrent versions, and how many LoRA adapters per base model?
What GPU SKUs are available (e.g. A10 / A100 / H100), and may we mix on-demand and spot/preemptible capacity?
How many tenants, and what isolation guarantee is required between them (soft quotas vs hard hardware partitioning)?
Is the deployment single-region or multi-region, and are there data-residency constraints on tenant payloads?
Part 1 — Public API and request lifecycle
Define the public inference API. Specify the synchronous prediction endpoint (request/response fields, idempotency, tenant identity, model-version selection), then explain when a unary endpoint is insufficient and you need streaming and/or an async/job-based API instead.
What This Part Should Cover
A concrete endpoint with a versioned path, request fields (request/idempotency id, model + optional pinned version, inputs, generation parameters), and response fields (resolved version echoed back, outputs, `usage` for billing, latency).
Tenant identity sourced from auth (not the request body) and used for authZ, quota, fairness, and billing.
Clear triggers for streaming (TTFT SLO, mid-stream flow control, clean cancel on disconnect) vs async/job API (work exceeding the sync timeout, bulk scoring, off-peak scheduling).
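The request/response fields and the unary vs streaming vs async triggers above can be sketched as follows. This is a minimal illustration, not a prescribed schema: every name here (`PredictRequest`, `AuthContext`, `choose_mode`, `SYNC_TIMEOUT_S`) and the 30-second timeout are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Optional

SYNC_TIMEOUT_S = 30.0  # assumed sync timeout; pick the real value from the SLO


@dataclass
class AuthContext:
    """Tenant identity derived from the credential, never from the body."""
    tenant_id: str


@dataclass
class PredictRequest:
    request_id: str                      # also serves as the idempotency key
    model: str
    inputs: dict[str, Any]
    version: Optional[str] = None        # None means "router resolves default"
    params: dict[str, Any] = field(default_factory=dict)  # generation params
    stream: bool = False


@dataclass
class Usage:
    input_tokens: int
    output_tokens: int


@dataclass
class PredictResponse:
    request_id: str
    model: str
    resolved_version: str                # echoed back so callers can audit
    outputs: dict[str, Any]
    usage: Usage                         # basis for per-tenant billing
    latency_ms: float


def choose_mode(req: PredictRequest, est_runtime_s: float,
                has_ttft_slo: bool) -> str:
    """Pick the response mode from the triggers listed above."""
    if est_runtime_s > SYNC_TIMEOUT_S:
        return "async_job"               # work exceeds the sync timeout
    if req.stream or has_ttft_slo:
        return "stream"                  # TTFT matters, stream tokens
    return "unary"
```

For example, `choose_mode(PredictRequest("r1", "m", {"text": "hi"}), 1.0, False)` yields `"unary"`, while an estimated 120-second job is routed to the async path regardless of the `stream` flag.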
Part 2 — Core architecture and request flow
Lay out the components — API gateway, CPU frontend (validation/preprocessing), scheduler/queue, dynamic-batching layer, GPU inference workers, model registry/artifact store, and control plane — and trace a single request through them end to end, including the admission-control decision points.
What This Part Should Cover
Each component's responsibility, including which are stateless/CPU vs weights-resident/GPU.
The request flow: gateway (authN/Z, rate-limit, quota) → CPU validate/preprocess + version resolution + deadline computation → scheduler/queue → batcher → GPU worker → postprocess → respond, with metrics/trace/usage emitted.
Where admission control sits and what it does when a queue is too deep (shed, degrade, or admit with a deadline).
The registry as source of truth for versioned, signed/checksummed artifacts (weights, configs, tokenizer, adapters, rollout state).
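One way to picture the admission-control decision (shed, degrade, or admit with a deadline) is the sketch below. It assumes a simple wait estimate of queue depth divided by service rate; the function name, parameters, and the idea of a cheaper "degraded" compute path are illustrative assumptions, not part of the original question.

```python
from enum import Enum
from typing import Optional


class Decision(Enum):
    ADMIT = "admit"
    DEGRADE = "degrade"
    SHED = "shed"


def admit(queue_depth: int, service_rate_rps: float, deadline_s: float,
          est_compute_s: float,
          degraded_compute_s: Optional[float] = None) -> Decision:
    """Decide at the queue boundary whether a request can meet its deadline."""
    if service_rate_rps <= 0:
        return Decision.SHED             # no capacity is draining the queue
    # Rough estimate: requests ahead of us divided by the drain rate.
    est_wait_s = queue_depth / service_rate_rps
    if est_wait_s + est_compute_s <= deadline_s:
        return Decision.ADMIT
    # Assumed fallback: a cheaper path (smaller model, shorter output cap).
    if (degraded_compute_s is not None
            and est_wait_s + degraded_compute_s <= deadline_s):
        return Decision.DEGRADE
    return Decision.SHED                 # caller maps this to 429 + Retry-After
```

With a queue of 90 requests draining at 100 rps, a 1-second deadline, and 0.2 s of compute, the full path misses the deadline, so the request is degraded if a 0.05 s path exists and shed otherwise.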
Part 3 — Independent scaling of CPU and GPU, and the bottleneck diagnosis
First explain which signals drive CPU autoscaling vs GPU-pool autoscaling and why the two are deliberately decoupled.
Then work the diagnostic scenario: CPU utilization is low but the GPUs are saturated. Walk through how you confirm the bottleneck and the ordered set of actions you would take — not a grab-bag, but a sequence that starts cheap and escalates.
What This Part Should Cover
The contrasting scale signals: CPU on request rate / preprocess latency / pre-GPU queue depth; GPU on queue-wait, queue depth, GPU memory util, batch fill ratio, and p95/TTFT, not raw CPU%.
Why decoupling is necessary (cost asymmetry, slow GPU provisioning, statefulness) and the role of warm pools / scheduled scaling for GPU cold-start.
A confirm-first diagnosis using a per-stage latency breakdown to prove requests wait on the GPU, not pre-GPU.
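The confirm-first step can be sketched as a small classifier over the per-stage latency breakdown. The stage names follow the breakdown used elsewhere in this question; the thresholds and the function name `locate_bottleneck` are assumptions for illustration, and summing per-stage p95 values is only a rough proxy for end-to-end p95.

```python
GPU_STAGES = ("queue_wait", "batch_wait", "gpu_execution")
PRE_GPU_STAGES = ("gateway", "preprocess")


def locate_bottleneck(stage_p95_ms: dict[str, float], gpu_util: float,
                      util_threshold: float = 0.9,
                      share_threshold: float = 0.5) -> str:
    """Classify where requests spend their time before touching capacity."""
    total = sum(stage_p95_ms.values())
    if total <= 0:
        return "no_data"
    gpu_share = sum(stage_p95_ms.get(s, 0.0) for s in GPU_STAGES) / total
    pre_share = sum(stage_p95_ms.get(s, 0.0) for s in PRE_GPU_STAGES) / total
    # GPU-bound only if the time is spent waiting on or running on the GPU
    # AND the GPUs are actually saturated.
    if gpu_share >= share_threshold and gpu_util >= util_threshold:
        return "gpu_bound"
    if pre_share >= share_threshold:
        return "pre_gpu_bound"
    return "inconclusive"


breakdown = {"gateway": 2, "preprocess": 5, "queue_wait": 120,
             "batch_wait": 10, "gpu_execution": 60, "postprocess": 3}
print(locate_bottleneck(breakdown, gpu_util=0.97))
```

Only once the result is `gpu_bound` does it make sense to start the cheapest-first remediation sequence; an `inconclusive` result is a prompt to add instrumentation, not hardware.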
Part 4 — Dynamic / continuous batching and SLO-aware scheduling
Explain how you form batches under latency deadlines, how continuous batching differs for autoregressive models, how you guarantee per-tenant fairness, and how queueing and backpressure work.
What This Part Should Cover
A batch-trigger rule (size cap or wait cap or deadline) and an SLO budget that derives Wmax from total SLO minus compute time minus margin.
Continuous/in-flight batching for LLMs: admit/evict at token-step boundaries; the distinct prefill (compute-bound) vs decode (memory-bandwidth-bound) phases and why batching amortizes the per-step weight read.
Per-tenant fairness via weighted fair queuing and per-class queues to prevent head-of-line blocking; backpressure via 429 + Retry-After at the edge and worker→scheduler capacity hints.
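The batch-trigger rule and the Wmax budget can be written down directly. The sketch below is a minimal version under assumed names (`max_batch_wait_ms`, `should_dispatch`) and assumed millisecond units; a real batcher would also account for per-class queues and fairness.

```python
def max_batch_wait_ms(slo_ms: float, compute_ms: float,
                      margin_ms: float) -> float:
    """Wmax = total SLO - compute time - safety margin, floored at zero."""
    return max(0.0, slo_ms - compute_ms - margin_ms)


def should_dispatch(batch_size: int, max_batch_size: int,
                    oldest_wait_ms: float, w_max_ms: float,
                    ms_to_earliest_deadline: float,
                    compute_ms: float) -> bool:
    """Dispatch when any one of the three triggers fires."""
    if batch_size == 0:
        return False                                   # nothing to run
    size_cap = batch_size >= max_batch_size            # batch is full
    wait_cap = oldest_wait_ms >= w_max_ms              # oldest waited Wmax
    # Waiting any longer would make the tightest request miss its deadline.
    deadline = ms_to_earliest_deadline <= compute_ms
    return size_cap or wait_cap or deadline


w_max = max_batch_wait_ms(slo_ms=200, compute_ms=120, margin_ms=30)
print(w_max)  # 50.0
```

With a 200 ms SLO, 120 ms of compute, and a 30 ms margin, the batcher may hold a request for at most 50 ms before it must dispatch, even if the batch is only partly full.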
Part 5 — GPU memory management
Account for what occupies GPU HBM and how you manage it: weights residency, KV / paged-attention cache sizing, quantization, tensor parallelism, warm pools and eviction, and per-tenant isolation (e.g. MIG).
What This Part Should Cover
Sizing weights (≈ params × bytes/param) and KV cache per concurrent sequence, and recognizing KV cache as often the binding constraint.
Quantization (fit larger models / more KV), paged attention / KV paging (fragmentation + oversubscription bounds), and grouped-query attention's effect on KV size.
Tensor/sequence parallelism for models too large for one GPU; warm pools with LRU eviction + minimum-residency hysteresis; sticky placement to nodes holding resident weights.
Tenant isolation: MIG hard-partitioning vs guarded time-slicing.
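The sizing arithmetic above is easy to make concrete. The sketch below uses the standard per-token KV estimate (one K and one V vector per layer per KV head) and purely illustrative model dimensions; the 10% reserve for activations and runtime overhead is an assumption, as are all function names.

```python
def weight_bytes(params: int, bytes_per_param: float) -> float:
    """Weights footprint: params x bytes/param."""
    return params * bytes_per_param


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int) -> int:
    """KV cache per token: K and V, for every layer and KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def max_concurrent_sequences(hbm_bytes: float, weights: float, seq_len: int,
                             per_token: int,
                             reserve_frac: float = 0.1) -> int:
    """How many full-length sequences fit in HBM after weights + reserve."""
    budget = hbm_bytes * (1 - reserve_frac) - weights
    if budget <= 0:
        return 0
    return int(budget // (seq_len * per_token))


# Illustrative: 7B params in fp16 on an 80 GB GPU, 4096-token sequences.
w = weight_bytes(7_000_000_000, 2)
mha = kv_bytes_per_token(layers=32, kv_heads=32, head_dim=128, bytes_per_elem=2)
gqa = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, bytes_per_elem=2)
print(max_concurrent_sequences(80e9, w, 4096, mha))  # 27
print(max_concurrent_sequences(80e9, w, 4096, gqa))  # 108
```

In this example the weights take 14 GB, yet concurrency is capped at 27 sequences by KV cache alone; cutting KV heads from 32 to 8 (grouped-query attention) raises that to 108, which is why KV cache is often the binding constraint.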
Part 6 — Versioning, canary, heterogeneous autoscaling, and warmup
Cover the rollout and capacity machinery: (a) model versioning, A/B routing, canarying, and rollback via the registry and router; (b) autoscaling across heterogeneous GPU node types (different throughput curves, bin-packing, prewarming, spot/preemptible handling); and (c) model loading and warmup, including lazy LoRA adapter loading.
What This Part Should Cover
Registry as source of truth; pinned-version resolution; traffic splits (tenant-scoped or sticky-by-user); canary ramp behind SLO/error/cost guardrails with statistical comparison and auto-rollback (drain + revert routing, keep prior version warm).
Per-GPU-type node groups with independently measured throughput curves; scale signals (per-class queue depth, deadline-miss rate, TTFT p95, fill ratio targeting ~60–85% util); bin-packing by VRAM + throughput; prewarming; spot/preemptible for non-critical pools with checkpoint + drain-on-preemption.
Load/warmup sequence: verify signed artifact → stage to HBM → compile/autotune → warmup over representative shapes → health-check before routing; lazy LoRA loading sharing base weights with LRU eviction.
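Sticky traffic splits and a canary guardrail can be sketched as below. This is a simplified illustration: the hashing scheme, the fixed ratio thresholds, and the policy of holding (rather than rolling back) on a cost-only regression are all assumptions, and a production guardrail would use a statistical comparison rather than point thresholds.

```python
import hashlib


def pick_version(sticky_key: str, splits: dict[str, float]) -> str:
    """Map a tenant or user key to a version, stable across requests."""
    digest = hashlib.sha256(sticky_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64   # uniform in [0, 1)
    cumulative = 0.0
    for version, weight in sorted(splits.items()):
        cumulative += weight
        if bucket < cumulative:
            return version
    return sorted(splits)[-1]   # guard if weights sum to slightly under 1


def canary_verdict(baseline: dict[str, float], canary: dict[str, float],
                   max_latency_ratio: float = 1.05,
                   max_error_delta: float = 0.001,
                   max_cost_ratio: float = 1.02) -> str:
    """Compare canary against baseline on error, latency, and cost."""
    if canary["error_rate"] - baseline["error_rate"] > max_error_delta:
        return "rollback"
    if canary["p95_ms"] > baseline["p95_ms"] * max_latency_ratio:
        return "rollback"
    # Assumed policy: a cost-only regression pauses the ramp for review.
    if canary["cost_per_req"] > baseline["cost_per_req"] * max_cost_ratio:
        return "hold"
    return "promote"
```

Because `pick_version` hashes the key rather than drawing a random number, the same user or tenant always lands on the same version for a given split, which keeps A/B comparisons clean and avoids mid-session version flips.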
Part 7 — Reliability, observability, capacity planning, cost, and security
Address the cross-cutting production concerns: retry semantics by failure point, the latency-breakdown metrics that make diagnosis possible, per-tenant quotas/billing, capacity planning, cost controls, and security (supply-chain integrity and tenant isolation).
What This Part Should Cover
Latency as a per-stage breakdown (gateway → preprocess → queue-wait → batch-wait → GPU execution → postprocess), the same breakdown that powers the Part 3 diagnosis, plus GPU util/mem, fill ratio, tokens/s, error/timeout rate, KV/prefix-cache hit rate, per-tenant usage; multi-window burn-rate alerts on TTFT/p95/availability.
Retry semantics partitioned by failure point (pre-execution retryable; mid-stream only if checkpointed, else fail + bill partial; node death → drain + re-route); timeouts, circuit breakers, inference-level health checks, graceful drain.
Capacity planning from per-GPU-type throughput curves × historical demand; cost controls (precision by tier, warm-pool right-sizing with draining, spot/on-demand mix, per-tenant spend caps).
Security: API keys/OIDC + mTLS, per-tenant scopes, encryption in transit/at rest, network-policy + MIG/cgroup isolation, signed containers and model artifacts with SBOM/supply-chain checks.
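The retry semantics partitioned by failure point can be expressed as a small decision table. The enum names, the `max_attempts` default, and the plain `FAIL` outcome for exhausted pre-execution retries are assumptions added for this sketch.

```python
from enum import Enum


class FailurePoint(Enum):
    PRE_EXECUTION = "pre_execution"   # failed before any GPU work started
    MID_STREAM = "mid_stream"         # failed after tokens were emitted
    NODE_DEATH = "node_death"         # worker or GPU node lost


class Action(Enum):
    RETRY = "retry"
    RESUME_FROM_CHECKPOINT = "resume_from_checkpoint"
    FAIL_AND_BILL_PARTIAL = "fail_and_bill_partial"
    DRAIN_AND_REROUTE = "drain_and_reroute"
    FAIL = "fail"


def retry_action(point: FailurePoint, checkpointed: bool = False,
                 attempts: int = 0, max_attempts: int = 2) -> Action:
    """Choose the recovery action from where the request failed."""
    if point is FailurePoint.NODE_DEATH:
        return Action.DRAIN_AND_REROUTE
    if point is FailurePoint.PRE_EXECUTION:
        # Nothing was executed or billed, so retrying is safe while the
        # idempotency key still protects against duplicates.
        return Action.RETRY if attempts < max_attempts else Action.FAIL
    # Mid-stream: only resumable if generation state was checkpointed.
    if checkpointed:
        return Action.RESUME_FROM_CHECKPOINT
    return Action.FAIL_AND_BILL_PARTIAL
```

Keeping this logic in one place makes the billing consequence of each failure mode explicit: only the un-checkpointed mid-stream case charges for partial output.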
What a Strong Answer Covers
Across all parts, a strong answer is distinguished less by listing components than by the discipline of tying every decision to a metric. Beyond the per-Part dimensions above, look for:
SLOs pinned up front and then used as the yardstick for batching wait-time, autoscaling targets, admission control, and canary guardrails, not bolted on at the end.
The GPU treated as the scarce resource the whole design optimizes around, with the CPU path kept elastic enough to never be the bottleneck.
Correct LLM-specific reasoning throughout: streaming/TTFT, continuous batching, prefill-vs-decode bottleneck asymmetry, KV-cache-as-capacity-limit, paged attention, all treated as first-class constraints, not afterthoughts.
Metric-driven remediation in the diagnostic scenario (confirm before acting, escalate cheapest-first) rather than reflexively adding hardware.
Multi-tenancy carried end to end: identity, quota, fairness, isolation, and billing appearing consistently across API, scheduling, memory, and security.
Follow-up Questions
Your TTFT p95 is healthy but inter-token latency degrades badly under load. Given the prefill-vs-decode distinction in Part 4, where do you look first, and which knob do you turn?
A single tenant's bursty traffic is starving everyone else despite per-tenant rate limits at the gateway. Why might gateway rate limits be insufficient, and what in the scheduler (Part 4) and memory layer (Part 5) actually enforces fairness?
A canary at 5% shows equal p95 latency but a 3% higher token cost per request. Walk through how your guardrails (Part 6) should treat a cost regression with no latency regression.
Spot GPUs in a non-critical pool are being reclaimed mid-generation. How do checkpointing and drain-on-preemption (Parts 5–7) bound the user-visible impact, and what do you bill?