This question tests a candidate's ability to critically evaluate ML system designs for production-scale inference APIs, covering multi-tenancy, GPU resource constraints, and streaming token delivery. It assesses architectural reasoning across reliability, latency trade-offs, and capacity planning in the ML Systems domain.
You are reviewing another engineer’s design doc for a machine-learning inference API. Critique and improve it with a focus on distributed systems: clarify product and latency/availability SLOs; estimate throughput and capacity; propose autoscaling, batching, and GPU/accelerator scheduling; handle model loading, versioning, and rollback; design multi-tenant isolation and rate limiting; prevent overload with backpressure, queues, and circuit breakers; define idempotency, retries, and timeouts; mitigate cold starts; specify caching strategy (weights, tokens) and token streaming; plan traffic shaping (canary, A/B), shadowing, and safe rollback; define monitoring, alerting, and error budgets; address privacy, safety filters, audit logs, and cost controls. Provide a high-level architecture and call out key trade-offs.
System Design Review: A Machine-Learning Inference API at Scale
Background
You are reviewing a teammate's design document for a production machine-learning inference API that serves text-generation models (e.g., chat/completions) with token streaming. The service is multi-tenant and must run across multiple availability zones (AZs) on GPUs/accelerators.
Assume typical LLM workloads — a prompt prefill phase followed by token-by-token decode — with dynamic batching and a mix of small ("fast") and large ("quality") model SKUs. The system must support safe model rollouts, strong SLOs, and cost controls.
This is a design-review exercise, not a greenfield design. Your job is to critique the document and propose concrete improvements: separate what is right from what is missing or wrong, and push the design to a production bar. Work through the Parts below. Lead with the few issues that change the architecture; do not nitpick formatting.
Constraints & Assumptions
Multi-tenant, multi-AZ, GPU/accelerator-backed; mix of small and large model SKUs.
LLM serving physics apply: requests have a prefill phase (process the prompt) and a decode phase (generate output token by token); these consume different resources.
Latency is perceived as two numbers, not one: time-to-first-token (TTFT) and inter-token latency (ITL).
Concurrency is bounded by KV-cache memory on each accelerator, which grows with batch size and context length.
Treat absolute SLO numbers, per-GPU token rates, and capacity figures as quantities you would calibrate from load tests / benchmarks, not memorized constants. State the shape of the reasoning and the formulas; pick illustrative numbers only to demonstrate the method.
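To make the KV-cache constraint concrete, here is a minimal back-of-envelope sketch. It assumes a standard decoder-only transformer that stores one key and one value vector per layer, per KV head, per token. The function names, model dimensions, and HBM budget are illustrative placeholders, not figures from the design doc; calibrate the real numbers from benchmarks.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies: a key and a value per layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_concurrent_sequences(kv_budget_bytes: int, tokens_per_sequence: int,
                             bytes_per_token: int) -> int:
    """Worst-case concurrency if every sequence reserves its full context length."""
    return kv_budget_bytes // (tokens_per_sequence * bytes_per_token)


# Hypothetical model: 32 layers, 8 KV heads, head_dim 128, fp16 (2 bytes).
per_token = kv_bytes_per_token(32, 8, 128)
# Assumption: 40 GiB of HBM remains for KV after weights and activations.
budget = 40 * 1024**3
print(per_token, max_concurrent_sequences(budget, 4096, per_token))
```

Doubling either the context length or the batch size doubles the reservation, which is why tail lengths rather than means decide how many requests fit on a device.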
Clarifying Questions to Ask
Scope the whole review before critiquing any single area:
What is the traffic profile: peak QPS, the distribution of prompt-token and output-token lengths, and the small-vs-large SKU mix? (Tail length, not the mean, drives KV memory and tail latency.)
What SLO tier(s) exist (e.g., interactive low-latency vs. async batch), and is there an explicit availability target and error budget already?
Which endpoints are in scope: streaming generations only, or also non-streaming, embeddings, and async batch?
What are the isolation and compliance requirements between tenants (data residency, retention, "no training on customer data" defaults)?
What hardware is assumed (accelerator type, HBM per device), and is MIG / partitioning available?
How fast must rollback of a bad model version be, and what is the acceptable blast radius of a rollout?
Part 1 — Product Scope, APIs, and SLOs
Critique how the doc defines its surface area and its latency/availability SLOs, then propose improvements. Pin down which endpoints exist and their distinct latency profiles. Define latency SLOs appropriate to streaming generation, and an availability SLO with an explicit error budget.
What This Part Should Cover
Splitting latency into TTFT and ITL (per-token) SLOs, set per model tier, rather than one blended P99.
A correct availability definition (success excluding intentional 4xx) with a concrete error budget (e.g., 99.9% ⇒ ~43 min/month) tied to rollout gates and alerts.
Distinguishing endpoint profiles (streaming generations vs. embeddings vs. async batch) and a streaming-specific SLI such as stream-completion rate.
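The error-budget arithmetic and the availability definition can be sketched as follows. This is one reasonable reading of "success excluding intentional 4xx": client-caused 4xx responses are dropped from both the numerator and the denominator. The function names and the 30-day window are assumptions for illustration.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for an availability SLO."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return (1.0 - slo) * window_days * 24 * 60


def availability(total: int, server_errors: int, intentional_4xx: int) -> float:
    """Success ratio over eligible requests; intentional 4xx are not eligible."""
    eligible = total - intentional_4xx
    if eligible <= 0:
        return 1.0  # no eligible traffic means nothing failed
    return (eligible - server_errors) / eligible


print(round(error_budget_minutes(0.999), 1))  # 43.2 minutes per 30 days
```

Whether a 429 from load shedding counts as "intentional" is a policy decision the doc should state explicitly, since it changes what the SLO rewards.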
Part 2 — Throughput and Capacity Planning
The doc estimates capacity from per-request latency and QPS. Critique that approach and produce a correct capacity model. Estimate the GPUs needed given model characteristics (prefill and decode tokens/s per GPU) and average request sizes, then add headroom and regional redundancy.
What This Part Should Cover
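A tokens/sec capacity model that sizes prefill and decode independently (e.g., $\mathrm{RPS}_{\text{prefill}} = T_{\text{prefill}} / L$, $\mathrm{RPS}_{\text{decode}} = T_{\text{decode}} / O$, where $T$ is the measured per-GPU tokens/s for that phase and $L$ and $O$ are the average prompt and output lengths in tokens) and takes the binding constraint.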
Distinguishing per-GPU batch utilization from fleet utilization (headroom for bursts/jitter/autoscale lag) without double-counting, plus zone redundancy (survive losing one of $z$ AZs).
Recognizing the KV-cache memory ceiling on concurrency and naming a mitigation (paged/block KV).
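The method above can be sketched in a few lines. Assumptions, all labeled here because none come from the design doc: `t_prefill` and `t_decode` are per-GPU tokens/s measured at the intended batch utilization (so headroom is applied only once, at fleet level), the per-GPU request rate is the smaller of the two phase limits, and surviving the loss of one of `zones` AZs multiplies the fleet by zones / (zones - 1). The numbers in the example are illustrative only.

```python
import math


def gpus_needed(peak_rps: float, t_prefill: float, t_decode: float,
                prompt_tokens: float, output_tokens: float,
                fleet_utilization: float = 0.5, zones: int = 3) -> int:
    """Size a fleet from token throughput, fleet headroom, and N-1 zone redundancy."""
    if zones < 2:
        raise ValueError("zone redundancy needs at least two zones")
    rps_prefill = t_prefill / prompt_tokens   # requests/s one GPU can prefill
    rps_decode = t_decode / output_tokens     # requests/s one GPU can decode
    rps_per_gpu = min(rps_prefill, rps_decode)  # the binding constraint
    base = peak_rps / (rps_per_gpu * fleet_utilization)
    return math.ceil(base * zones / (zones - 1))


# Illustrative inputs: 100 RPS peak, 1000-token prompts, 250-token outputs.
print(gpus_needed(100, t_prefill=20000, t_decode=2000,
                  prompt_tokens=1000, output_tokens=250))
```

In this example decode is the binding phase (8 requests/s per GPU versus 20 for prefill), which is the typical shape for chat workloads. The result still has to be checked against the KV-cache ceiling, because a fleet can have spare compute and no memory left for new sequences.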
Part 3 — Autoscaling, Batching, and Accelerator Scheduling
Critique the autoscaling signal, the batching policy, and the accelerator scheduling plan; propose improvements for each. Define the scaling signals, the dynamic batching window/policy, and how GPUs are scheduled (partitioning, packing, preemption, warm pools).
What This Part Should Cover
Replacing GPU-utilization scaling with queue-/token-aware signals, with warm-pool promotion and asymmetric (fast-out, slow-in) cooldowns.
Continuous batching with an adaptive micro-batch admission window and length-aware grouping; separating prefill from decode scheduling (chunked prefill).
Accelerator scheduling: MIG vs. full-GPU trade-off by tier, bin-pack by KV footprint (not request count), and preemption of low-priority work.
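As an illustration of packing by KV footprint rather than request count, the sketch below uses first-fit-decreasing placement. It is a static simplification: a real continuous-batching scheduler admits and evicts incrementally. The footprint units and request ids are hypothetical; in practice a footprint would be roughly (prompt tokens + max output tokens) × KV bytes per token.

```python
from typing import Dict, List, Tuple


def pack_by_kv_footprint(requests: Dict[str, int],
                         gpu_kv_budget: int) -> Tuple[List[List[str]], List[str]]:
    """First-fit-decreasing placement; returns (request ids per GPU, rejected ids)."""
    bins: List[List[str]] = []
    free: List[int] = []
    rejected: List[str] = []
    # Largest footprints first; ties broken by id so the result is deterministic.
    for req_id, kv in sorted(requests.items(), key=lambda item: (-item[1], item[0])):
        if kv > gpu_kv_budget:
            rejected.append(req_id)  # can never fit on a single device
            continue
        for i, remaining in enumerate(free):
            if kv <= remaining:
                bins[i].append(req_id)
                free[i] -= kv
                break
        else:
            bins.append([req_id])    # open a new GPU
            free.append(gpu_kv_budget - kv)
    return bins, rejected


placement, too_big = pack_by_kv_footprint(
    {"a": 6, "b": 5, "c": 4, "d": 3, "e": 2, "huge": 12}, gpu_kv_budget=10)
print(placement, too_big)
```

Five requests fit on two devices here, whereas an even split by request count would ignore that "a" alone consumes more than half of one device's KV budget.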
Part 4 — Model Loading, Versioning, Rollout, and Rollback
Critique how the doc handles model versions, deployment, traffic shaping, and rollback; propose improvements. Cover an immutable model registry, preload/warm mechanisms, safe rolling updates with canary/A-B/shadow traffic, blast-radius limits, and fast rollback.
What This Part Should Cover
Immutable, content-addressed versions whose manifest pins weights and tokenizer, sampling defaults, and safety policy.
Canary/A-B/shadow with sticky assignment, gating on TTFT/error-rate and cost ($/1k tokens), with blast-radius limits.
Sub-minute rollback via a warm prior version wired to the burn-rate alarm; shadow traffic isolated so it can't steal production capacity or reach users.
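Two of these ideas are small enough to sketch: deriving a version id from the manifest contents, and sticky canary assignment by hashing. The manifest fields, the `sha256:` prefix, and the function names are hypothetical; the point is only that the id changes whenever any pinned component changes, and that a tenant lands in the same arm on every request.

```python
import hashlib
import json


def version_id(manifest: dict) -> str:
    """Content-addressed id: hash of the canonical (key-sorted) manifest JSON."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def in_canary(tenant_id: str, experiment: str, percent: float) -> bool:
    """Sticky assignment: the same tenant and experiment always map to the same bucket."""
    digest = hashlib.sha256(f"{experiment}:{tenant_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0.01% granularity
    return bucket < percent * 100


manifest = {
    "weights": "sha256:<weights digest>",      # placeholder digests
    "tokenizer": "sha256:<tokenizer digest>",
    "sampling_defaults": {"temperature": 0.7, "top_p": 0.95},
    "safety_policy": "policy-v3",
}
print(version_id(manifest)[:19], in_canary("tenant-42", "rollout-1", 5.0))
```

Salting the hash with the experiment name keeps the same tenants from being the canary population for every rollout, which helps bound cumulative blast radius.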
Part 5 — Multi-Tenant Isolation and Rate Limiting
Critique the multi-tenant story and improve it. Define per-tenant quotas, concurrency caps, fair queuing, and isolation across compute, memory, and network.
What This Part Should Cover
Token-based quotas (TPM/RPM, concurrency caps, max context/output) enforced at the edge and again at admission (so retries can't bypass).
Weighted-fair queuing with priority classes by plan.
Isolation tiers across compute (dedicated/MIG vs. packed), memory (per-tenant KV budget with degrade-on-overflow), and network (per-stream egress fairness).
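A tokens-per-minute quota is commonly implemented as a token bucket. The sketch below is a single-process illustration with an injectable clock so it can be tested deterministically; the class name and parameters are assumptions, and a production limiter would keep this state in a shared store so the edge and admission layers see the same balance.

```python
import time
from typing import Callable


class TokenBucket:
    """Per-tenant TPM limiter: refills continuously, allows bursts up to `burst`."""

    def __init__(self, tokens_per_minute: float, burst: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = tokens_per_minute / 60.0  # tokens refilled per second
        self.capacity = burst
        self.level = burst
        self.clock = clock
        self.last = clock()

    def try_consume(self, tokens: float) -> bool:
        """Charge `tokens` if available; False maps to a 429 for the caller."""
        now = self.clock()
        self.level = min(self.capacity, self.level + (now - self.last) * self.rate)
        self.last = now
        if tokens <= self.level:
            self.level -= tokens
            return True
        return False


fake_now = [0.0]
bucket = TokenBucket(tokens_per_minute=600, burst=100, clock=lambda: fake_now[0])
print(bucket.try_consume(80), bucket.try_consume(30))  # second call exceeds the balance
```

One design choice worth stating in the review: charge the prompt tokens plus the requested maximum output at admission and refund the unused part on completion, otherwise a tenant can exceed its quota with many long generations in flight.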
Part 6 — Overload Protection and Resilience
The doc has weak overload handling. Critique it and design "shed early, shed cheaply." Cover admission control, bounded queues with TTLs, backpressure, circuit breakers, graceful degradation, and a load-shedding priority order.
What This Part Should Cover
Bounded per-tenant queues + queue TTL with early 429 rejection and deadline propagation.
Circuit breakers per model/zone with failover, and graceful-degradation knobs (cap output, drop to a smaller tier) that degrade quality before availability.
A load-shedding priority order (shed batch/over-quota first; protect within-SLO paid streams last).
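"Shed early, shed cheaply" can be illustrated with a bounded queue that rejects at enqueue time and silently drops expired work at dequeue time, before any GPU cycles are spent. This is a minimal single-tenant sketch with hypothetical names; timestamps are passed in explicitly to keep it deterministic.

```python
import collections
from typing import Deque, Optional, Tuple


class BoundedTTLQueue:
    """Admission queue with a depth cap, a queue TTL, and per-request deadlines."""

    def __init__(self, max_depth: int, ttl_s: float):
        self.max_depth = max_depth
        self.ttl_s = ttl_s
        self._items: Deque[Tuple[float, float, str]] = collections.deque()

    def offer(self, req_id: str, now: float, deadline: float) -> bool:
        """Enqueue if there is room and time left; False means reply 429 right away."""
        if len(self._items) >= self.max_depth or deadline <= now:
            return False
        self._items.append((now, deadline, req_id))
        return True

    def poll(self, now: float) -> Optional[str]:
        """Return the next request still worth serving, dropping expired ones."""
        while self._items:
            enqueued, deadline, req_id = self._items.popleft()
            if now - enqueued <= self.ttl_s and now < deadline:
                return req_id
            # Expired in queue: the client has given up, so do not start prefill.
        return None


q = BoundedTTLQueue(max_depth=2, ttl_s=5.0)
print(q.offer("a", now=0.0, deadline=10.0), q.offer("b", now=1.0, deadline=3.0),
      q.offer("c", now=1.0, deadline=10.0))
```

The third offer is rejected immediately because the queue is full; a fast 429 with a retry hint costs far less than accepting the request and timing out after it has consumed KV memory.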
Part 7 — Idempotency, Retries, Timeouts, and Cold-Start Mitigation
Critique and improve two coupled areas: (a) idempotency, retry, timeout, and cancellation semantics — especially for streaming; and (b) cold-start mitigation for loading large models.
What This Part Should Cover
Idempotency keys (duplicate suppression with a TTL) and a retry policy bounded by failure timing, with backoff + jitter and deadline / min(client, tenant) timeout propagation.
Cancellation that reclaims the KV slot end-to-end on client disconnect.
Cold-start mitigation: warm pools (10–20% buffer), weight-cache hierarchy with integrity checks, and snapshot/restore tied to the autoscaler (scale-out = promotion from warm).
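The retry and timeout semantics can be sketched as below. Two assumptions are made explicit: "bounded by failure timing" is read here as "only retry if the failure happened before any token reached the client", and the backoff uses the full-jitter variant (a uniformly random delay up to an exponentially growing cap). Function names and default constants are illustrative.

```python
import random
from typing import Callable, Optional


def effective_deadline(now: float, client_timeout_s: float,
                       tenant_timeout_s: float) -> float:
    """The propagated deadline is the stricter of the client and tenant timeouts."""
    return now + min(client_timeout_s, tenant_timeout_s)


def next_retry_delay(attempt: int, now: float, deadline: float,
                     tokens_streamed: int, base_s: float = 0.1, cap_s: float = 5.0,
                     rng: Callable[[], float] = random.random) -> Optional[float]:
    """Full-jitter exponential backoff; None means do not retry."""
    if tokens_streamed > 0:
        # Assumption: never replay a stream the client has already partially seen.
        return None
    delay = rng() * min(cap_s, base_s * (2 ** attempt))
    if now + delay >= deadline:
        return None  # the retry could not finish in time, so fail fast instead
    return delay


deadline = effective_deadline(now=0.0, client_timeout_s=30.0, tenant_timeout_s=10.0)
print(deadline, next_retry_delay(attempt=3, now=0.0, deadline=deadline,
                                 tokens_streamed=0, rng=lambda: 0.5))
```

Every hop should receive the same absolute deadline rather than its own fresh timeout, so admission, retries, and cancellation all agree on when the request stops being worth serving.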
Part 8 — Caching and Streaming
Critique the caching strategy and the streaming protocol; propose the high-value additions. Cover caching of weights and KV/prompt prefixes, and the response/token streaming protocol with its flush policy.
What This Part Should Cover
A three-layer cache story: weights/tokenizer (NVMe LRU + integrity check), prefix/KV cache (HBM→NVMe spill), and a narrowly-scoped response cache (temperature-0, short TTL, per-tenant).
A streaming protocol (SSE / HTTP/2) with prompt flushing, heartbeats/keep-alives, backpressure-aware flushing, and a terminal finish-reason event (stop / length / content-filter / cancel).
Streaming safety moderation as part of the protocol (see Part 9), not bolted on afterward.
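The wire format for such a stream can be sketched with Server-Sent Events framing. The event names (`token`, `done`) and payload fields are hypothetical; what the SSE format itself defines is the `event:` / `data:` fields, the blank line that terminates an event, and comment lines starting with a colon, which serve as keep-alives.

```python
import json
from typing import Iterable, Iterator

FINISH_REASONS = {"stop", "length", "content_filter", "cancel"}


def sse_event(event: str, data: dict) -> str:
    """Frame one SSE event; the blank line marks the end of the event."""
    payload = json.dumps(data, separators=(",", ":"))
    return f"event: {event}\ndata: {payload}\n\n"


def heartbeat() -> str:
    """SSE comment line: ignored by clients, keeps idle connections open."""
    return ": keep-alive\n\n"


def stream(tokens: Iterable[str], finish_reason: str) -> Iterator[str]:
    """Yield one event per token, then exactly one terminal event."""
    if finish_reason not in FINISH_REASONS:
        raise ValueError(f"unknown finish reason: {finish_reason}")
    for index, text in enumerate(tokens):
        yield sse_event("token", {"index": index, "text": text})
    yield sse_event("done", {"finish_reason": finish_reason})


for frame in stream(["Hel", "lo"], "stop"):
    print(frame, end="")
```

The terminal event is what lets a client tell a completed stream from a dropped connection, and it is also the natural place to measure the stream-completion SLI.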
Part 9 — Monitoring, Privacy, Safety, Audit, and Cost Controls
Critique the observability and the privacy/safety/cost posture; propose improvements. Define SLIs, dashboards, and burn-rate alerts, plus data retention, encryption, safety filtering, audit logging, and cost budgets.
What This Part Should Cover
SLIs (availability, TTFT/ITL percentiles, queue wait, stream-completion, KV hit rate, OOM) plus cost SLIs (tokens/s/GPU, $/1k tokens, idle-GPU minutes), sliced by tenant/model/zone with multi-burn-rate alerts.
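A multi-burn-rate alert reduces to a small calculation. Burn rate is the observed error ratio divided by the error ratio the SLO allows; paging only when both a long and a short window exceed the threshold avoids alerting on incidents that have already recovered. The 14.4 threshold is an illustrative choice: it corresponds to consuming 2% of a 30-day budget in one hour (0.02 × 720 hours). Function names and window choices are assumptions.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' the budget is being spent."""
    return error_ratio / (1.0 - slo)


def should_page(long_window_error_ratio: float, short_window_error_ratio: float,
                slo: float, threshold: float = 14.4) -> bool:
    """Page only if both the long (e.g. 1 h) and short (e.g. 5 min) windows burn fast."""
    return (burn_rate(long_window_error_ratio, slo) >= threshold
            and burn_rate(short_window_error_ratio, slo) >= threshold)


# 2% errors over the last hour and 3% over the last five minutes, against 99.9%.
print(should_page(0.02, 0.03, slo=0.999))
```

The same boolean can gate automated rollback of a canary, so alerting and rollback decisions never disagree about whether the service is unhealthy.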
Part 10 — High-Level Architecture and Key Trade-offs
Pull it together: present a logical architecture for the improved design and call out the load-bearing trade-offs.
What This Part Should Cover
A logical architecture (gateway → router → admission → inference runtime → GPU nodes) with the control-plane / data-plane separation and the cancellation path made explicit.
The key trade-offs discussed with a chosen position: latency vs. throughput (batch window), isolation vs. utilization (MIG vs. packed), disaggregated vs. colocated prefill/decode, rollout speed vs. safety, cost vs. quality.
What a Strong Answer Covers
Across all parts, a strong review demonstrates these cross-cutting qualities (beyond the per-Part dimensions above):
A consistent review posture: critique → concrete improvement, leading with the few changes that alter the architecture rather than cosmetic nits.
Reasoning anchored in LLM serving physics throughout (prefill vs. decode, KV-cache memory as the real concurrency bound, continuous batching, tokens/sec as the planning unit), applied consistently, not just in the capacity Part.
Intellectual honesty about numbers: SLO targets and capacity figures are presented as calibrate-from-benchmarks, with the method and formulas spelled out rather than invented constants.
A coherent thread from SLOs → capacity → autoscaling → overload → rollback → observability: each area's choices reinforce the others (e.g., the same burn-rate alarm drives both alerting and auto-rollback; the same deadline propagates through admission, retries, and cancellation).
Treating safety, privacy, and cost as v1 requirements, not deferrable extras, as is appropriate for a safety-focused inference provider.
Follow-up Questions
Walk through what happens, second by second, when an entire AZ fails at peak: how do admission, autoscaling, warm-pool promotion, and the $z/(z-1)$ provisioning interact to keep you within SLO?
A canary looks healthy on TTFT and error rate but its $/1k-token cost is 2× the incumbent. Should it pass the gate? How do you encode "correct but too expensive" as a rollout regression?
Would you disaggregate prefill and decode into separate pools here? Walk through the throughput win versus the KV-transfer complexity and the extra failure surface, and the traffic profile under which it pays off.
How do you load-test this realistically so the numbers transfer to production: what must the synthetic prompt/output length distribution capture, and which failure drills (weight-cache-miss storm, OOM, router misconfig) would you run before full rollout?