System Design Review: Machine-Learning Inference API (Distributed Systems Focus)
Background
You are reviewing a teammate’s design document for a production machine-learning inference API that serves text-generation models (e.g., chat/completions) with token streaming. The service is multi-tenant and must run across multiple availability zones with GPUs/accelerators.
Assume typical LLM workloads (prompt prefill + token-by-token decode), dynamic batching, and a mix of small and large model SKUs. The system must support safe model rollouts, strong SLOs, and cost controls.
What to Deliver
Critique the design and propose concrete improvements addressing the following areas:
- Product and SLOs
  - Clarify product scope and APIs (sync vs. streaming, embeddings vs. generations).
  - Define latency SLOs (e.g., time-to-first-token, per-token latency) and availability SLOs with an explicit error budget; see the worked example below.
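For calibration, a 99.9% monthly availability target leaves roughly 43 minutes of error budget. A minimal sketch of the arithmetic; all target numbers are assumptions for illustration, not values from the design doc:

```python
# Illustrative SLO targets and error-budget arithmetic. The numbers are
# assumptions for the example, not values from the design under review.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200

slo = {
    "availability": 0.999,       # successful requests / total requests
    "ttft_p95_ms": 500,          # time-to-first-token, 95th percentile
    "inter_token_p95_ms": 60,    # per-token decode latency, 95th percentile
}

error_budget_minutes = (1 - slo["availability"]) * MINUTES_PER_MONTH
print(f"Monthly error budget: {error_budget_minutes:.1f} minutes")  # 43.2
```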
- Throughput and Capacity Planning
  - Estimate QPS and token throughput given model characteristics (prefill/decode tokens/s per GPU) and average request sizes; a back-of-the-envelope model follows this list.
  - Size headroom, concurrency, and regional capacity.
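A capacity estimate can start from per-GPU prefill and decode throughput. Every figure below is an assumed placeholder; in practice, benchmark real values per model SKU and batch size:

```python
# Back-of-the-envelope capacity sizing. Every throughput number here is an
# assumed placeholder; benchmark real values per model SKU and batch size.
prefill_tokens_per_s_per_gpu = 20_000   # prompt ingestion rate
decode_tokens_per_s_per_gpu = 1_500     # aggregate across the dynamic batch
avg_prompt_tokens = 800
avg_output_tokens = 300

# GPU-seconds consumed by one average request.
secs_per_request = (avg_prompt_tokens / prefill_tokens_per_s_per_gpu
                    + avg_output_tokens / decode_tokens_per_s_per_gpu)

target_qps = 50
utilization_ceiling = 0.6               # headroom for bursts and AZ loss

gpus_needed = target_qps * secs_per_request / utilization_ceiling
print(f"{secs_per_request:.2f} GPU-s/request -> {gpus_needed:.0f} GPUs/region")
```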
- Autoscaling, Batching, and Accelerator Scheduling
  - Propose request-driven autoscaling signals.
  - Describe dynamic batching windows and batching policies; see the sketch after this list.
  - Plan GPU/accelerator scheduling (MIG, packing, preemption) and warm pools.
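One common batching policy flushes when either a size cap or a time window is reached, trading a small queueing delay for much higher GPU utilization. A minimal asyncio sketch; the cap and window values are assumptions:

```python
import asyncio

MAX_BATCH = 8     # flush once this many requests are queued (assumed cap)
WINDOW_MS = 10    # or once the oldest request has waited this long

async def batcher(queue: asyncio.Queue, run_batch) -> None:
    """Collect requests until MAX_BATCH or WINDOW_MS, then dispatch."""
    loop = asyncio.get_running_loop()
    while True:
        batch = [await queue.get()]            # block for the first request
        deadline = loop.time() + WINDOW_MS / 1000
        while len(batch) < MAX_BATCH:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), remaining))
            except asyncio.TimeoutError:
                break
        await run_batch(batch)                 # hand the batch to a GPU worker
```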
- Model Loading, Versioning, and Rollback
  - Immutable model versions and a model registry.
  - Preload/warm mechanisms, safe rolling updates, canaries, and fast rollback; an illustrative registry shape follows.
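If versions are immutable and content-addressed, rollback reduces to flipping an alias pointer. The registry shape below is hypothetical (field names invented for illustration, not an existing API), but captures that property:

```python
# Hypothetical registry entry; the field names are invented for illustration.
# Artifacts are immutable; aliases are mutable pointers to them.
registry = {
    "chat-large": {
        "stable": "chat-large@sha256:9f2c...",  # immutable, content-addressed
        "canary": "chat-large@sha256:4ab1...",
        "canary_weight": 0.05,                  # 5% of traffic to the canary
    }
}

def resolve(model: str, in_canary: bool) -> str:
    """Rollback = repoint 'stable'; no artifact is ever mutated in place."""
    entry = registry[model]
    return entry["canary"] if in_canary else entry["stable"]
```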
- Multi-tenant Isolation and Rate Limiting
  - Per-tenant quotas, concurrency caps, and weighted-fair queuing; a token-bucket sketch follows this list.
  - Isolation strategies across CPU/GPU/memory.
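A per-tenant token bucket is a simple baseline for quota enforcement; weighted-fair queuing can then arbitrate among admitted tenants. The rates below are assumptions:

```python
import time

class TokenBucket:
    """Per-tenant limiter; rate and burst values are illustrative."""

    def __init__(self, rate_per_s: float, burst: float):
        self.rate, self.capacity = rate_per_s, burst
        self.tokens, self.last = burst, time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at burst capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

buckets = {"tenant-a": TokenBucket(rate_per_s=10, burst=20)}
```

Charging `cost` in estimated tokens rather than requests aligns the limiter with actual GPU consumption, since a 10-token and a 10,000-token request are far from equivalent.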
- Overload and Resilience
  - Backpressure, admission control, bounded queues, and circuit breakers.
  - Queue TTLs, shedding policy, and graceful degradation; see the sketch below.
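Bounding the queue and checking a TTL at dequeue time avoids spending GPU cycles on requests whose callers have already given up. A minimal sketch with assumed limits:

```python
import collections
import time

MAX_QUEUE = 64        # bounded depth (assumed); reject beyond this
QUEUE_TTL_S = 5.0     # entries older than this are dropped at dequeue

queue = collections.deque()

def admit(request) -> bool:
    """Admission control: fail fast (e.g., HTTP 429) instead of queueing
    unboundedly when the system is saturated."""
    if len(queue) >= MAX_QUEUE:
        return False
    queue.append((time.monotonic(), request))
    return True

def next_request():
    """Shed stale work: skip entries whose wait already exceeds the TTL."""
    while queue:
        enqueued_at, request = queue.popleft()
        if time.monotonic() - enqueued_at <= QUEUE_TTL_S:
            return request
    return None
```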
- Idempotency, Retries, and Timeouts
  - Idempotency keys and duplicate suppression.
  - Retry policies with deadlines and jitter; cancellation propagation (backoff sketch below).
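Retries are only safe when paired with an idempotency key so duplicates are suppressed server-side. A backoff helper with capped exponential delay, full jitter, and a hard deadline; the defaults are illustrative:

```python
import random
import time

def retry_with_deadline(op, deadline_s: float,
                        base_s: float = 0.1, cap_s: float = 2.0):
    """Retry op() with capped exponential backoff and full jitter, never
    sleeping past the caller's deadline. Defaults are illustrative."""
    start, attempt = time.monotonic(), 0
    while True:
        try:
            return op()
        except Exception:
            attempt += 1
            sleep = random.uniform(0, min(cap_s, base_s * 2 ** attempt))
            if time.monotonic() + sleep - start > deadline_s:
                raise                    # budget exhausted: propagate failure
            time.sleep(sleep)
```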
- Cold-start Mitigation
  - Weight caching, prewarming, snapshotting/restore, and warm pools; see the prewarm sketch below.
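Prewarming means paying the expensive first-request costs (weight load, kernel/graph compilation, allocator growth) before an instance is marked ready. A hedged sketch; `engine` and both of its methods are placeholders for whatever serving runtime is actually in use:

```python
def prewarm(engine, model_id: str) -> None:
    """Run at startup, before the readiness probe passes. `engine` and its
    methods are placeholders, not a real runtime's API."""
    engine.load_weights(model_id)                    # hit the local weight cache
    engine.generate(prompt="warmup", max_tokens=1)   # trigger JIT/graph capture
```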
- Caching and Streaming
  - Caching for model weights and KV/prompt prefixes.
  - Response/token streaming protocol and flush policy; an SSE framing sketch follows.
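If the streaming protocol is server-sent events (one option; WebSockets or gRPC streams are alternatives), framing and flush policy can be as simple as one event per token or per small token group. The `[DONE]` sentinel is a common convention, not a requirement:

```python
# Minimal SSE framing for token streaming. Flushing every token minimizes
# perceived latency; coalescing a few tokens per event reduces overhead.
def sse_stream(token_iter):
    for token in token_iter:
        yield f"data: {token}\n\n"    # one SSE event per token
    yield "data: [DONE]\n\n"          # explicit end-of-stream sentinel
```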
- Traffic Shaping and Rollouts
  - Canary, A/B, and shadow traffic; a deterministic split sketch follows this list.
  - Safe rollback plans and blast-radius limits.
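Hash-based splitting keeps a given session sticky to one arm of the rollout, which makes canary/stable comparisons cleaner than random per-request assignment. A small sketch; the 5% default is an assumed starting weight:

```python
import hashlib

def in_canary(session_id: str, canary_pct: float = 5.0) -> bool:
    """Deterministic, sticky traffic split: the same session id always
    lands on the same side of the rollout."""
    h = int(hashlib.sha256(session_id.encode()).hexdigest(), 16)
    return (h % 10_000) < canary_pct * 100
```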
- Monitoring, Alerting, and Error Budgets
  - SLIs, dashboards, burn-rate alerts, and per-tenant and per-model views; burn-rate arithmetic below.
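Burn rate is the ratio of the observed error rate to the rate that would exactly exhaust the budget over the SLO window. The multi-window thresholds below follow the widely used SRE-workbook pattern; the exact numbers are a policy choice, shown here as assumptions:

```python
SLO = 0.999
BUDGET = 1 - SLO      # fraction of requests allowed to fail

def burn_rate(error_ratio: float) -> float:
    """1.0 = burning exactly the budget; 14.4 = budget gone in ~2 days
    of a 30-day window."""
    return error_ratio / BUDGET

def should_page(err_1h: float, err_5m: float) -> bool:
    """Fast-burn page: a long and a short window must both agree, so a
    brief blip or an already-recovered incident does not page anyone."""
    return burn_rate(err_1h) >= 14.4 and burn_rate(err_5m) >= 14.4
```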
- Privacy, Safety, Audit, and Cost Controls
  - Data retention, encryption, safety filters, and audit logs.
  - Cost budgets, spend alerts, and efficiency levers; cost-per-token arithmetic below.
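A useful cost lens is dollars per million tokens rather than dollars per instance-hour, since batching efficiency shows up directly in that number. The figures here are assumed for illustration:

```python
# Cost-per-token arithmetic; dollar and throughput figures are assumptions.
gpu_cost_per_hour = 4.00
effective_tokens_per_s = 1_500        # batched decode throughput per GPU

tokens_per_hour = effective_tokens_per_s * 3600
cost_per_million = gpu_cost_per_hour / tokens_per_hour * 1_000_000
print(f"${cost_per_million:.2f} per 1M output tokens")   # ~$0.74
```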
- High-level Architecture and Trade-offs
  - Provide a logical architecture and discuss key trade-offs (latency vs. throughput, isolation vs. utilization, complexity vs. operability).