PracHub

Review an inference API design for scale

Last updated: Jun 25, 2026

Quick Overview

This question tests a candidate's ability to critically evaluate ML system designs for production-scale inference APIs, covering multi-tenancy, GPU resource constraints, and streaming token delivery. It assesses architectural reasoning across reliability, latency trade-offs, and capacity planning in the ML Systems domain.

Company: Anthropic

Role: Software Engineer

Category: ML System Design

Difficulty: hard

Interview Round: Onsite

You are reviewing another engineer’s design doc for a machine-learning inference API. Critique and improve it with a focus on distributed systems: clarify product and latency/availability SLOs; estimate throughput and capacity; propose autoscaling, batching, and GPU/accelerator scheduling; handle model loading, versioning, and rollback; design multi-tenant isolation and rate limiting; prevent overload with backpressure, queues, and circuit breakers; define idempotency, retries, and timeouts; mitigate cold starts; specify caching strategy (weights, tokens) and token streaming; plan traffic shaping (canary, A/B), shadowing, and safe rollback; define monitoring, alerting, and error budgets; address privacy, safety filters, audit logs, and cost controls. Provide a high-level architecture and call out key trade-offs.

Sep 6, 2025, 12:00 AM

System Design Review: A Machine-Learning Inference API at Scale

Background

You are reviewing a teammate's design document for a production machine-learning inference API that serves text-generation models (e.g., chat/completions) with token streaming. The service is multi-tenant and must run across multiple availability zones (AZs) on GPUs/accelerators.

Assume typical LLM workloads — a prompt prefill phase followed by token-by-token decode — with dynamic batching and a mix of small ("fast") and large ("quality") model SKUs. The system must support safe model rollouts, strong SLOs, and cost controls.

This is a design-review exercise, not a greenfield design. Your job is to critique the document and propose concrete improvements: separate what is right from what is missing or wrong, and push the design to a production bar. Work through the Parts below. Lead with the few issues that change the architecture; do not nitpick formatting.

Constraints & Assumptions

  • Multi-tenant, multi-AZ, GPU/accelerator-backed; mix of small and large model SKUs.
  • LLM serving physics apply: requests have a prefill phase (process the prompt) and a decode phase (generate output token by token); these consume different resources.
  • Latency is perceived as two numbers, not one: time-to-first-token (TTFT) and inter-token latency (ITL).
  • Concurrency is bounded by KV-cache memory on each accelerator, which grows with batch size and context length.
  • Treat absolute SLO numbers, per-GPU token rates, and capacity figures as quantities you would calibrate from load tests and benchmarks, not memorized constants. State the shape of the reasoning and the formulas; pick illustrative numbers only to demonstrate the method.
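
The KV-cache bound on concurrency can be made concrete with a back-of-the-envelope calculation. The sketch below assumes a standard transformer layout in which each token stores one key and one value vector per layer. The model shape, the free-HBM figure, and the function names are hypothetical and chosen only to show the method, not taken from any real SKU.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies: a key and a value vector per layer."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def max_resident_sequences(hbm_free_bytes: int, tokens_per_seq: int, per_token_bytes: int) -> int:
    """Upper bound on sequences held in HBM at once, ignoring fragmentation."""
    return hbm_free_bytes // (tokens_per_seq * per_token_bytes)


# Hypothetical model shape: 32 layers, 8 KV heads, head_dim 128, fp16 (2 bytes).
per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)

# Hypothetical budget: 40 GiB of HBM left after weights, 4,096 tokens per sequence.
slots = max_resident_sequences(40 * 1024**3, 4096, per_token)

print(per_token, slots)
```

Under these assumed numbers each token costs 128 KiB and the device holds at most 80 full-length sequences, which is why the tail of the length distribution, not the mean, sets the concurrency ceiling.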

Clarifying Questions to Ask

Scope the whole review before critiquing any single area:

  • What is the traffic profile — peak QPS, the distribution of prompt-token and output-token lengths, and the small-vs-large SKU mix? (Tail length, not the mean, drives KV memory and tail latency.)
  • What SLO tier(s) exist (e.g., interactive low-latency vs. async batch), and is there an explicit availability target and error budget already?
  • Which endpoints are in scope — streaming generations only, or also non-streaming, embeddings, and async batch?
  • What are the isolation and compliance requirements between tenants (data residency, retention, "no training on customer data" defaults)?
  • What hardware is assumed (accelerator type, HBM per device), and is MIG / partitioning available?
  • How fast must rollback of a bad model version be, and what is the acceptable blast radius of a rollout?

Part 1 — Product Scope, APIs, and SLOs

Critique how the doc defines its surface area and its latency/availability SLOs, then propose improvements. Pin down which endpoints exist and their distinct latency profiles. Define latency SLOs appropriate to streaming generation, and an availability SLO with an explicit error budget.

What This Part Should Cover

  • Splitting latency into TTFT and ITL (per-token) SLOs, set per model tier, rather than one blended P99.
  • A correct availability definition (success excluding intentional 4xx) with a concrete error budget (e.g., 99.9% ⇒ ~43 min/month) tied to rollout gates and alerts.
  • Distinguishing endpoint profiles (streaming generations vs. embeddings vs. async batch) and a streaming-specific SLI such as stream-completion rate.
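
The error-budget arithmetic and the availability definition above can be sketched in a few lines. The set of excluded status codes is an assumption for illustration. In particular, whether a 429 counts against availability is a policy decision the design doc would need to state, and here it is counted as a valid, failed request.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a given SLO."""
    return (1.0 - slo) * window_days * 24 * 60


def availability(status_counts: dict[int, int],
                 excluded: frozenset[int] = frozenset({400, 401, 403, 404, 422})) -> float:
    """Good requests / valid requests; client-caused 4xx leave the denominator."""
    valid = sum(n for code, n in status_counts.items() if code not in excluded)
    good = sum(n for code, n in status_counts.items() if 200 <= code < 300)
    return good / valid if valid else 1.0


budget = error_budget_minutes(0.999)  # roughly 43.2 minutes per 30 days
sli = availability({200: 9980, 400: 100, 429: 10, 500: 10})
print(round(budget, 1), sli)
```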

Part 2 — Throughput and Capacity Planning

The doc estimates capacity from per-request latency and QPS. Critique that approach and produce a correct capacity model. Estimate the GPUs needed given model characteristics (prefill and decode tokens/s per GPU) and average request sizes, then add headroom and regional redundancy.

What This Part Should Cover

  • A tokens/sec capacity model that sizes prefill and decode independently (e.g., $\text{RPS}_{\text{prefill}} = T_{\text{prefill}} / L$, $\text{RPS}_{\text{decode}} = T_{\text{decode}} / O$, where $T$ is the per-GPU token rate for that phase, $L$ the prompt tokens per request, and $O$ the output tokens per request) and takes the binding constraint.
  • Distinguishing per-GPU batch utilization from fleet utilization (headroom for bursts/jitter/autoscale lag) without double-counting, plus zone redundancy (survive losing one of $z$ AZs).
  • Recognizing the KV-cache memory ceiling on concurrency and naming a mitigation (paged/block KV).
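
The capacity model in this list can be worked through as follows. This is a minimal sketch under stated assumptions: the per-GPU token rates are taken as measured under the production batching configuration, fleet utilization is applied exactly once as headroom, and every number in the example call is invented to demonstrate the method.

```python
import math


def gpus_required(qps: float, prompt_tokens: float, output_tokens: float,
                  prefill_tps: float, decode_tps: float,
                  fleet_utilization: float, zones: int) -> tuple[int, int]:
    """Return (gpus_per_zone, total_gpus) that survive the loss of one zone."""
    prefill_gpus = qps * prompt_tokens / prefill_tps  # QPS / RPS_prefill
    decode_gpus = qps * output_tokens / decode_tps    # QPS / RPS_decode
    base = max(prefill_gpus, decode_gpus)             # binding constraint
    with_headroom = base / fleet_utilization          # bursts, jitter, autoscale lag
    # The remaining (zones - 1) zones must carry the full load on their own.
    per_zone = math.ceil(with_headroom / (zones - 1))
    return per_zone, per_zone * zones


# Illustrative inputs only: 100 QPS, 1,000 prompt tokens, 250 output tokens,
# 20,000 prefill tok/s and 2,500 decode tok/s per GPU, 70% fleet utilization, 3 AZs.
per_zone, total = gpus_required(100, 1000, 250, 20_000, 2_500, 0.7, 3)
print(per_zone, total)
```

With these inputs decode is the binding phase (10 GPUs before headroom), headroom raises that to about 14.3, and spreading it so that any two of three zones suffice gives 8 per zone and 24 in total, which matches the z/(z-1) over-provisioning factor after rounding up per zone.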

Part 3 — Autoscaling, Batching, and Accelerator Scheduling

Critique the autoscaling signal, the batching policy, and the accelerator scheduling plan; propose improvements for each. Define the scaling signals, the dynamic batching window/policy, and how GPUs are scheduled (partitioning, packing, preemption, warm pools).

What This Part Should Cover

  • Replacing GPU-utilization scaling with queue-/token-aware signals, with warm-pool promotion and asymmetric (fast-out, slow-in) cooldowns.
  • Continuous batching with an adaptive micro-batch admission window and length-aware grouping; separating prefill from decode scheduling (chunked prefill).
  • Accelerator scheduling: MIG vs. full-GPU trade-off by tier, bin-pack by KV footprint (not request count), and preemption of low-priority work.
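
A queue-aware scaling signal with asymmetric cooldowns might look like the sketch below. The class name, the target queue wait, the cooldown values, and the proportional rule are all hypothetical. A real controller would also account for warm-pool size and token backlog.

```python
import math
from dataclasses import dataclass


@dataclass
class QueueAwareScaler:
    target_queue_wait_s: float = 0.5     # assumed p95 queue-wait target
    scale_out_cooldown_s: float = 30.0   # fast out
    scale_in_cooldown_s: float = 600.0   # slow in
    last_change_at: float = float("-inf")

    def desired_replicas(self, now: float, current: int, p95_queue_wait_s: float) -> int:
        """Scale proportionally on queue wait; scale in one replica at a time."""
        ratio = p95_queue_wait_s / self.target_queue_wait_s
        proposed = max(1, math.ceil(current * ratio))
        since_change = now - self.last_change_at
        if proposed > current and since_change >= self.scale_out_cooldown_s:
            self.last_change_at = now
            return proposed
        if proposed < current and since_change >= self.scale_in_cooldown_s:
            self.last_change_at = now
            return current - 1
        return current


scaler = QueueAwareScaler()
after_spike = scaler.desired_replicas(now=0.0, current=10, p95_queue_wait_s=1.0)
too_soon = scaler.desired_replicas(now=60.0, current=after_spike, p95_queue_wait_s=0.1)
later = scaler.desired_replicas(now=700.0, current=after_spike, p95_queue_wait_s=0.1)
print(after_spike, too_soon, later)
```

The point of the asymmetry is that scale-out reacts within seconds to queue growth, while scale-in waits out a long cooldown and steps down slowly so a brief lull does not release capacity that a cold start would have to win back.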

Part 4 — Model Loading, Versioning, Rollout, and Rollback

Critique how the doc handles model versions, deployment, traffic shaping, and rollback; propose improvements. Cover an immutable model registry, preload/warm mechanisms, safe rolling updates with canary/A-B/shadow traffic, blast-radius limits, and fast rollback.

What This Part Should Cover

  • Immutable, content-addressed versions whose manifest pins weights and tokenizer, sampling defaults, and safety policy.
  • Canary/A-B/shadow with sticky assignment, gating on TTFT/error-rate and cost ($/1k tokens), with blast-radius limits.
  • Sub-minute rollback via a warm prior version wired to the burn-rate alarm; shadow traffic isolated so it can't steal production capacity or reach users.
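
One way to encode a rollout gate that treats cost as a regression alongside latency and errors is sketched here. The metric names and every threshold are hypothetical placeholders that would be calibrated per model tier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RolloutMetrics:
    ttft_p95_ms: float
    error_rate: float
    cost_per_1k_tokens: float


def canary_failures(baseline: RolloutMetrics, canary: RolloutMetrics,
                    max_ttft_ratio: float = 1.10,
                    max_error_delta: float = 0.001,
                    max_cost_ratio: float = 1.15) -> list[str]:
    """Return the names of the gates the canary fails; an empty list means pass."""
    failures = []
    if canary.ttft_p95_ms > baseline.ttft_p95_ms * max_ttft_ratio:
        failures.append("ttft")
    if canary.error_rate > baseline.error_rate + max_error_delta:
        failures.append("error_rate")
    if canary.cost_per_1k_tokens > baseline.cost_per_1k_tokens * max_cost_ratio:
        failures.append("cost")
    return failures


baseline = RolloutMetrics(ttft_p95_ms=400.0, error_rate=0.002, cost_per_1k_tokens=1.0)
canary = RolloutMetrics(ttft_p95_ms=390.0, error_rate=0.002, cost_per_1k_tokens=2.0)
print(canary_failures(baseline, canary))
```

In this example the canary is healthy on TTFT and error rate but fails on cost alone, so the gate blocks promotion.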

Part 5 — Multi-Tenant Isolation and Rate Limiting

Critique the multi-tenant story and improve it. Define per-tenant quotas, concurrency caps, fair queuing, and isolation across compute, memory, and network.

What This Part Should Cover

  • Token-based quotas (TPM/RPM, concurrency caps, max context/output) enforced at the edge and again at admission (so retries can't bypass).
  • Weighted-fair queuing with priority classes by plan.
  • Isolation tiers across compute (dedicated/MIG vs. packed), memory (per-tenant KV budget with degrade-on-overflow), and network (per-stream egress fairness).
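
Token-based quotas are commonly implemented as a token bucket charged in model tokens rather than requests. The sketch below is one such implementation with a hypothetical class name and illustrative rates. The caller passes the clock in explicitly so the behavior is deterministic.

```python
class TokenBucket:
    """Per-tenant quota charged in LLM tokens (a TPM-style limit)."""

    def __init__(self, rate_per_s: float, burst: float, now: float = 0.0) -> None:
        self.rate = rate_per_s
        self.capacity = burst
        self.level = burst
        self.updated = now

    def try_consume(self, cost: float, now: float) -> bool:
        """Refill for elapsed time, then admit only if the full cost fits."""
        elapsed = now - self.updated
        self.level = min(self.capacity, self.level + elapsed * self.rate)
        self.updated = now
        if cost <= self.level:
            self.level -= cost
            return True
        return False


bucket = TokenBucket(rate_per_s=100, burst=1000)
first = bucket.try_consume(800, now=0.0)    # fits within the burst
second = bucket.try_consume(500, now=1.0)   # only 300 available, rejected
third = bucket.try_consume(500, now=3.0)    # refilled to 500, admitted
print(first, second, third)
```

Running the same check at the edge and again at admission, as the list above suggests, means a retry that bypasses the gateway still meets the quota.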

Part 6 — Overload Protection and Resilience

The doc has weak overload handling. Critique it and design "shed early, shed cheaply." Cover admission control, bounded queues with TTLs, backpressure, circuit breakers, graceful degradation, and a load-shedding priority order.

What This Part Should Cover

  • Bounded per-tenant queues + queue TTL with early 429 rejection and deadline propagation.
  • Circuit breakers per model/zone with failover, and graceful-degradation knobs (cap output, drop to a smaller tier) that degrade quality before availability.
  • A load-shedding priority order (shed batch/over-quota first; protect within-SLO paid streams last).
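
The "shed early, shed cheaply" idea can be illustrated with a bounded queue that rejects at enqueue time and drops expired work at dequeue time. This is a sketch with hypothetical names. A production version would be per-tenant and would carry the request deadline downstream.

```python
from collections import deque
from typing import Optional


class BoundedTtlQueue:
    def __init__(self, max_depth: int, ttl_s: float) -> None:
        self.max_depth = max_depth
        self.ttl_s = ttl_s
        self.items: deque[tuple[str, float]] = deque()
        self.expired: list[str] = []

    def offer(self, request_id: str, now: float) -> bool:
        """Reject immediately when full; the caller maps False to an early 429."""
        if len(self.items) >= self.max_depth:
            return False
        self.items.append((request_id, now + self.ttl_s))
        return True

    def poll(self, now: float) -> Optional[str]:
        """Return the next request still inside its TTL, discarding stale ones."""
        while self.items:
            request_id, deadline = self.items.popleft()
            if deadline > now:
                return request_id
            self.expired.append(request_id)  # never spend GPU time on dead work
        return None


queue = BoundedTtlQueue(max_depth=2, ttl_s=1.0)
admitted = [queue.offer("a", 0.0), queue.offer("b", 0.5), queue.offer("c", 0.6)]
served = queue.poll(now=1.2)
print(admitted, served, queue.expired)
```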

Part 7 — Idempotency, Retries, Timeouts, and Cold-Start Mitigation

Critique and improve two coupled areas: (a) idempotency, retry, timeout, and cancellation semantics — especially for streaming; and (b) cold-start mitigation for loading large models.

What This Part Should Cover

  • Idempotency keys (duplicate suppression with a TTL) and a retry policy bounded by failure timing, with backoff + jitter and deadline propagation using a min(client, tenant) timeout.
  • Cancellation that reclaims the KV slot end-to-end on client disconnect.
  • Cold-start mitigation: warm pools (10–20% buffer), weight-cache hierarchy with integrity checks, and snapshot/restore tied to the autoscaler (scale-out = promotion from warm).
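
The retry and timeout semantics above can be sketched as three small functions. One assumption is labeled here and in the code: "bounded by failure timing" is read as "retry transparently only if the failure happened before the first token was streamed", since a later retry could duplicate output. The function names and parameters are hypothetical.

```python
import random


def effective_timeout_s(client_timeout_s: float, tenant_timeout_s: float) -> float:
    """The propagated deadline is the tighter of the client and tenant limits."""
    return min(client_timeout_s, tenant_timeout_s)


def backoff_delays(base_s: float, cap_s: float, attempts: int,
                   rng: random.Random) -> list[float]:
    """Full-jitter exponential backoff: uniform(0, min(cap, base * 2**n))."""
    return [rng.uniform(0.0, min(cap_s, base_s * 2 ** n)) for n in range(attempts)]


def should_retry(first_token_sent: bool, elapsed_s: float,
                 next_delay_s: float, deadline_s: float) -> bool:
    """Assumption: no transparent retry after streaming has begun or past the deadline."""
    return (not first_token_sent) and (elapsed_s + next_delay_s < deadline_s)


deadline = effective_timeout_s(client_timeout_s=30.0, tenant_timeout_s=20.0)
delays = backoff_delays(base_s=0.1, cap_s=2.0, attempts=6, rng=random.Random(0))
print(deadline, [round(d, 3) for d in delays])
```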

Part 8 — Caching and Streaming

Critique the caching strategy and the streaming protocol; propose the high-value additions. Cover caching of weights and KV/prompt prefixes, and the response/token streaming protocol with its flush policy.

What This Part Should Cover

  • A three-layer cache story: weights/tokenizer (NVMe LRU + integrity check), prefix/KV cache (HBM→NVMe spill), and a narrowly-scoped response cache (temperature-0, short TTL, per-tenant).
  • A streaming protocol (SSE / HTTP/2) with prompt flushing, heartbeats/keep-alives, backpressure-aware flushing, and a terminal finish-reason event (stop / length / content-filter / cancel).
  • Streaming safety moderation as part of the protocol (see Part 9), not bolted on afterward.
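
The wire format for such a stream can be sketched with Server-Sent Events framing, where each event ends with a blank line and a line starting with a colon is a comment that clients ignore (useful as a heartbeat). The event names "token" and "done" and the payload fields are hypothetical; the finish reasons are the four listed above.

```python
import json
from typing import Iterable, Iterator

FINISH_REASONS = {"stop", "length", "content-filter", "cancel"}


def sse_event(event: str, data: dict) -> str:
    """One SSE frame: an event name, a single-line JSON payload, a blank line."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def sse_heartbeat() -> str:
    """An SSE comment line, ignored by clients, that keeps idle connections open."""
    return ": keep-alive\n\n"


def stream_frames(tokens: Iterable[str], finish_reason: str) -> Iterator[str]:
    """Yield one frame per token, then exactly one terminal frame."""
    if finish_reason not in FINISH_REASONS:
        raise ValueError(f"unknown finish reason: {finish_reason}")
    for index, text in enumerate(tokens):
        yield sse_event("token", {"index": index, "text": text})
    yield sse_event("done", {"finish_reason": finish_reason})


frames = list(stream_frames(["Hi", "!"], "stop"))
print("".join(frames), end="")
```

Ending every stream with an explicit terminal event lets a client tell a completed generation apart from a dropped connection, which is what makes a stream-completion SLI measurable.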

Part 9 — Monitoring, Privacy, Safety, Audit, and Cost Controls

Critique the observability and the privacy/safety/cost posture; propose improvements. Define SLIs, dashboards, and burn-rate alerts, plus data retention, encryption, safety filtering, audit logging, and cost budgets.

What This Part Should Cover

  • SLIs (availability, TTFT/ITL percentiles, queue wait, stream-completion, KV hit rate, OOM) plus cost SLIs (tokens/s/GPU, $/1k tokens, idle-GPU minutes), sliced by tenant/model/zone with multi-burn-rate alerts.
  • Privacy/security: TLS, KMS-backed encryption at rest, RBAC, configurable retention, default no-training-on-customer-data, log redaction, immutable audit logs.
  • Safety: input pre-filter + streaming post-filter with early-stop, per-tenant policy packs, logged interventions.
  • Cost controls: per-tenant budgets/spend alerts, enforced caps, default-to-smallest-sufficient-model routing, preemptible capacity for batch.
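
A multi-burn-rate alert compares how fast the error budget is being consumed over a long and a short window and pages only when both are hot. The sketch below assumes a 30-day budget and uses 14.4 as the threshold, a commonly cited value for a 1-hour/5-minute window pair because it corresponds to spending 2% of the monthly budget in one hour. The function names are hypothetical.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than sustainable the error budget is being spent."""
    return error_ratio / (1.0 - slo)


def should_page(long_window_error_ratio: float, short_window_error_ratio: float,
                slo: float, threshold: float = 14.4) -> bool:
    """Page only if both windows exceed the threshold, so stale spikes do not fire."""
    return (burn_rate(long_window_error_ratio, slo) >= threshold
            and burn_rate(short_window_error_ratio, slo) >= threshold)


ongoing = should_page(0.02, 0.02, slo=0.999)     # both windows burning at about 20x
recovered = should_page(0.02, 0.001, slo=0.999)  # short window is back to about 1x
print(ongoing, recovered)
```

The same signal can feed the auto-rollback trigger described in Part 4, so alerting and rollback agree on what counts as an incident.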

Part 10 — High-Level Architecture and Key Trade-offs

Pull it together: present a logical architecture for the improved design and call out the load-bearing trade-offs.

What This Part Should Cover

  • A logical architecture (gateway → router → admission → inference runtime → GPU nodes) with the control-plane / data-plane separation and the cancellation path made explicit.
  • The key trade-offs discussed with a chosen position: latency vs. throughput (batch window), isolation vs. utilization (MIG vs. packed), disaggregated vs. colocated prefill/decode, rollout speed vs. safety, cost vs. quality.

What a Strong Answer Covers

Across all parts, a strong review demonstrates these cross-cutting qualities (beyond the per-Part dimensions above):

  • A consistent review posture: critique → concrete improvement, leading with the few changes that alter the architecture rather than cosmetic nits.
  • Reasoning anchored in LLM serving physics throughout — prefill vs. decode, KV-cache memory as the real concurrency bound, continuous batching, tokens/sec as the planning unit — applied consistently, not just in the capacity Part.
  • Intellectual honesty about numbers: SLO targets and capacity figures are presented as calibrate-from-benchmarks, with the method and formulas spelled out rather than invented constants.
  • A coherent thread from SLOs → capacity → autoscaling → overload → rollback → observability: each area's choices reinforce the others (e.g., the same burn-rate alarm drives both alerting and auto-rollback; the same deadline propagates through admission, retries, and cancellation).
  • Treating safety, privacy, and cost as v1 requirements, not deferrable extras — appropriate for a safety-focused inference provider.

Follow-up Questions

  • Walk through what happens, second by second, when an entire AZ fails at peak: how do admission, autoscaling, warm-pool promotion, and the $\frac{z}{z-1}$ provisioning factor interact to keep you within SLO?
  • A canary looks healthy on TTFT and error rate but its $/1k-token cost is 2× the incumbent. Should it pass the gate? How do you encode "correct but too expensive" as a rollout regression?
  • Would you disaggregate prefill and decode into separate pools here? Walk through the throughput win versus the KV-transfer complexity and the extra failure surface, and the traffic profile under which it pays off.
  • How do you load-test this realistically so the numbers transfer to production — what must the synthetic prompt/output length distribution capture, and which failure drills (weight-cache-miss storm, OOM, router misconfig) would you run before full rollout?
