
Design a low-latency ML inference API

Last updated: Jun 21, 2026

Quick Overview

This question evaluates competency in ML system design for real-time, low-latency inference APIs, including multitenancy, SLO/SLI definition, feature retrieval, model serving and rollout strategies, observability, cost control, and security/compliance within the ML System Design category.

Company: Anthropic

Role: Software Engineer

Category: ML System Design

Difficulty: hard

Interview Round: Onsite

Design a low-latency ML inference API for real-time predictions. Specify target SLOs (p50/p95 latency, availability), request/response schema, authentication, rate limiting, and multitenancy. Propose an architecture covering load balancing, stateless API tier, feature retrieval, model serving (CPU/GPU), batching, quantization, caching, and autoscaling strategies. Explain model versioning, canary/rollbacks, online A/B, observability (metrics, tracing, drift, data-quality checks), cost controls, and fallback behavior during partial outages. Address security, PII handling, regionalization, and disaster recovery.

Sep 6, 2025, 12:00 AM

System Design: Low-Latency ML Inference API (Real-Time)

Context

You are designing an in-region, synchronous ML inference API that sits on the critical path of product surfaces (e.g., ranking, fraud checks, personalization) that require tight latency and high availability. The service must support multiple tenants, safe model rollouts, and strong observability, while controlling cost.

This is an open-ended design discussion. State your assumptions, propose concrete numerical targets, and walk through the design so that the latency budget provably adds up to the target. The interviewer is looking for your reasoning and trade-off analysis as much as the final architecture.

Address each of the seven Parts below. State explicit numerical targets and trade-offs where applicable, and call out any assumptions that materially shape the design.

Constraints & Assumptions

These are anchoring assumptions to scope the discussion; confirm or adjust them with the interviewer, but design against a concrete operating point rather than leaving everything open:

  • Workload: synchronous request/response; each call scores one candidate or a small list of candidates. No long-running or streaming generation.
  • Models: a mix of classical models (logistic regression, GBDT/XGBoost) servable on CPU and deep models (DNN/transformer) servable on GPU.
  • Traffic: roughly 2k-10k RPS per region at steady state, with occasional 2-3x correlated spikes; bursty and skewed by tenant.
  • Clients are in-region (geo-routed), so cross-region round-trip time is not part of the hot-path latency budget.
  • Features are mostly precomputed and read from an online feature store; a few may be derived at request time from the request payload.

Clarifying Questions to Ask

Before designing, scope the whole problem with the interviewer:

  • Latency contract: what p95/p99 do downstream callers actually require, and is the budget end-to-end (edge → response) or service-internal only?
  • Traffic shape: steady-state and peak RPS per region, the spike multiplier, and how skewed traffic is across tenants?
  • Model mix: what fraction of requests hit deep (GPU) models vs. classical (CPU) models? This drives GPU sizing directly.
  • Feature freshness: how stale can features be? Is the online store eventually consistent with the offline/streaming pipeline that populates it, and what is the acceptable freshness SLA?
  • Multitenancy: how many tenants, do any require hard isolation, and can tenants pin their own fine-tuned model versions?
  • Compliance footprint: which jurisdictions / data-sovereignty rules apply, and what PII (if any) is in the request payload?

Part 1 — Target SLOs

  • Propose p50 / p95 (and optionally p99) end-to-end latency targets and an availability target.
  • Define your SLIs (how each SLO is measured) and the error budget that follows.

What This Part Should Cover

  • Defensible, concrete numbers for p50/p95 (and ideally p99) latency and a stated availability target, with the percentiles tied back to what downstream callers need.
  • Disjoint SLIs where a slow-but-successful response dings only the latency budget and a degraded-but-correct response is treated as a separate signal, not an availability failure.
  • An error-budget policy derived from the availability target, with multi-window burn-rate alerting and a release-freeze rule when the budget is exhausted.
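The error-budget arithmetic behind these bullets can be sketched as follows. This is a minimal illustration, not a prescribed policy: the 99.9% target, the 30-day window, and the 14.4 burn-rate threshold are assumed example values, and `AvailabilitySLO`, `burn_rate`, and `should_page` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvailabilitySLO:
    target: float      # e.g. 0.999 means 99.9% of requests must succeed (assumed value)
    window_days: int   # rolling SLO window (assumed value)

    @property
    def error_budget(self) -> float:
        """Fraction of requests allowed to fail over the window."""
        return 1.0 - self.target

    def allowed_downtime_minutes(self) -> float:
        """The same budget expressed as minutes of full outage per window."""
        return self.window_days * 24 * 60 * self.error_budget


def burn_rate(bad: int, total: int, slo: AvailabilitySLO) -> float:
    """How many times faster than 'exactly on budget' errors are accruing."""
    if total == 0:
        return 0.0
    return (bad / total) / slo.error_budget


def should_page(long_window_rate: float, short_window_rate: float,
                threshold: float) -> bool:
    """Multi-window rule: page only if both windows exceed the threshold,
    so a brief blip that has already recovered does not wake anyone up."""
    return long_window_rate >= threshold and short_window_rate >= threshold


if __name__ == "__main__":
    slo = AvailabilitySLO(target=0.999, window_days=30)
    print(round(slo.allowed_downtime_minutes(), 1), "minutes of budget per window")
    # 144 failures out of 10,000 requests in the measured window
    print(round(burn_rate(144, 10_000, slo), 1), "x burn rate")
```

A burn rate of 1.0 spends the budget exactly over the window; a sustained rate well above 1.0 is the signal for a page and, once the budget is gone, for the release freeze.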

Part 2 — API Design

  • Define the request/response schema, including idempotency, model/version selection, and metadata for traceability.
  • Specify the authentication and authorization approach.
  • Specify rate limiting and quotas.
  • Describe multitenancy: tenant isolation, quotas, and model routing.
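One possible shape for the request/response contract is sketched below. Every field name, the `name:version` convention for model references, and the `handle` helper are illustrative assumptions rather than a required schema; the point is that the caller may send an alias while the response always reports the immutable version that actually scored the request.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class PredictRequest:
    tenant_id: str
    model: str                       # alias ("fraud:prod") or pinned version ("fraud:17")
    entity_ids: list                 # keys for precomputed features in the online store
    features: dict                   # request-time features taken from the payload
    idempotency_key: str             # lets a retried call be deduplicated
    deadline_ms: int = 100           # caller's latency budget (assumed default)
    trace_id: Optional[str] = None


@dataclass(frozen=True)
class PredictResponse:
    scores: list
    model_version: str               # resolved immutable version, never the alias
    degraded: bool                   # True when a fallback produced the scores
    idempotency_key: str             # echoed back for traceability
    trace_id: Optional[str]


def resolve_model(model: str, aliases: dict) -> str:
    """Map an alias to its current version; pinned versions pass through."""
    return aliases.get(model, model)


def handle(req: PredictRequest, aliases: dict,
           score_fn: Callable[[str, str, dict], float]) -> PredictResponse:
    """Score each entity and echo the traceability metadata."""
    version = resolve_model(req.model, aliases)
    scores = [score_fn(version, eid, req.features) for eid in req.entity_ids]
    return PredictResponse(scores=scores, model_version=version, degraded=False,
                           idempotency_key=req.idempotency_key,
                           trace_id=req.trace_id)
```

Separating `degraded` from the HTTP status keeps a degraded-but-correct response visible as its own signal instead of being counted as a failure.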

What This Part Should Cover

  • A clean request/response contract with idempotency, alias-vs-version model selection, and traceability metadata echoed back to the caller.
  • AuthN/AuthZ for external callers and a separate internal service-to-service identity story.
  • Rate limiting / quotas with a defensible primitive (e.g., per-tenant token buckets) and explicit overflow behavior (429s, priority classes).
  • A multitenancy model that decides per tenant where hard isolation is required vs. shared-pool-with-quota, and how tenant_id routes to a model.
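A per-tenant token bucket, the primitive named above, can be sketched in a few lines. This is a single-process illustration with hypothetical class names; a real deployment would need shared or sharded state across API replicas, which this sketch does not attempt. The injected `clock` exists only to make the behavior testable.

```python
import time


class TokenBucket:
    """Refills at `rate_per_s` tokens per second up to `burst` tokens."""

    def __init__(self, rate_per_s: float, burst: float, clock=time.monotonic):
        self.rate = rate_per_s
        self.capacity = burst
        self.tokens = float(burst)
        self.clock = clock
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Credit tokens for the elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class TenantLimiter:
    """Maps tenant_id to its own bucket so one tenant cannot starve another."""

    def __init__(self, limits: dict, clock=time.monotonic):
        # limits: tenant_id -> (rate_per_s, burst); values are assumptions per tenant
        self.buckets = {t: TokenBucket(r, b, clock) for t, (r, b) in limits.items()}

    def check(self, tenant_id: str) -> int:
        """Return the HTTP status the API tier would send."""
        bucket = self.buckets.get(tenant_id)
        if bucket is None:
            return 403            # unknown tenant: reject rather than share a default
        return 200 if bucket.allow() else 429
```

Priority classes could be layered on top by giving each class its own bucket per tenant and checking the high-priority bucket first.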

Part 3 — Architecture

  • Load balancing and edge protections (global routing, WAF, DDoS, request validation).
  • Stateless API tier design.
  • Feature retrieval from the online store: consistency model and TTLs.
  • Model serving choices (CPU vs GPU), dynamic batching, quantization, and caching.
  • Autoscaling strategies for the API tier, feature store, and model servers.
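The dynamic-batching trade-off can be made concrete with a small flush policy: a batch is sent to the model when it is full or when its oldest request has waited a fixed maximum, whichever comes first. The sketch below is a simplified, single-threaded model of that policy with hypothetical names and caller-supplied timestamps; it is not how any particular serving framework implements batching.

```python
class MicroBatcher:
    """Flush when the batch is full or the oldest item has waited too long."""

    def __init__(self, max_batch: int, max_wait_ms: float):
        self.max_batch = max_batch
        self.max_wait_ms = max_wait_ms
        self.pending = []            # list of (arrival_ms, item)

    def add(self, now_ms: float, item):
        """Queue an item; returns a batch if this arrival triggers a flush."""
        self.pending.append((now_ms, item))
        return self._maybe_flush(now_ms)

    def tick(self, now_ms: float):
        """Called periodically so a partial batch is not held past max_wait_ms."""
        return self._maybe_flush(now_ms)

    def _maybe_flush(self, now_ms: float):
        if not self.pending:
            return None
        full = len(self.pending) >= self.max_batch
        expired = now_ms - self.pending[0][0] >= self.max_wait_ms
        if full or expired:
            batch = [item for _, item in self.pending[:self.max_batch]]
            self.pending = self.pending[self.max_batch:]
            return batch
        return None
```

Here `max_wait_ms` is a direct addition to tail latency for requests that arrive when traffic is light, which is why it should come out of the inference stage's latency allocation rather than being tuned for throughput alone.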

What This Part Should Cover

  • An additive per-stage latency budget that sums to the p95 target, with the two highest-risk stages (feature fetch, inference) given hard deadlines and circuit breakers.
  • A stateless API tier plus global/edge protections (LB, WAF, DDoS, schema validation).
  • A feature-retrieval design with a stated consistency model, TTL-bounded staleness, and batched reads (single round trip).
  • A justified CPU-vs-GPU serving split with an explicit batching/quantization choice tied to the latency budget, not throughput alone.
  • Per-tier autoscaling keyed on the earliest leading indicator of tail pain (not just CPU), plus a cold-start mitigation.
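An additive latency budget can be written down and checked mechanically. The stage names and millisecond allocations below are purely illustrative assumptions for a 100 ms p95 target, not numbers from the question. Summing per-stage allocations is a planning heuristic: percentiles are not strictly additive, so the sum should be validated against measured end-to-end latency.

```python
P95_TARGET_MS = 100   # assumed end-to-end p95 target

# Hypothetical per-stage allocations; they must sum to the target.
STAGE_BUDGET_MS = {
    "edge_and_lb": 5,
    "auth_and_rate_limit": 3,
    "feature_fetch": 20,
    "inference": 50,
    "postprocess_and_serialize": 5,
    "network_and_headroom": 17,
}


def unallocated_ms(budget: dict, target_ms: int) -> int:
    """Positive means slack, negative means the budget is over-committed."""
    return target_ms - sum(budget.values())


def stage_deadline_ms(stage: str, elapsed_ms: float, budget: dict,
                      target_ms: int) -> float:
    """Timeout for a stage: its own allocation, capped by what is left overall.

    If earlier stages overran, later stages get less time instead of silently
    pushing the whole request past the target.
    """
    return max(0.0, min(budget[stage], target_ms - elapsed_ms))


if __name__ == "__main__":
    print("unallocated:", unallocated_ms(STAGE_BUDGET_MS, P95_TARGET_MS), "ms")
    print("inference timeout after 60 ms elapsed:",
          stage_deadline_ms("inference", 60, STAGE_BUDGET_MS, P95_TARGET_MS), "ms")
```

The value returned by `stage_deadline_ms` is what would be passed as the hard deadline to the feature-fetch and inference calls, the two stages singled out above.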

Part 4 — Release Safety and Experimentation

  • Model versioning and registry (what metadata an immutable version stores).
  • Canary / shadow deployment and rollback criteria.
  • Online A/B: assignment, per-arm metrics, and guardrails.

What This Part Should Cover

  • A registry of immutable versions storing the metadata needed to reproduce and gate a model (schema signature, training-data hash, offline metrics, provenance), with aliases decoupled from artifacts.
  • A clear shadow-vs-canary distinction with explicit auto-promote / auto-rollback criteria and a warm last-known-good for sub-second rollback.
  • An online A/B design with deterministic assignment, per-arm business and infra/calibration metrics, and a kill-switch.
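Deterministic assignment is commonly done by hashing the experiment name together with a stable unit ID and mapping the hash onto the arm weights. The sketch below assumes SHA-256 and a list of `(name, weight)` pairs whose weights sum to 1.0; `assign_arm` is a hypothetical helper.

```python
import hashlib


def assign_arm(unit_id: str, experiment: str, arms: list) -> str:
    """Return the same arm for the same (experiment, unit_id) on every call.

    arms: list of (name, weight) with weights summing to 1.0.
    Salting the hash with the experiment name keeps assignments in different
    experiments independent of each other.
    """
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64   # roughly uniform in [0, 1]
    cumulative = 0.0
    for name, weight in arms:
        cumulative += weight
        if bucket < cumulative:
            return name
    return arms[-1][0]   # guard against floating-point rounding at the top end


if __name__ == "__main__":
    arms = [("control", 0.9), ("treatment", 0.1)]
    print(assign_arm("user-42", "ranker-v2-test", arms))
```

Because assignment needs no stored state, any stateless API replica computes the same arm, and a kill-switch reduces to setting the treatment weight to 0.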

Part 5 — Observability and Quality

  • Metrics, logs, and tracing, end-to-end and per stage.
  • Data / feature quality checks and drift detection.

What This Part Should Cover

  • Per-stage latency/error visibility (not just end-to-end) and distributed traces sliceable by tenant / model / experiment arm.
  • Online data-quality checks (nulls, type/range/cardinality violations) with a defined action on violation.
  • Drift detection via a distribution-distance metric on features and on the score distribution.
  • A train/serve-parity check (e.g., feature schema-hash match) that fails closed rather than scoring on malformed input.
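One widely used distribution-distance metric is the Population Stability Index (PSI), computed over matching histogram bins of a training-time baseline and a serving-time sample. The sketch below is one option among several (KL divergence or a Kolmogorov-Smirnov test would also fit the bullet above); the `eps` floor and any alert threshold are assumptions to be tuned per feature.

```python
import math


def psi(expected_counts: list, actual_counts: list, eps: float = 1e-6) -> float:
    """Population Stability Index between two histograms with identical bins.

    PSI = sum over bins of (q - p) * ln(q / p), where p is the baseline share
    and q is the serving share. It is 0 for identical distributions and grows
    as they diverge.
    """
    if len(expected_counts) != len(actual_counts):
        raise ValueError("histograms must use the same bins")
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    total = 0.0
    for e, a in zip(expected_counts, actual_counts):
        p = max(e / e_total, eps)   # floor avoids log(0) for empty bins
        q = max(a / a_total, eps)
        total += (q - p) * math.log(q / p)
    return total


if __name__ == "__main__":
    baseline = [100, 300, 400, 200]     # training-time feature histogram
    serving = [250, 300, 300, 150]      # same bins, recent serving traffic
    print(f"PSI = {psi(baseline, serving):.3f}")
```

The same function applies unchanged to the model's score distribution, which often surfaces drift before labels arrive.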

Part 6 — Cost and Reliability

  • Cost controls: utilization targets, right-sizing, caching, tiering.
  • Fallback behavior under partial outages or capacity shortfalls.

What This Part Should Cover

  • A steady-state utilization target with spike headroom and right-sizing, plus a unit-economics view (cost per 1k predictions) to make tiering/caching decisions data-driven.
  • Traffic tiering under pressure (cheaper/quantized models for low-value traffic, full-fat GPU reserved for high-value).
  • A fixed fallback ladder where every dependency has a timeout, circuit breaker, and defined fallback, with the invariant that no single stage failure fails the whole request.
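The fallback ladder and circuit breaker can be sketched together. In this simplified illustration a timeout is modeled as an exception raised by the rung's function, the breaker uses a plain consecutive-failure count with a cooldown, and all names are hypothetical. The last rung is assumed to be something that cannot fail, such as a static default score.

```python
import time


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures, probes after a cooldown."""

    def __init__(self, max_failures: int, cooldown_s: float, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: after the cooldown, let a probe request through.
        return self.clock() - self.opened_at >= self.cooldown_s

    def record(self, ok: bool) -> None:
        if ok:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()


def score_with_fallbacks(request, ladder: list) -> dict:
    """ladder: ordered list of (name, fn, breaker_or_None), best rung first."""
    for rung, (name, fn, breaker) in enumerate(ladder):
        if breaker is not None and not breaker.allow():
            continue                      # skip a dependency known to be unhealthy
        try:
            score = fn(request)           # fn is expected to enforce its own timeout
        except Exception:
            if breaker is not None:
                breaker.record(False)
            continue
        if breaker is not None:
            breaker.record(True)
        return {"score": score, "served_by": name, "degraded": rung > 0}
    raise RuntimeError("fallback ladder exhausted; the last rung must not fail")
```

The `degraded` flag is what keeps a fallback response out of the availability SLI while still making it visible as its own signal.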

Part 7 — Security and Compliance

  • Request security: mTLS, secrets management.
  • PII handling, retention, and auditability.
  • Regionalization / data-sovereignty and a disaster recovery plan.

What This Part Should Cover

  • Defense in depth: external TLS, internal mTLS with service identities, secrets in a KMS, and data minimization at the edge.
  • Concrete PII handling: tokenization, field-level encryption, named retention windows, and audit trails, not just "encrypt everything."
  • Regionalization / data-sovereignty: where PII and models are allowed to live, and how failover avoids cross-region PII copy.
  • Concrete RPO/RTO DR targets backed by tested failover and chaos drills.
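Tokenization and data minimization at the edge can be illustrated with a keyed hash: PII fields are replaced by stable tokens, and anything not explicitly allow-listed is dropped before the payload reaches logs or downstream services. The field names and helper functions here are hypothetical, and the key is passed in directly only for illustration; in practice it would be fetched from the KMS and scoped per region.

```python
import hashlib
import hmac

# Assumed set of payload fields treated as PII for this sketch.
PII_FIELDS = {"email", "phone"}


def tokenize(value: str, key: bytes) -> str:
    """Keyed, deterministic token: the same input and key give the same token,
    so the token can still serve as a join key without exposing the raw value."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def minimize(payload: dict, key: bytes, allowed: set) -> dict:
    """Tokenize PII fields, pass through allow-listed fields, drop the rest."""
    out = {}
    for name, value in payload.items():
        if name in PII_FIELDS:
            out[name + "_token"] = tokenize(str(value), key)
        elif name in allowed:
            out[name] = value
        # anything else is dropped: data minimization by default
    return out
```

Using a different key per region means tokens cannot be joined across regions, which lines up with the requirement that failover must not copy PII across regions.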

What a Strong Answer Covers

These dimensions span all seven parts; the interviewer is listening for them throughout, not in any single section:

  • A back-of-envelope capacity estimate that sizes each tier against the traffic it actually serves (not assuming every request hits a GPU).
  • End-to-end coherence — the SLOs, latency budget, capacity estimate, and fallback ladder are mutually consistent rather than designed in isolation.
  • Explicit trade-off reasoning throughout (batching vs. tail latency, dedicated vs. shared pools, caching vs. staleness, quantization vs. accuracy).
  • Clearly stated assumptions that materially influence the design, surfaced rather than buried.
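A back-of-envelope capacity estimate of the kind described above might look like the sketch below. The peak of 30k RPS follows from the stated 10k RPS steady state with a 3x spike; the GPU traffic fraction, per-replica throughput, and utilization target are invented assumptions that would need to be replaced with measured values.

```python
import math


def replicas_needed(peak_rps: float, traffic_fraction: float,
                    per_replica_rps: float, target_utilization: float) -> int:
    """Replicas required so that the tier's share of peak traffic keeps each
    replica at or below the target utilization."""
    tier_rps = peak_rps * traffic_fraction
    usable_rps_per_replica = per_replica_rps * target_utilization
    return math.ceil(tier_rps / usable_rps_per_replica)


if __name__ == "__main__":
    peak_rps = 10_000 * 3          # upper steady-state figure times the 3x spike

    # Assumption: a quarter of requests hit deep (GPU) models, the rest CPU models.
    gpu = replicas_needed(peak_rps, traffic_fraction=0.25,
                          per_replica_rps=1000, target_utilization=0.5)
    cpu = replicas_needed(peak_rps, traffic_fraction=0.75,
                          per_replica_rps=2000, target_utilization=0.5)
    print(f"GPU replicas: {gpu}, CPU replicas: {cpu}")
```

Sizing each tier by its own traffic fraction avoids the common mistake of provisioning GPUs as if every request hit a deep model.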

Follow-up Questions

Be ready for deeper probes after the main design:

  • How does this change at 10x scale (or 100x)? What breaks first — the feature store, GPU pool, or the API tier — and how would you re-architect?
  • A canary's business KPI improves but p99 latency regresses. How does your rollout policy resolve that conflict (Part 4), and what's the automated action?
  • Walk through what happens when the online feature store loses a shard mid-request. Trace it through your fallback ladder (Part 6) and say which SLI (if any) it dings.
  • Where would you add caching, and where is it actively harmful? Justify against hit rate and staleness for personalized, per-request scoring.
