PracHub

Design a low-latency ML inference API

Last updated: Mar 29, 2026

Quick Overview

This question evaluates ML system design for real-time, low-latency inference APIs: multitenancy, SLO/SLI definition, feature retrieval, model serving and rollout strategies, observability, cost control, and security/compliance.


Company: Anthropic

Role: Software Engineer

Category: ML System Design

Difficulty: hard

Interview Round: Onsite

Design a low-latency ML inference API for real-time predictions. Specify target SLOs (p50/p95 latency, availability), request/response schema, authentication, rate limiting, and multitenancy. Propose an architecture covering load balancing, stateless API tier, feature retrieval, model serving (CPU/GPU), batching, quantization, caching, and autoscaling strategies. Explain model versioning, canary/rollbacks, online A/B, observability (metrics, tracing, drift, data-quality checks), cost controls, and fallback behavior during partial outages. Address security, PII handling, regionalization, and disaster recovery.


Posted: Sep 6, 2025

System Design: Low‑Latency ML Inference API (Real‑Time)

Context

You are designing an in‑region, synchronous inference API used by product surfaces (e.g., ranking, fraud checks, personalization) that require tight latency and high availability. The service must support multiple tenants, safe model rollouts, and strong observability, while controlling cost.

Requirements

  1. Target SLOs
  • Propose p50/p95 (and optionally p99) end‑to‑end latency targets and availability targets.
  • Define SLIs and error budgets.
  2. API
  • Define request/response schema, including idempotency, model/version selection, and metadata for traceability.
  • Authentication and authorization approach.
  • Rate limiting and quotas.
  • Multitenancy (tenant isolation, quotas, and model routing).
  3. Architecture
  • Load balancing and edge protections.
  • Stateless API tier design.
  • Feature retrieval (online store), consistency, and TTLs.
  • Model serving choices (CPU/GPU), dynamic batching, quantization, caching.
  • Autoscaling strategies for API, feature store, and model servers.
  4. Release Safety and Experimentation
  • Model versioning and registry.
  • Canary/shadow, rollback criteria.
  • Online A/B (assignment, metrics, guardrails).
  5. Observability and Quality
  • Metrics, logs, tracing (end‑to‑end and per stage).
  • Data/feature quality checks and drift detection.
  6. Cost and Reliability
  • Cost controls (utilization targets, right‑sizing, caching, tiering).
  • Fallback behavior under partial outages or capacity shortfalls.
  7. Security and Compliance
  • Request security, mTLS, secrets management.
  • PII handling, retention, and auditability.
  • Regionalization/data‑sovereignty and disaster recovery plan.
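To make the SLO/error-budget item concrete, a quick back-of-envelope calculation (the 99.9% target is an assumed example, not a value from the prompt):

```python
# Error-budget arithmetic for an assumed 99.9% monthly availability SLO.
slo_availability = 0.999
minutes_per_month = 30 * 24 * 60                      # 43,200 minutes
error_budget_min = minutes_per_month * (1 - slo_availability)
print(f"Monthly error budget: {error_budget_min:.1f} minutes")  # ~43.2
```

At 99.9%, roughly 43 minutes of full downtime per month exhausts the budget; burn-rate alerts are typically defined against this number.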
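One way the request/response pair from the API item might look. Every field name here is an illustrative assumption, not a published schema:

```python
import json

# Illustrative shapes for a synchronous inference call.
# Field names are assumptions chosen for this sketch.
request = {
    "request_id": "a1b2c3",          # client-supplied idempotency key
    "tenant_id": "acme",             # drives auth, quotas, and model routing
    "model": "ranker",               # logical model name
    "version": "2024-06-01",         # pin a version, or omit for "latest stable"
    "features": {"user_id": "u42", "context": {"surface": "feed"}},
    "timeout_ms": 50,                # client latency budget; server sheds work past it
}

response = {
    "request_id": "a1b2c3",          # echoed for idempotency and traceability
    "model_version": "2024-06-01",   # version actually served (a canary may differ)
    "predictions": [{"label": "relevant", "score": 0.93}],
    "latency_ms": 12,
    "trace_id": "t-789",             # links the call into distributed tracing
}

print(json.dumps(response, indent=2))
```

Echoing `request_id` and the served `model_version` in every response is what makes canary traffic and per-request debugging traceable after the fact.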
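A per-tenant token bucket is one common way to implement the rate-limiting/quota item; this minimal sketch assumes quotas are expressed as a steady rate plus a burst capacity:

```python
import time
from dataclasses import dataclass, field

@dataclass
class TokenBucket:
    """Per-tenant limiter: refills `rate` tokens/sec, bursts up to `capacity`."""
    rate: float
    capacity: float
    tokens: float = 0.0
    last: float = field(default_factory=time.monotonic)

    def __post_init__(self):
        self.tokens = self.capacity          # start with a full burst budget

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# One bucket per tenant, sized from that tenant's quota.
buckets = {"acme": TokenBucket(rate=100.0, capacity=20.0)}
print(buckets["acme"].allow())  # True while the burst budget lasts
```

In production the buckets usually live in a shared store (e.g. Redis) so all stateless API replicas enforce one quota per tenant, but the accounting is the same.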
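The dynamic-batching point in the architecture item can be sketched as a queue-draining coroutine that trades a small bounded wait (an assumed 5 ms here) for larger, more GPU-efficient batches:

```python
import asyncio

async def batcher(queue, model_fn, max_batch=8, max_wait_ms=5.0):
    """Drain the queue into batches of up to max_batch items, waiting at most
    max_wait_ms after the first request arrives, then run one batched call."""
    loop = asyncio.get_running_loop()
    while True:
        batch = [await queue.get()]                      # (payload, future)
        deadline = loop.time() + max_wait_ms / 1000.0
        while len(batch) < max_batch:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), remaining))
            except asyncio.TimeoutError:
                break
        payloads = [p for p, _ in batch]
        for (_, fut), out in zip(batch, model_fn(payloads)):
            fut.set_result(out)                          # resume each caller

async def demo():
    queue = asyncio.Queue()
    # Stand-in for a batched forward pass; doubles each payload.
    worker = asyncio.create_task(batcher(queue, lambda xs: [x * 2 for x in xs]))
    loop = asyncio.get_running_loop()
    futures = [loop.create_future() for _ in range(3)]
    for i, fut in enumerate(futures):
        await queue.put((i, fut))
    results = await asyncio.gather(*futures)
    worker.cancel()
    return results

print(asyncio.run(demo()))  # [0, 2, 4]
```

The `max_wait_ms` knob is the explicit latency/throughput trade-off: it adds at most that much queueing delay to p95 in exchange for fuller batches under load.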

Deliverables

  • A concrete proposal covering the above, including clearly stated numerical targets and trade‑offs.
  • Any assumptions you make that influence the design.


© 2026 PracHub. All rights reserved.