PracHub

Design a GPU inference API

Last updated: Mar 29, 2026

Quick Overview

This question evaluates a candidate's understanding of GPU-backed inference platforms, distributed systems and scheduling, model lifecycle and GPU memory management, autoscaling across heterogeneous nodes, multi-tenant isolation, and operational concerns such as observability, reliability, cost controls, and security.


Design a GPU inference API

Company: Anthropic

Role: Software Engineer

Category: ML System Design

Difficulty: hard

Interview Round: Onsite

Design a GPU-backed inference API for serving multiple ML models. Requirements: low-latency online inference with clear SLOs, high throughput via dynamic batching, model versioning and A/B routing, autoscaling across heterogeneous GPU nodes, queueing and backpressure, multi-tenant isolation and quotas, model loading/warmup, GPU memory management (weights and KV cache), and fault tolerance. Describe the architecture (API gateway, scheduler, batching layer, workers, model registry, control plane), request flow, streaming vs unary RPC, observability (metrics/traces/logs), canarying and rollbacks, cost controls, and security considerations.
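The dynamic-batching requirement is easiest to make concrete with a minimal sketch: an async micro-batcher that flushes when either a size cap or a wait deadline is hit, trading a few milliseconds of queueing delay for larger GPU batches. The names here (`MicroBatcher`, `run_model`, the caps) are illustrative assumptions, not part of the question:

```python
import asyncio


class MicroBatcher:
    """Groups concurrent requests into batches bounded by size and wait time."""

    def __init__(self, run_model, max_batch=8, max_wait_ms=5):
        self.run_model = run_model          # callable: list[input] -> list[output]
        self.max_batch = max_batch
        self.max_wait = max_wait_ms / 1000
        self.queue = asyncio.Queue()
        self.worker = None

    async def infer(self, x):
        # Lazily start the background batching loop on first use.
        if self.worker is None:
            self.worker = asyncio.create_task(self._loop())
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((x, fut))
        return await fut

    async def _loop(self):
        while True:
            # Block for the first request, then collect more until the
            # batch is full or the wait deadline expires.
            item = await self.queue.get()
            batch = [item]
            deadline = asyncio.get_running_loop().time() + self.max_wait
            while len(batch) < self.max_batch:
                timeout = deadline - asyncio.get_running_loop().time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            inputs = [x for x, _ in batch]
            outputs = self.run_model(inputs)   # one kernel launch per batch
            for (_, fut), out in zip(batch, outputs):
                fut.set_result(out)
```

In a real serving stack the batcher would also be SLO-aware (smaller deadlines for latency-sensitive tenants) and would group only requests bound for the same model version.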

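For the multi-tenant quota requirement, admission control at the gateway is commonly sketched as a per-tenant token bucket, with request cost proportional to estimated tokens or GPU-seconds. A minimal sketch, with class and parameter names chosen for illustration:

```python
import time


class TokenBucket:
    """Per-tenant rate limiter: admit a request only if its token cost fits."""

    def __init__(self, rate_per_s, burst):
        self.rate = rate_per_s        # sustained quota, tokens per second
        self.capacity = burst         # short-burst allowance
        self.tokens = float(burst)
        self.last = time.monotonic()

    def try_admit(self, cost):
        # Refill proportionally to elapsed time, capped at burst capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False                  # caller sheds (429) or queues the request
```

Rejected requests can be shed immediately or parked in a per-tenant queue, which also gives the scheduler a clean signal for the SLO-aware backpressure the question asks about.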

Related Interview Questions

  • Design GPU inference request batching - Anthropic
  • How do you handle an LLM agents interview? - Anthropic (hard)
  • Design a prompt playground - Anthropic (medium)
  • Design a model downloader - Anthropic (medium)
  • Design a high-concurrency LLM inference service - Anthropic (hard)
Posted: Aug 1, 2025

System Design: GPU-Backed Multi-Model Inference API

Context

Design a production-grade inference platform for serving multiple ML models (e.g., LLMs, vision, and classic DL models) backed by GPUs. The platform must meet strict latency SLOs for online traffic while achieving high throughput via dynamic batching. It should support model versioning with A/B routing, autoscale across heterogeneous GPU nodes, provide isolation and quotas for multiple tenants, and remain fault-tolerant.

Assume global deployment in at least two regions, gRPC/HTTP-based clients, and a mix of streaming and unary requests.
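The streaming-vs-unary choice is easiest to see in miniature: a unary RPC holds the response until the last token is decoded, while a server-streaming RPC yields each token as it is produced, which is what keeps time-to-first-token low for LLM traffic. A toy sketch of both shapes, where the function names and the fake decode loop are illustrative stand-ins:

```python
import asyncio


async def generate_tokens(prompt, n=5, step_ms=10):
    """Stands in for per-token decoding on a GPU worker."""
    for i in range(n):
        await asyncio.sleep(step_ms / 1000)
        yield f"tok{i}"


async def unary_infer(prompt):
    # Client sees nothing until the full completion is assembled.
    return [t async for t in generate_tokens(prompt)]


async def streaming_infer(prompt):
    # Client sees the first token after a single decode step (low TTFT).
    async for tok in generate_tokens(prompt):
        yield tok
```

In practice the gateway would expose both: streaming (gRPC server streaming or SSE) for interactive LLM traffic where TTFT dominates the SLO, and unary for short classification or embedding calls where a single response is simpler for clients.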

Requirements

  • SLOs and Latency: Low-latency online inference with clear SLOs (e.g., P95 time-to-first-byte/token and completion). High availability.
  • Throughput: Dynamic batching while respecting SLOs.
  • Models: Multiple models, versioning, A/B testing and routing.
  • Autoscaling: Across heterogeneous GPU node types (e.g., A10, A100, H100), with pre-warming and scale-to-zero support.
  • Queueing and Backpressure: Fairness and SLO-aware admission control.
  • Multi-tenant Isolation: Per-tenant quotas, budgets, and safety limits.
  • Lifecycle: Model loading, warmup, cache priming.
  • GPU Memory Management: Weights residency, KV-cache planning, and eviction.
  • Fault Tolerance: Retries, draining, canarying, and rollbacks.
  • Architecture: Describe API gateway, scheduler/router, batching layer, workers, model registry, and control plane.
  • Protocols: Streaming vs unary RPC.
  • Observability: Metrics, traces, logs.
  • Cost Controls: Efficiency, scaling policies, and spend guardrails.
  • Security: AuthN/Z, isolation, artifact integrity, and data protection.
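The GPU-memory bullet ultimately comes down to arithmetic: weights are resident once, but KV cache grows with sequence length and concurrency, so the scheduler needs a capacity model per node type. A back-of-envelope sketch for a standard multi-head/GQA transformer; the example model shape and the 90% headroom factor are illustrative assumptions:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, dtype_bytes=2):
    """Per-sequence KV-cache footprint: K and V tensors for every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes


def max_concurrent_seqs(gpu_bytes, weight_bytes, per_seq_kv_bytes, headroom=0.9):
    """Full-length sequences that fit after weights, keeping headroom
    for activations, fragmentation, and framework overhead."""
    free = gpu_bytes * headroom - weight_bytes
    return max(0, int(free // per_seq_kv_bytes))
```

For example, a 32-layer model with 32 KV heads, head dimension 128, and a 4096-token context in fp16 costs 2 GiB of KV cache per sequence; an 80 GiB GPU holding 14 GiB of weights then fits about 29 concurrent full-length sequences. Numbers like these drive both admission control (how many sequences to admit per worker) and eviction policy (which idle KV blocks to drop first).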

Solution


