Design a GPU inference API
Company: Anthropic
Role: Software Engineer
Category: ML System Design
Difficulty: hard
Interview Round: Onsite
Design a GPU-backed inference API for serving multiple ML models. Requirements: low-latency online inference with clear SLOs, high throughput via dynamic batching, model versioning and A/B routing, autoscaling across heterogeneous GPU nodes, queueing and backpressure, multi-tenant isolation and quotas, model loading and warmup, GPU memory management (weights and KV cache), and fault tolerance. Describe the architecture (API gateway, scheduler, batching layer, workers, model registry, control plane), the request flow, streaming vs unary RPC, observability (metrics/traces/logs), canarying and rollbacks, cost controls, and security considerations.
Quick Answer: This question evaluates a candidate's understanding of GPU-backed inference platforms: distributed scheduling and dynamic batching, model lifecycle and GPU memory management (weights and KV cache), autoscaling across heterogeneous nodes, multi-tenant isolation and quotas, and operational concerns such as observability, reliability, cost controls, and security.
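To make the batching and backpressure requirements concrete, here is a minimal asyncio sketch of a dynamic batching layer with a bounded admission queue. It is an illustration under stated assumptions, not a production design: the names (BatchScheduler, InferenceRequest, run_model) and the tuning constants are hypothetical, and the GPU forward pass is stubbed out.

```python
import asyncio
import time
from dataclasses import dataclass, field

MAX_BATCH_SIZE = 32     # assumed per-model limit, tuned from profiling
MAX_QUEUE_DEPTH = 256   # bounded queue depth -> backpressure signal
BATCH_TIMEOUT_S = 0.005 # max wait (5 ms) to accumulate a batch

@dataclass
class InferenceRequest:
    payload: dict
    future: asyncio.Future = field(default_factory=asyncio.Future)
    enqueued_at: float = field(default_factory=time.monotonic)

class BatchScheduler:
    def __init__(self) -> None:
        # Bounded queue: when full, admission fails fast instead of
        # letting queueing delay grow unbounded (backpressure).
        self.queue: asyncio.Queue = asyncio.Queue(maxsize=MAX_QUEUE_DEPTH)

    async def submit(self, payload: dict):
        req = InferenceRequest(payload)
        try:
            self.queue.put_nowait(req)
        except asyncio.QueueFull:
            # Gateway would map this to HTTP 429 / gRPC RESOURCE_EXHAUSTED.
            raise RuntimeError("overloaded: shedding load")
        return await req.future  # resolved by the batch loop

    async def batch_loop(self) -> None:
        while True:
            # Block for the first request, then accumulate until the batch
            # is full or the timeout expires: a small, bounded latency hit
            # traded for GPU throughput.
            batch = [await self.queue.get()]
            deadline = time.monotonic() + BATCH_TIMEOUT_S
            while len(batch) < MAX_BATCH_SIZE:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            try:
                results = await run_model([r.payload for r in batch])
            except Exception as exc:
                for req in batch:
                    req.future.set_exception(exc)
                continue
            for req, result in zip(batch, results):
                req.future.set_result(result)

async def run_model(payloads):
    # Placeholder for the actual batched GPU forward pass.
    await asyncio.sleep(0.002)
    return [{"output": p} for p in payloads]

async def main():
    sched = BatchScheduler()
    asyncio.create_task(sched.batch_loop())
    print(await sched.submit({"prompt": "hello"}))

if __name__ == "__main__":
    asyncio.run(main())
```

The MAX_BATCH_SIZE / BATCH_TIMEOUT_S pair is the knob that trades tail latency against GPU utilization and is typically tuned per model; the queue-full rejection is the backpressure mechanism that keeps SLOs intact under overload rather than silently degrading everyone's latency.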