Design a low-latency ML inference API
Company: Anthropic
Role: Software Engineer
Category: ML System Design
Difficulty: hard
Interview Round: Onsite
Design a low-latency ML inference API for real-time predictions. Specify target SLOs (p50/p95 latency, availability), request/response schema, authentication, rate limiting, and multitenancy. Propose an architecture covering load balancing, stateless API tier, feature retrieval, model serving (CPU/GPU), batching, quantization, caching, and autoscaling strategies. Explain model versioning, canary/rollbacks, online A/B, observability (metrics, tracing, drift, data-quality checks), cost controls, and fallback behavior during partial outages. Address security, PII handling, regionalization, and disaster recovery.
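One concrete way to pin down the request/response schema part of the question is a typed contract. The sketch below uses Python dataclasses; every field name (`model`, `inputs`, `tenant_id`, `request_id`, `model_version`) is an illustrative assumption, not a prescribed API:

```python
from dataclasses import dataclass, field, asdict
import json
import uuid

@dataclass
class PredictRequest:
    """Body of POST /v1/predict (illustrative field names)."""
    model: str       # logical model name, e.g. "fraud-scorer"
    inputs: dict     # raw features, or entity keys to resolve via the feature store
    tenant_id: str   # multitenancy: drives auth scoping, quotas, rate limits
    request_id: str = field(default_factory=lambda: str(uuid.uuid4()))  # idempotency + tracing

@dataclass
class PredictResponse:
    request_id: str      # echoed back for client-side correlation
    model_version: str   # exact version that served the request (audit, canary analysis)
    prediction: float
    latency_ms: float    # server-side latency, feeds SLO dashboards

req = PredictRequest(model="fraud-scorer", inputs={"user_id": "u123"}, tenant_id="acme")
print(json.dumps(asdict(req), indent=2))
```

Echoing `request_id` and `model_version` in every response makes canary comparisons and incident debugging tractable, since each prediction can be traced to the exact model build that produced it.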
Quick Answer: Define explicit SLOs up front (e.g., p50 ≤ 20 ms, p95 ≤ 100 ms, 99.9% availability) and design backward from them: a stateless, horizontally autoscaled API tier behind a load balancer; per-tenant authentication with token-bucket rate limiting and quota isolation; a low-latency online feature store fronted by a cache; GPU model servers using dynamic batching and quantization to trade a small latency budget for throughput; immutable, versioned model artifacts rolled out via canary with automatic rollback and online A/B evaluation; and full observability (latency/error metrics, distributed tracing, drift and data-quality monitors) with cost dashboards per tenant. During partial outages, degrade gracefully to cached predictions or a smaller fallback model. Keep PII encrypted and region-pinned, and replicate models and feature data across regions for disaster recovery.
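The dynamic batching point is worth being able to sketch in an interview: the server trades a bounded amount of latency (a max wait) for GPU throughput by grouping requests. Below is a minimal single-threaded sketch under assumed names and defaults; production servers (e.g., NVIDIA Triton) provide this natively:

```python
import queue
import threading
import time

class MicroBatcher:
    """Groups individual requests into batches for the model server.

    Flushes when either max_batch_size requests have accumulated or
    max_wait_ms has elapsed since the first queued request -- the core
    latency/throughput trade-off knob. Sketch only, not production code.
    """

    def __init__(self, predict_batch, max_batch_size=8, max_wait_ms=5.0):
        self.predict_batch = predict_batch  # fn: list[input] -> list[output]
        self.max_batch_size = max_batch_size
        self.max_wait_ms = max_wait_ms
        self.q = queue.Queue()

    def submit(self, x):
        """Enqueue one input; caller waits on slot["done"] for the result."""
        slot = {"input": x, "output": None, "done": threading.Event()}
        self.q.put(slot)
        return slot

    def run_once(self):
        """Drain one batch (blocks until at least one request arrives)."""
        batch = [self.q.get()]  # first item starts the wait-time clock
        deadline = time.monotonic() + self.max_wait_ms / 1000.0
        while len(batch) < self.max_batch_size:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(self.q.get(timeout=remaining))
            except queue.Empty:
                break  # wait budget exhausted; flush a partial batch
        outputs = self.predict_batch([s["input"] for s in batch])
        for slot, out in zip(batch, outputs):
            slot["output"] = out
            slot["done"].set()
```

Tuning `max_wait_ms` against the p95 target is the key exercise: the wait budget is spent directly out of the latency SLO, so it must stay well below the headroom left after network, feature retrieval, and model compute.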