Design a scalable service and model its performance
Company: Anthropic
Role: Machine Learning Engineer
Category: System Design
Difficulty: hard
Interview Round: Onsite
Design a highly available, multi-region service that handles 50k peak QPS with a p95 latency under 100 ms. Specify API design, storage schema, caching strategy, consistency model, data partitioning, failure handling, and rollout/canary strategies. Perform back-of-the-envelope capacity planning: estimate read/write ratios, data size growth over 12 months, peak vs. average load, instance sizing, and network egress. Build a performance model to predict end-to-end latency under load: decompose service time, apply queueing approximations (e.g., Little’s Law), and identify the bottlenecks. Propose concrete mitigations (e.g., batching, async workflows, indexes, autoscaling, circuit breaking) and define SLOs, monitoring, and load-testing plans to validate your model.
Quick Answer: This question evaluates a candidate's skills in large-scale distributed system design, performance modeling, capacity planning, and operational reliability for a latency-sensitive, multi-region service.
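For the performance-modeling portion, one common approach (a sketch, not the only acceptable answer) is to treat each tier of the request path as an M/M/1 queue, spread load evenly across instances, and sum the per-tier mean sojourn times. The three-tier path and all service times and fleet sizes below are hypothetical inputs chosen for illustration; only the 50k peak QPS and 100 ms p95 target come from the prompt.

```python
def mm1_latency_ms(service_time_ms: float, arrival_rate_qps: float,
                   servers: int) -> float:
    """Mean time in system (ms) for one M/M/1 server, with total load
    split evenly across `servers` identical instances."""
    per_server_rate = arrival_rate_qps / servers
    mu = 1000.0 / service_time_ms        # service rate per second
    rho = per_server_rate / mu           # utilization
    if rho >= 1:
        return float("inf")              # unstable: queue grows without bound
    return service_time_ms / (1 - rho)   # W = 1 / (mu - lambda)

PEAK_QPS = 50_000
# Hypothetical request path: (tier, service time ms, instance count).
tiers = [
    ("load_balancer", 0.5, 50),
    ("app_server",    5.0, 500),
    ("cache",         1.0, 100),
]
mean_ms = sum(mm1_latency_ms(s, PEAK_QPS, n) for _, s, n in tiers)  # 13 ms

# For a single M/M/1 queue the sojourn time is exponential, so
# p95 = W * ln(20) ~= 3W; applying that factor to the summed mean is a
# crude, conservative tail estimate rather than an exact p95.
import math
p95_estimate_ms = mean_ms * math.log(20)   # ~39 ms, inside the 100 ms SLO
```

The payoff of the model is spotting instability early: any tier whose per-instance utilization approaches 1 dominates latency (the function returns infinity), which points directly at the mitigations the prompt asks for, such as batching, adding replicas, or shedding load with circuit breakers.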