System Design: Multi-Region, 50k QPS, p95 < 100 ms
Context
Design an online, read-heavy key-value service (for example, a user profile or feature lookup) used by latency-sensitive applications worldwide. Clients connect from multiple continents. The service must be highly available across multiple regions and maintain low tail latency.
Assume small payloads (1–5 KB) and ID-based access patterns. Within a region, a session requires strong read-after-write consistency; cross-region consistency can be eventual.
Requirements
- Traffic target: 50k peak QPS (global), p95 latency under 100 ms.
- Multi-region, active-active, highly available design with zero single-region dependency.
- Include APIs, data model, caching, consistency, partitioning, failure handling, rollout/canary.
- Do back-of-the-envelope capacity planning (reads vs writes, growth, peak vs average, instance sizing, egress).
- Build a performance model to predict end-to-end latency under load (service time breakdown, queueing approximations such as Little’s Law), and identify bottlenecks.
- Propose concrete mitigations and define SLOs, monitoring, and load-testing to validate the model.
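As a starting point for the back-of-the-envelope work, the stated targets already bound two quantities. The sketch below applies Little's Law (L = λW) to the 50k QPS and 100 ms figures and multiplies QPS by the 1–5 KB payload range. It is a rough illustration, not a full model: Little's Law uses mean latency, so plugging in the p95 budget gives only an approximate ceiling on in-flight requests, and the egress figure assumes every request returns one payload with no protocol overhead. The function names are hypothetical.

```python
# Inputs taken from the problem statement.
PEAK_QPS = 50_000        # global peak queries per second
P95_BUDGET_S = 0.100     # p95 latency target, in seconds
PAYLOAD_KB_MIN = 1       # smallest payload, in KB
PAYLOAD_KB_MAX = 5       # largest payload, in KB


def in_flight_requests(arrival_rate_qps: float, latency_s: float) -> float:
    """Little's Law: average requests in the system L = lambda * W."""
    return arrival_rate_qps * latency_s


def egress_mb_per_s(qps: float, payload_kb: float) -> float:
    """Response bytes per second, assuming one payload per request (1 MB = 1000 KB)."""
    return qps * payload_kb / 1000.0


# Approximate ceiling on global concurrency if mean latency stays under the p95 budget.
concurrency_bound = in_flight_requests(PEAK_QPS, P95_BUDGET_S)

# Payload-only egress range at peak, before headers, TLS, or replication traffic.
egress_low = egress_mb_per_s(PEAK_QPS, PAYLOAD_KB_MIN)
egress_high = egress_mb_per_s(PEAK_QPS, PAYLOAD_KB_MAX)

print(f"in-flight requests at peak (bound): {concurrency_bound:.0f}")
print(f"payload egress at peak: {egress_low:.0f}-{egress_high:.0f} MB/s")
```

Under these assumptions the service holds on the order of 5,000 concurrent requests globally at peak and ships roughly 50–250 MB/s of payload, which gives a sanity check for connection pools and network sizing.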
Deliverables
- API design (CRUD, batch, idempotency, versioning, errors).
- Storage schema and indexing; partitioning strategy.
- Caching layers and invalidation strategy.
- Consistency model (regional and cross-region) and conflict resolution.
- Failure handling (zone, region, network partitions, thundering herd) and client resiliency.
- Rollout and canary strategies (schema and code).
- Capacity planning with numerical estimates: read/write ratio, data growth over 12 months, peak vs average, instance count and size, and network egress.
- Performance model with queueing approximations; identify bottlenecks.
- Mitigations (e.g., batching, async, indexes, autoscaling, circuit breaking).
- SLOs, monitoring, and load-testing plans to validate performance and availability.
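One possible shape for the capacity and queueing deliverables is sketched below. Every `ASSUMED_*` constant is a hypothetical placeholder, not a value from the problem statement: a per-worker service rate of 500 requests/s, a 50% target utilization, and three evenly loaded regions. Each worker is modeled as an independent M/M/1 queue, whose response time is exponentially distributed, so p95 equals the mean response time multiplied by ln 20. A shared-queue (M/M/c) model would typically predict lower latency, so this is a deliberately simple and somewhat pessimistic approximation.

```python
import math

PEAK_QPS = 50_000                   # from the requirements

# Hypothetical assumptions for illustration only.
ASSUMED_WORKER_RATE_QPS = 500       # requests/s one worker can serve (2 ms service time)
ASSUMED_TARGET_UTILIZATION = 0.5    # headroom target per worker
ASSUMED_REGIONS = 3                 # evenly loaded active-active regions


def workers_needed(peak_qps: float, worker_rate_qps: float, utilization: float) -> int:
    """Workers required so each runs at or below the target utilization."""
    return math.ceil(peak_qps / (worker_rate_qps * utilization))


def mm1_p95_s(worker_rate_qps: float, utilization: float) -> float:
    """p95 response time of an M/M/1 queue: mean S / (1 - rho), times ln(20)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    service_time_s = 1.0 / worker_rate_qps
    mean_response_s = service_time_s / (1.0 - utilization)
    return mean_response_s * math.log(20)


def per_region_qps_after_one_loss(peak_qps: float, regions: int) -> float:
    """Load each surviving region absorbs if one region fails and traffic spreads evenly."""
    if regions < 2:
        raise ValueError("need at least two regions to survive a region loss")
    return peak_qps / (regions - 1)


workers = workers_needed(PEAK_QPS, ASSUMED_WORKER_RATE_QPS, ASSUMED_TARGET_UTILIZATION)
p95_s = mm1_p95_s(ASSUMED_WORKER_RATE_QPS, ASSUMED_TARGET_UTILIZATION)
failover_qps = per_region_qps_after_one_loss(PEAK_QPS, ASSUMED_REGIONS)

print(f"workers needed globally: {workers}")
print(f"modeled server-side p95: {p95_s * 1000:.1f} ms")
print(f"per-region QPS after one region loss: {failover_qps:.0f}")
```

Raising the utilization argument toward 1 shows how quickly the modeled p95 grows, which is the kind of bottleneck the load tests are meant to confirm or refute; network round trips and cache or storage hops would be added on top of this server-side term.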