System Design: Global, High-Throughput Rate Limiter
Context
You are designing a global, multi-region rate-limiting service that enforces quotas:
-
Per-user (across all APIs)
-
Per-API (across all users)
-
Per-user-per-API (composite)
The system must be low-latency, tolerate bursts, and operate across multiple regions.
Requirements
Design and justify:
-
Algorithm choice (e.g., token bucket vs. leaky bucket) with burst tolerance and latency considerations.
-
Data model for quotas, runtime state, and metrics.
-
Partitioning and sharding strategy for a global deployment.
-
Clock/time-window semantics (fixed vs. sliding windows, monotonic time handling).
-
Consistency trade-offs (global correctness vs. latency) and how you bound overages.
-
Hot-key handling strategies.
-
Scaling strategies and capacity planning math, including how many machines are required if traffic doubles (2×), with clear assumptions and formulas.
-
Failure modes and mitigations.
-
Backpressure behavior and client guidance.
-
Observability: metrics, tracing, dashboards, and alerts.
-
APIs for configuration and metrics.
Assume a mixed workload with both regional and global traffic patterns, and that some users may send traffic from multiple regions concurrently.