Design a Distributed Rate Limiter
Company: OpenAI
Role: Software Engineer
Category: System Design
Interview Round: Technical Screen
Design a distributed rate limiting system for a large API platform.
The platform runs many API gateways and backend services across multiple regions. For every incoming request, the rate limiter must decide whether the request is **allowed** or **rejected**, based on configurable limits. Your design should support, at minimum:
- **Multi-dimensional limits** — requests per second scoped per user, per API key, per tenant, per IP address, or per endpoint (and combinations thereof).
- **Tier-aware limits** — different limits for different subscription tiers (e.g. free vs. pro vs. enterprise).
- **Burst handling** — short spikes above the steady-state rate should be tolerated up to a defined cap.
- **Dynamic configuration** — limits and rules can be updated without redeploying gateways or backend services.
Your design should address the **request flow**, the **rate-limiter API**, the **data model**, the **rate-limiting algorithm**, **distributed coordination** across gateways and regions, **consistency trade-offs**, **failure handling**, **scalability**, and **observability**.
```hint Frame the problem before the components
"Rate limiter" spans several products. Start by separating **functional** requirements (the allow/reject decision, the multi-dimensional + tiered limits, burst, dynamic config) from **non-functional** ones (latency budget, $10^6$ req/s throughput, availability, multi-region). Also state what's explicitly **out of scope** — e.g. volumetric DDoS defense lives upstream at the edge, not here.
```
```hint Let the numbers drive the architecture
Do a back-of-envelope pass: how many *rules* (and therefore counter operations) does one request trigger if several dimensions apply at once? Multiply by $10^6$ req/s. The op count plus the few-ms budget should force three conclusions about whether each check can touch a remote service, whether it can scan, and whether counter state can live on one node.
```
```hint Pick the algorithm deliberately
Lay the standard families side by side — fixed window, sliding-window log, sliding-window counter, leaky/token bucket — and grade each on burst tolerance, accuracy, and per-op cost. Watch for the fixed-window **boundary problem** (a window edge can admit up to 2× the limit). Ask which structure can express *both* a burst allowance and a sustained rate at once, and how a per-request weight/`cost` would ride on top for heterogeneous workloads. Reason about the properties; don't commit to one and design it out here.
```
```hint Make the check atomic and O(1)
The read-modify-write of a counter has to collapse into a **single atomic step** at the store, or two concurrent requests both read the old value and both pass. Think about what primitive your chosen store offers to make that one operation indivisible. Separately, ask whether you can derive elapsed-time effects (e.g. refills/decay) **on read** instead of running a background job per key.
```
```hint Scale state without breaking atomicity
A single node can't serve millions of checks/sec, so the counter state has to spread across many nodes. The constraint that shapes *how* you spread it: each atomic check must still land on one node. Ask what you'd partition on so a given counter always resolves to the same place, and what that choice does to a request that triggers several counters at once. Then stress it: a whale tenant or a shared API key concentrates traffic onto one partition — how do you relieve that, and could a gateway-side fast path keep most traffic off the store entirely?
```
```hint Multi-region is the hard part
A cross-region round trip per request is off the table, yet some limits are *global*. That tension forces a spectrum: at one end you enforce everything locally and reconcile global truth out-of-band; at the other you keep one authoritative counter everyone consults. Sketch where the endpoints land on latency, availability, and accuracy, pick a default, and say plainly which property you're trading away — and how a tenant that genuinely needs exact global limits could be handled as an exception.
```
### Constraints & Assumptions
- The platform serves up to **1 million requests per second (≈ $10^6$ req/s)** globally.
- The rate-limit check sits on the **hot path** of every request, so most checks must complete in **a few milliseconds** (single-digit-ms p99 budget).
- Gateways and backends are deployed across **multiple regions**; you cannot afford a cross-region round trip on every request.
- The limiter must not become a hard single point of failure for the whole platform.
- You may state any additional assumptions explicitly, as long as they don't drop a core requirement above.
### Clarifying Questions to Ask
A strong candidate scopes the problem before designing. Good questions to raise with the interviewer:
- **Accuracy vs. availability:** How strict must *global* enforcement be? Is slightly over-admitting under failure acceptable for ordinary APIs, or are some limits billing-grade and must be exact?
- **Failure default:** When the limiter itself is unavailable, should it **fail open** (allow) or **fail closed** (reject)? Is this uniform, or per-endpoint?
- **Heterogeneous cost:** Do different operations consume different amounts of capacity (a cheap read vs. an expensive model inference)? Should the limiter support a per-request cost/weight?
- **Multiple windows:** Can more than one time window apply to the same principal simultaneously (e.g. 100 req/s *and* 10,000 req/day)?
- **Multi-region routing:** Is a given tenant's traffic pinned to a home region, or can it land in any region?
- **Rule scale:** Roughly how many distinct rules/policies exist — thousands, or millions? (This decides whether the full ruleset can live in memory on every node.)
### What a Strong Answer Covers
The interviewer is looking for these **dimensions** (signals), not a single right answer:
- **Requirements split** — functional vs. non-functional, with explicit out-of-scope boundaries.
- **Capacity estimate** — request rate, checks per request, counter-op throughput, state size, and what each number implies for the design.
- **Request flow** — how a request is authenticated, how applicable rules are resolved, and how the allow/reject decision is composed when multiple rules apply.
- **API design** — the internal check interface (inputs/outputs) and the client-facing rejection contract (status code + headers).
- **Data model** — how rules/policies are represented (control plane) and how runtime counter state is keyed (data plane).
- **Algorithm choice** — a reasoned comparison of the standard options and a justified default, including how burst and tiering are modeled.
- **Atomicity** — why the check-and-consume step must be a single atomic operation, and what races occur if it isn't.
- **Distributed coordination & scaling** — sharding of counter state, hot-key mitigation, and any local fast path.
- **Multi-region & consistency trade-offs** — how global limits are enforced without a cross-region hop, and where accuracy is traded for latency/availability.
- **Failure handling** — explicit fail-open/fail-closed policy, timeouts/circuit breakers, and control-plane outage behavior.
- **Dynamic config** — validation, versioning, distribution, and safe rollout (shadow/dry-run).
- **Observability & abuse** — what is measured off the hot path; pitfalls of IP-only limiting.
### Follow-up Questions
Be ready for the interviewer to push deeper:
- A single shared API key or a whale tenant concentrates traffic onto **one shard**. What breaks first, and how do you keep enforcing the limit without that shard becoming a bottleneck?
- With a gateway-local token lease, requests can be admitted before the central store sees them. **By how much can the real limit be exceeded**, and which knob bounds that error? Separately, what happens to capacity if a gateway dies holding un-consumed leased tokens?
- A tenant's traffic suddenly shifts from one region to another (a regional outage or a routing change). How does your global limit hold up, and how fast does it adapt?
- For a **billing-grade** quota that must be exact globally, how does your design change — and what does that cost you in latency or availability?
- The control plane (config distribution) goes down for 30 minutes. What do the data-plane limiters do, and what's the worst-case user impact?
Quick Answer: This question evaluates competency in distributed systems architecture, API and data model design, performance engineering, and operational concerns such as rate limiting algorithms, consistency trade-offs, distributed coordination, failure handling, scalability, and observability.