Design a Rate Limiter with Per-User Token Quotas
Company: xAI
Role: Software Engineer
Category: System Design
Difficulty: medium
Interview Round: Onsite
# Design a Rate Limiter with Per-User Token Quotas
Design a distributed rate limiter for a high-traffic API platform, with one twist that drives the whole design: **every user can have a different quota**. For example, one premium user may be allowed to consume 10,000 tokens per second, while an ordinary user is limited to 100 tokens per second. The limiter must enforce each user's individual quota, decide allow/deny on every incoming request with minimal added latency, and keep working correctly when the API is served from many machines.
Walk through your design end to end: the rate-limiting algorithm, where per-user quotas are stored and how they reach the enforcement path, the data model for live counter state, the request-time decision flow, and how the system scales and degrades.
```hint Algorithm choice
A **token bucket with lazy refill** handles heterogeneous quotas naturally: capacity and refill rate are just *parameters of each user's bucket*. Different users get different numbers, not a different algorithm. Refill on read — `tokens = min(capacity, tokens + elapsed * refill_rate)` — so no background timers are needed.
```
```hint Separate the two planes
Split the problem into a **config plane** (the per-user quota definitions: tier defaults plus per-user overrides, changed rarely, cached aggressively) and a **data plane** (the live bucket counters, mutated on every request, kept in a fast shared store like Redis). Most candidates blur these together and get stuck.
```
```hint Atomicity and hot keys
The read-refill-check-decrement sequence must be **atomic** (e.g., a Redis Lua script or compare-and-swap), or concurrent requests will double-spend tokens. And a user allowed $10^4$ requests/second concentrates all that traffic on a single counter key — think about splitting one logical bucket into shards or letting gateways pre-claim token batches locally.
```
### Constraints & Assumptions
- Tens of millions of registered users; a small fraction (say, thousands) have elevated custom quotas — assume a tiered model (free / pro / enterprise) plus per-user overrides.
- Aggregate peak on the order of hundreds of thousands to ~1M rate-limit checks per second across the fleet (assumed; confirm with the interviewer).
- The rate-limit decision should add no more than ~1-2 ms to request latency.
- Quotas are expressed as tokens per second; a single request may cost more than one token (e.g., cost proportional to work).
- Quota changes (upgrades, manual overrides) should take effect within seconds to a minute — runtime-changeable, no redeploy.
- The API is served by many stateless nodes behind a load balancer; a user's requests may land on any node.
### Clarifying Questions to Ask
- Is the limit a hard guarantee (never exceed) or is small, bounded over-admission acceptable in exchange for lower latency?
- What should happen when the limiter's backing store is unavailable — fail open (admit traffic) or fail closed (reject)?
- Do users need burst headroom (short spikes above the sustained rate), or strict per-second smoothing?
- At what layer is the limit enforced — a shared API gateway, a sidecar, or inside each service?
- Is enforcement single-region, or must one user's quota be shared globally across regions?
- What should a rejected caller receive — an HTTP 429 with retry guidance, queuing, or degraded service?
### What a Strong Answer Covers
```premium-lock What a Strong Answer Covers
```
### Follow-up Questions
- How would you extend the design so a user's quota is enforced globally across multiple regions without adding cross-region latency to every request?
- How would you support burst allowances — e.g., a user sustained at 100 tokens/sec who may briefly spike to 1,000 — without changing the algorithm?
- A user is upgraded mid-flight from 100 to 10,000 tokens/sec. Trace exactly how and when the new quota takes effect in your design, and what happens to their in-flight bucket state.
- How would you rate-limit on multiple dimensions at once (per user, per IP, and per endpoint), and in what order would you evaluate them?
Quick Answer: This question evaluates a candidate's ability to design distributed, low-latency rate limiting with heterogeneous per-user quotas, covering competencies in state modeling, consistency, caching, and scalable architecture.