Design a Distributed Rate Limiter
Company: Anthropic
Role: Software Engineer
Category: System Design
Difficulty: medium
Interview Round: Onsite
# Design a Distributed Rate Limiter
Design a rate limiter that enforces a per-client request quota (for example, "at most 100 requests per minute per API key") across a **fleet of application servers** sitting behind a load balancer. Any server may receive any client's request, so the limit must be enforced **globally**, not per-server. Cover the limiting algorithm, the shared state, and — critically — how the system behaves correctly despite **clock skew between servers** and how it **degrades when the shared store fails**.
```hint Algorithm choice
Think about what data structure lets many servers check-and-increment one client's counter atomically and cheaply, while still expiring old data on its own.
```
```hint The two hard parts
The interesting requirements aren't the happy path — they're (1) clock skew across servers writing timestamps into shared state, and (2) what the limiter does the moment the central store (Redis) is unreachable.
```
### Constraints & Assumptions
- Tens to hundreds of stateless app servers behind a load balancer; a single client's requests can land on any of them.
- Limits are per-client (per API key / user / IP) and configurable; default example is `N` requests per rolling window `W`.
- A shared low-latency store (e.g. Redis, possibly clustered) holds the counters; the rate-limit check is on the hot path of every request, so it must add minimal latency.
- Servers' wall clocks are roughly but not perfectly synchronized (NTP, with drift up to some tens of milliseconds to seconds).
- The system must keep serving traffic even if the shared store has a partial or full outage.
### Clarifying Questions to Ask
- What exactly is a "client" (API key, user id, IP), and can a single client have multiple limits/tiers?
- Should the limit be a hard cap (reject over-limit) or soft (allow with shaping), and is a small amount of over-admission acceptable?
- What are the latency and availability budgets for the limiter itself — is "fail open" (allow on store outage) or "fail closed" (reject) preferred?
- What window semantics are required: fixed window, sliding window, or token-bucket-style burst allowance?
- What is the request volume and the cost of a slightly inaccurate count at the boundaries?
### What a Strong Answer Covers
```premium-lock What a Strong Answer Covers
```
### Follow-up Questions
- Compare fixed-window, sliding-window-log, sliding-window-counter, and token bucket — which would you pick for smoothing bursts and why, and what is the memory cost of each?
- With a local per-server fallback of `N / serverCount`, what goes wrong if `serverCount` is stale or servers are added/removed, and how do you keep that share roughly correct?
- How do you keep the limiter's own latency low when Redis is healthy — batching, pipelining, local short-TTL caching of "already over limit" verdicts?
- How would you support multiple tiers and burst allowances (e.g. 100/min sustained but allow a short burst of 20) in the same mechanism?
Quick Answer: This question evaluates a candidate's ability to design distributed systems with globally consistent state across multiple servers. It tests system design skills around atomic shared-state operations, reasoning about clock skew, and planning graceful degradation when a dependency fails. Commonly asked to assess practical, production-level architectural judgment rather than pure algorithmic knowledge.