You are asked to design and implement a rate limiter for an HTTP API.
Requirements
- Limit each client (e.g., user ID or API key) to at most `N` requests per `T` seconds.
- The rate limiter should run in memory (assume a single application instance to start).
- It must be:
  - Low latency (the per-request check must be fast).
  - Memory efficient, even when there are many distinct clients and very high traffic.
  - Easy to extend later to a distributed setting.
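Before diving into specific algorithms, it can help to pin down the interface these requirements imply. The following sketch is illustrative only (the names `RateLimiter` and `allow` are assumptions, not prescribed by the problem):

```python
from abc import ABC, abstractmethod

class RateLimiter(ABC):
    """Admit at most n requests per t seconds for each client."""

    def __init__(self, n: int, t: float):
        self.n = n  # max requests allowed per window
        self.t = t  # window length in seconds

    @abstractmethod
    def allow(self, client_id: str, now: float) -> bool:
        """Return True if the request arriving at time `now` is admitted."""
```

Passing `now` in explicitly (rather than reading the clock inside) keeps implementations easy to unit-test.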
The interviewer guides you through the following steps:
- Baseline design (sliding window):
  - Describe how you would implement a basic sliding-window rate limiter.
  - Be explicit about what data you store per client, how you decide whether to allow or reject a request given its timestamp, and the time and space complexity.
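For orientation, the baseline described above is often sketched with a per-client deque of recent timestamps. This is one possible shape, not the required answer (the class and field names are illustrative):

```python
import collections

class SlidingWindowLimiter:
    """Allow at most n requests in any t-second window, per client."""

    def __init__(self, n: int, t: float):
        self.n = n
        self.t = t
        # client_id -> deque of request timestamps still inside the window
        self.windows: dict[str, collections.deque] = {}

    def allow(self, client_id: str, now: float) -> bool:
        window = self.windows.setdefault(client_id, collections.deque())
        # Evict timestamps that have fallen out of the window [now - t, now].
        while window and window[0] <= now - self.t:
            window.popleft()
        if len(window) < self.n:
            window.append(now)
            return True
        return False
```

Note the space cost this makes visible: each client may hold up to `n` timestamps, which is exactly the memory concern the next step asks about.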
- Scalability concern:
  - Explain what happens when traffic becomes extremely large (e.g., millions of clients, each sending many requests per second).
  - Why might the naive sliding-window implementation run into memory and performance issues?
- Improved design (token bucket with lazy refill):
  - Redesign the rate limiter using a token bucket algorithm so that memory usage stays small and per-request operations remain O(1).
  - You should:
    - Describe the conceptual model of a token bucket (capacity, refill rate, token consumption on each request).
    - Specify exactly what you store per client (e.g., current token count, last refill timestamp, etc.).
    - Explain how to perform a *lazy refill*: instead of refilling tokens every second via a background job or timer, you only recalculate and refill tokens when a request for that client arrives, using the request's timestamp.
    - Show how the lazy refill calculation works for a request arriving at time `now`, given `last_refill_time`, `current_tokens`, the bucket `capacity`, and the `refill_rate` (tokens per second).
  - Discuss how this design keeps memory usage small under very large traffic, compared to the sliding-window implementation.
- Additional considerations:
  - How would you handle concurrency (multiple threads) accessing the same client's rate-limit state?
  - How would you clean up or evict inactive clients so the in-memory data structure does not grow without bound?
  - Briefly mention how you might extend this design to multiple application servers (e.g., by storing state in a shared store like Redis), but focus on the core token-bucket-with-lazy-refill design.
Provide a detailed, system-design level answer: data structures, algorithms, complexity, trade-offs between sliding window and token bucket, and how lazy refill works in practice.