You are asked to design and implement a rate limiter for an HTTP API.
Requirements
- Limit each client (e.g., user ID or API key) to at most `N` requests per `T` seconds.
- The rate limiter should run in memory (assume a single application instance to start).
- It must be:
  - Low latency (the per-request check must be fast).
  - Memory efficient, even when there are many distinct clients and very high traffic.
  - Easy to extend later to a distributed setting.
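Before diving into specific algorithms, it can help to pin down the interface these requirements imply. The following sketch is illustrative only (the names `RateLimiter` and `allow` are assumptions, not prescribed by the problem):

```python
from abc import ABC, abstractmethod

class RateLimiter(ABC):
    """Admit at most n requests per t seconds for each client."""

    def __init__(self, n: int, t: float):
        self.n = n  # max requests allowed per window
        self.t = t  # window length in seconds

    @abstractmethod
    def allow(self, client_id: str, now: float) -> bool:
        """Return True if the request arriving at time `now` is admitted."""
```

Passing `now` in explicitly (rather than reading the clock inside) keeps implementations easy to unit-test.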
The interviewer guides you through the following steps:
- Baseline design (sliding window):
  - Describe how you would implement a basic sliding-window rate limiter.
  - Be explicit about what data you store per client, how you decide whether to allow or reject a request given its timestamp, and the time and space complexity.
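For orientation, the baseline described above is often sketched with a per-client deque of recent timestamps. This is one possible shape, not the required answer (the class and field names are illustrative):

```python
import collections

class SlidingWindowLimiter:
    """Allow at most n requests in any t-second window, per client."""

    def __init__(self, n: int, t: float):
        self.n = n
        self.t = t
        # client_id -> deque of request timestamps still inside the window
        self.windows: dict[str, collections.deque] = {}

    def allow(self, client_id: str, now: float) -> bool:
        window = self.windows.setdefault(client_id, collections.deque())
        # Evict timestamps that have fallen out of the window [now - t, now].
        while window and window[0] <= now - self.t:
            window.popleft()
        if len(window) < self.n:
            window.append(now)
            return True
        return False
```

Note the space cost this makes visible: each client may hold up to `n` timestamps, which is exactly the memory concern the next step asks about.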
- Scalability concern:
  - Explain what happens when traffic becomes extremely large (e.g., millions of clients, each sending many requests per second).
  - Why might the naive sliding-window implementation run into memory and performance issues?
- Improved design (token bucket with lazy refill):
  - Redesign the rate limiter using a token bucket algorithm so that memory usage stays small and per-request operations remain O(1).
  - You should:
    - Describe the conceptual model of a token bucket (capacity, refill rate, token consumption on each request).
    - Specify exactly what you store per client (e.g., current token count, last refill timestamp, etc.).
    - Explain how to perform a *lazy refill*: instead of refilling tokens every second via a background job or timer, you only recalculate and refill tokens when a request for that client arrives, using the request's timestamp.
    - Show how the lazy refill calculation works for a request arriving at time `now`, given `last_refill_time`, `current_tokens`, the bucket `capacity`, and the `refill_rate` (tokens per second).
  - Discuss how this design keeps memory usage small under very large traffic, compared to the sliding-window implementation.
- Additional considerations:
  - How would you handle concurrency (multiple threads) accessing the same client's rate-limit state?
  - How would you clean up or evict inactive clients so the in-memory data structure does not grow without bound?
  - Briefly mention how you might extend this design to multiple application servers (e.g., by storing state in a shared store like Redis), but focus on the core token-bucket-with-lazy-refill design.
Provide a detailed, system-design level answer: data structures, algorithms, complexity, trade-offs between sliding window and token bucket, and how lazy refill works in practice.