Design scalable rate limiter with token bucket
Company: Altruist
Role: Software Engineer
Category: System Design
Difficulty: hard
Interview Round: Technical Screen
You are asked to design and implement a rate limiter for an HTTP API.
### Requirements
- Limit each client (e.g., user ID or API key) to at most `N` requests per `T` seconds.
- The rate limiter should run in memory (assume a single application instance to start).
- It must be:
- Low latency (per-request check must be fast).
- Memory efficient even when there are many distinct clients and very high traffic.
- Easy to extend later to a distributed setting.
The interviewer guides you through the following steps:
1. **Baseline design (sliding window):**
- Describe how you would implement a basic sliding-window rate limiter.
- Be explicit about what data you store per client, how you decide whether to allow or reject a request given its timestamp, and the time and space complexity.
2. **Scalability concern:**
- Explain what happens when traffic becomes extremely large (e.g., millions of clients, each sending many requests per second).
- Why might the naive sliding-window implementation run into memory and performance issues?
3. **Improved design (token bucket with lazy refill):
- Redesign the rate limiter using a token bucket algorithm so that memory usage stays small and per-request operations remain O(1).
- You should:
- Describe the conceptual model of a token bucket (capacity, refill rate, token consumption on each request).
- Specify exactly what you store per client (e.g., current token count, last refill timestamp, etc.).
- Explain how to perform a **lazy refill**: instead of refilling tokens every second via a background job or timer, you only recalculate and refill tokens when a request for that client arrives, using the request's timestamp.
- Show how the lazy refill calculation works for a request arriving at time `now`, given `last_refill_time`, `current_tokens`, bucket `capacity`, and `refill_rate` (tokens per second).
- Discuss how this design solves the large-data, low-memory problem compared to the sliding-window implementation.
4. **Additional considerations:**
- How would you handle concurrency (multiple threads) accessing the same client's rate-limit state?
- How would you clean up or evict inactive clients so the in-memory data structure does not grow without bound?
- Briefly mention how you might extend this design to multiple application servers (e.g., by storing state in a shared store like Redis), but focus on the core token-bucket-with-lazy-refill design.
Provide a detailed, system-design level answer: data structures, algorithms, complexity, trade-offs between sliding window and token bucket, and how lazy refill works in practice.
Quick Answer: This question evaluates understanding of rate-limiting algorithms and scalability concepts, specifically comparing sliding-window and token-bucket approaches with attention to lazy refill, per-client state modeling, memory efficiency, concurrency, and extendability to distributed systems.