Cracking the System Design Interview: Rate Limiting Algorithms Explained

Quick Overview
A comprehensive guide to rate limiting algorithms for system design interviews. Learn the architectural trade-offs of Token Bucket, Leaky Bucket, Fixed Window, and Sliding Window algorithms to build scalable APIs and ace your FAANG system design interviews.
When designing scalable, highly available APIs, ensuring your system doesn't buckle under sudden traffic spikes or malicious DDoS attacks is non-negotiable. In a standard backend system design interview, whether you’re aiming for an L5 position at Google or an E5 role at Meta, the interviewer will often probe your understanding of rate limiting.
Rate limiting is the practice of restricting the number of requests a user (or service) can make to an API within a specified time window. In this technical deep-dive, we will explore the core algorithms used to implement rate limiters, breaking down their architectural trade-offs, space-time complexities, and production use cases.
1. Token Bucket Algorithm
The Token Bucket algorithm is arguably the most common rate-limiting pattern in the industry, heavily utilized by companies like Amazon and Stripe to throttle API requests.
How It Works
Imagine a bucket with a fixed capacity. Tokens are added to the bucket at a fixed, predefined rate. When an API request comes in, the system checks if the bucket contains at least one token. If it does, a token is removed, and the request is processed. If the bucket is empty, the request is dropped with an HTTP 429 Too Many Requests status code.
Architectural Trade-offs
- Pros: It's highly memory efficient and allows for burst traffic. If the bucket is full, a sudden burst of requests, up to the bucket's capacity, can be served immediately until the bucket is drained.
- Cons: Tuning the bucket size and refill rate for distributed systems can be complex, often requiring distributed caches like Redis to synchronize state across multiple API gateway nodes.
```python
import time

class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity                  # maximum tokens the bucket can hold
        self.tokens = capacity                    # start with a full bucket
        self.refill_rate = refill_rate            # tokens added per second
        self.last_refill_timestamp = time.time()

    def allow_request(self) -> bool:
        self._refill()
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

    def _refill(self):
        # Lazily top up the bucket based on the time elapsed since the last refill.
        now = time.time()
        time_passed = now - self.last_refill_timestamp
        new_tokens = time_passed * self.refill_rate
        if new_tokens > 0:
            self.tokens = min(self.capacity, self.tokens + new_tokens)
            self.last_refill_timestamp = now
```
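A quick usage sketch of the burst behavior (the class is repeated here so the snippet runs standalone; the specific capacity and refill rate are illustrative):

```python
import time

class TokenBucket:
    # Same class as above, repeated so this snippet is self-contained.
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity
        self.tokens = capacity
        self.refill_rate = refill_rate
        self.last_refill_timestamp = time.time()

    def allow_request(self) -> bool:
        self._refill()
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

    def _refill(self):
        now = time.time()
        new_tokens = (now - self.last_refill_timestamp) * self.refill_rate
        if new_tokens > 0:
            self.tokens = min(self.capacity, self.tokens + new_tokens)
            self.last_refill_timestamp = now

# A 3-token bucket refilling at 1 token/sec: the first three requests of a
# back-to-back burst pass immediately, the fourth is throttled (HTTP 429).
bucket = TokenBucket(capacity=3, refill_rate=1)
burst = [bucket.allow_request() for _ in range(4)]
print(burst)  # [True, True, True, False]
```

In a distributed deployment, the per-client token count and refill timestamp would live in a shared store such as Redis rather than in process memory.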
2. Leaky Bucket Algorithm
While similar to the Token Bucket, the Leaky Bucket focuses on processing requests at a strictly constant rate.
How It Works
Requests are added to a queue (the bucket). If the queue is full, the new requests are discarded. A background worker pulls requests from the queue at a fixed, steady rate and processes them.
Architectural Trade-offs
- Pros: It smooths out traffic bursts, preventing backend databases or downstream services from being overwhelmed. It provides extremely predictable processing rates.
- Cons: A large burst of requests can fill up the queue, causing newer, potentially critical requests to be dropped while older requests slowly drain out. Unlike Token Bucket, it does not handle burst traffic efficiently.
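A minimal single-process sketch of the queue-and-drain mechanics described above (class and method names are illustrative; a real implementation would run the drain step in a background worker firing at the fixed leak rate):

```python
from collections import deque

class LeakyBucket:
    def __init__(self, capacity, leak_rate):
        self.capacity = capacity      # maximum number of queued requests
        self.leak_rate = leak_rate    # requests drained per second (scheduling omitted here)
        self.queue = deque()

    def enqueue(self, request) -> bool:
        # Drop the request if the bucket (queue) is already full.
        if len(self.queue) >= self.capacity:
            return False
        self.queue.append(request)
        return True

    def leak(self):
        # Invoked by the background worker at the fixed rate:
        # pull the oldest request off the queue for processing.
        if self.queue:
            return self.queue.popleft()
        return None

bucket = LeakyBucket(capacity=2, leak_rate=1)
accepted = [bucket.enqueue(f"req-{i}") for i in range(3)]
print(accepted)  # [True, True, False]: the third request overflows the queue
```

Note how the output rate is governed entirely by how often leak() is called, which is what gives the algorithm its strictly constant processing rate.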
3. Fixed Window Counter
This is the simplest algorithm to implement, partitioning time into discrete, fixed windows (e.g., 12:00:00 to 12:01:00) and assigning a counter to each window.
Architectural Trade-offs
- Pros: Extremely low memory footprint and simple to implement in Redis using the INCR command and key expirations.
- Cons: It suffers from the boundary condition spike. If your limit is 100 requests per minute, a user could send 100 requests at 12:00:59 and another 100 at 12:01:01, effectively pushing 200 requests to your backend within a two-second interval.
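The counter-per-window idea can be sketched in-process, with a dict standing in for Redis keys (no key expiration here for brevity; the class name and explicit now parameter are illustrative):

```python
import time

class FixedWindowCounter:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self.counters = {}  # window index -> request count (Redis key per window)

    def allow_request(self, now=None) -> bool:
        now = time.time() if now is None else now
        # Bucket the timestamp into its fixed window (e.g. 12:00:00-12:00:59).
        window = int(now // self.window_seconds)
        count = self.counters.get(window, 0)
        if count >= self.limit:
            return False
        self.counters[window] = count + 1  # the equivalent of Redis INCR
        return True

rl = FixedWindowCounter(limit=2, window_seconds=60)
# Two requests at the end of a window exhaust it...
print(rl.allow_request(now=59), rl.allow_request(now=59.5))  # True True
print(rl.allow_request(now=59.9))                            # False
# ...but the counter resets the moment a new window starts: the boundary spike.
print(rl.allow_request(now=60.1))                            # True
```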
4. Sliding Window Log & Counter
To solve the boundary issue of the Fixed Window, the Sliding Window Log stores the timestamp of every request. When a new request arrives, outdated timestamps are evicted, and the system counts the log size. If it exceeds the limit, the request is blocked.
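The evict-then-count logic can be sketched with a deque of timestamps (names are illustrative; in production the log would typically live in a Redis sorted set keyed per client):

```python
import time
from collections import deque

class SlidingWindowLog:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self.log = deque()  # timestamps of accepted requests, oldest first

    def allow_request(self, now=None) -> bool:
        now = time.time() if now is None else now
        # Evict timestamps that have slid out of the window.
        while self.log and self.log[0] <= now - self.window_seconds:
            self.log.popleft()
        if len(self.log) >= self.limit:
            return False
        self.log.append(now)
        return True

rl = SlidingWindowLog(limit=2, window_seconds=60)
# Requests at t=0 and t=1 fill the window, so t=30 is blocked;
# by t=61 both old timestamps have slid out and capacity is restored.
print(rl.allow_request(now=0), rl.allow_request(now=1))  # True True
print(rl.allow_request(now=30))                          # False
print(rl.allow_request(now=61))                          # True
```

Unlike the Fixed Window, there is no boundary spike here, at the cost of storing one timestamp per accepted request.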
The Sliding Window Counter is a hybrid approach that uses a weighted average of the previous window's request count and the current window's request count to approximate the rate, heavily reducing the massive memory requirements of the Sliding Window Log.
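One way to sketch the weighted-average approximation (the dict stands in for per-window Redis counters; stale windows are never evicted here, which a real implementation would handle):

```python
class SlidingWindowCounter:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.counts = {}  # window index -> request count

    def allow_request(self, now) -> bool:
        idx = int(now // self.window)
        prev = self.counts.get(idx - 1, 0)
        curr = self.counts.get(idx, 0)
        # Fraction of the previous window still overlapping the sliding window.
        overlap = 1 - (now % self.window) / self.window
        estimated = prev * overlap + curr
        if estimated >= self.limit:
            return False
        self.counts[idx] = curr + 1
        return True

rl = SlidingWindowCounter(limit=2, window_seconds=60)
print(rl.allow_request(now=0), rl.allow_request(now=30))  # True True
# Just after the window rolls over, the previous window still carries
# nearly full weight, so the estimate stays at the limit and blocks.
print(rl.allow_request(now=60))                           # False
# Near the end of the next window, the old count has decayed away.
print(rl.allow_request(now=119))                          # True
```

Only two counters per client are needed instead of one timestamp per request, which is why this is a popular compromise in practice.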
Nailing the Interview with PracHub
Understanding these algorithms in theory is only half the battle; knowing how to implement them on a whiteboard and justify your architectural decisions under pressure is where candidates fail.
This is where PracHub comes in. Our platform allows you to conduct highly realistic, structured mock interviews with verified FAANG engineers. Whether you need to practice coding a Redis-backed Sliding Window Counter or architecting a globally distributed API Gateway, PracHub's real-time collaborative environments and expert peer matching ensure you are battle-tested before the real thing. Stop reading theory and start practicing with engineers who have designed these systems in production.