What difficulty level is this interview question?

This is a medium-difficulty System Design question, commonly asked during onsite rounds at xAI.

What role is this question designed for?

This question is commonly asked for Software Engineer candidates at xAI during technical interviews.

Design a Rate Limiter with Per-User Token Quotas

This question evaluates a candidate's ability to design distributed, low-latency rate limiting with heterogeneous per-user quotas, covering competencies in state modeling, consistency, caching, and scalable architecture.

Design a distributed rate limiter for a high-traffic API platform, with one twist that drives the whole design: every user can have a different quota. For example, one premium user may be allowed to consume 10,000 tokens per second, while an ordinary user is limited to 100 tokens per second. The limiter must enforce each user's individual quota, decide allow/deny on every incoming request with minimal added latency, and keep working correctly when the API is served from many machines.

Walk through your design end to end: the rate-limiting algorithm, where per-user quotas are stored and how they reach the enforcement path, the data model for live counter state, the request-time decision flow, and how the system scales and degrades.
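
One common choice for the algorithm is a token bucket, sketched below for a single process. The question itself does not prescribe an algorithm, so treat this as an assumption. The names `TokenBucket`, `try_consume`, and `new_bucket` are hypothetical, and so is the policy of starting each bucket full with one second of quota as its capacity. A distributed version would keep the same state in a shared store and update it atomically.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Per-user bucket: refills at `rate` tokens/sec, holds at most `capacity`."""
    rate: float         # sustained quota in tokens per second
    capacity: float     # maximum stored tokens (burst ceiling)
    tokens: float       # tokens currently available
    last_refill: float  # timestamp of the last refill, in seconds

    def try_consume(self, cost: float, now: float) -> bool:
        # Lazy refill: credit tokens earned since the last call, capped at capacity.
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        # Never move the timestamp backwards if clocks disagree.
        self.last_refill = max(self.last_refill, now)
        if cost <= self.tokens:
            self.tokens -= cost  # a request may cost more than one token
            return True
        return False


def new_bucket(rate: float, now: float) -> TokenBucket:
    # Assumption: capacity is one second of quota and the bucket starts full.
    return TokenBucket(rate=rate, capacity=rate, tokens=rate, last_refill=now)


ordinary = new_bucket(rate=100, now=0.0)
premium = new_bucket(rate=10_000, now=0.0)
print(ordinary.try_consume(cost=60, now=0.0))   # first request fits
print(ordinary.try_consume(cost=60, now=0.0))   # second one exceeds what is left
print(premium.try_consume(cost=5_000, now=0.0))
```

Passing `now` explicitly keeps the sketch deterministic and easy to test; a real enforcement path would read a monotonic clock or the store's own time.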

Constraints & Assumptions

Tens of millions of registered users; a small fraction (say, thousands) have elevated custom quotas — assume a tiered model (free / pro / enterprise) plus per-user overrides.
Aggregate peak on the order of hundreds of thousands to ~1M rate-limit checks per second across the fleet (assumed; confirm with the interviewer).
The rate-limit decision should add no more than ~1-2 ms to request latency.
Quotas are expressed as tokens per second; a single request may cost more than one token (e.g., cost proportional to work).
Quota changes (upgrades, manual overrides) should take effect within seconds to a minute — runtime-changeable, no redeploy.
The API is served by many stateless nodes behind a load balancer; a user's requests may land on any node.
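
The tiered model with per-user overrides, together with the "seconds to a minute" propagation target, can be sketched as a resolver with a short-lived local cache. This is a minimal illustration under stated assumptions: `QuotaResolver`, `fetch`, and the tier values in `TIER_DEFAULTS` are hypothetical (only the 100 and 10,000 figures echo the example in the question), and a TTL is just one way to bound staleness; pushing invalidations is another.

```python
import time
from typing import Callable, Dict, Optional, Tuple

# Hypothetical tier defaults, in tokens per second.
TIER_DEFAULTS: Dict[str, int] = {"free": 100, "pro": 1_000, "enterprise": 10_000}

# fetch(user_id) returns (tier, override); override is None when the user has none.
FetchFn = Callable[[str], Tuple[str, Optional[int]]]


class QuotaResolver:
    """Resolves a user's quota, caching it locally for `ttl_seconds`."""

    def __init__(self, fetch: FetchFn, ttl_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache: Dict[str, Tuple[int, float]] = {}  # user_id -> (quota, expires_at)

    def quota_for(self, user_id: str) -> int:
        now = self._clock()
        cached = self._cache.get(user_id)
        if cached is not None and cached[1] > now:
            return cached[0]  # still fresh: no trip to the config store
        tier, override = self._fetch(user_id)
        # A per-user override wins over the tier default.
        quota = override if override is not None else TIER_DEFAULTS[tier]
        self._cache[user_id] = (quota, now + self._ttl)
        return quota


# Usage with an in-memory stand-in for the config store.
store = {"alice": ("free", None), "bob": ("pro", 5_000)}
resolver = QuotaResolver(fetch=lambda uid: store[uid], ttl_seconds=30.0)
print(resolver.quota_for("alice"), resolver.quota_for("bob"))
```

With a 30-second TTL, a quota change becomes visible on a given node at most 30 seconds after it is written, which sits inside the stated propagation window. An unbounded dictionary would not suit tens of millions of users; a size-bounded cache would be needed in practice.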

Clarifying Questions to Ask

Is the limit a hard guarantee (never exceed) or is small, bounded over-admission acceptable in exchange for lower latency?
What should happen when the limiter's backing store is unavailable — fail open (admit traffic) or fail closed (reject)?
Do users need burst headroom (short spikes above the sustained rate), or strict per-second smoothing?
At what layer is the limit enforced — a shared API gateway, a sidecar, or inside each service?
Is enforcement single-region, or must one user's quota be shared globally across regions?
What should a rejected caller receive — an HTTP 429 with retry guidance, queuing, or degraded service?
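
Two of these questions, the failure policy and the rejection response, map directly onto the request-time decision. The sketch below assumes a hypothetical `check` callable that returns whether the request was admitted, how many tokens were available, and the user's refill rate, and that raises a hypothetical `StoreUnavailable` when the backing store cannot be reached. Answering 503 when failing closed is an illustrative choice, not something the question specifies.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple


class StoreUnavailable(Exception):
    """Raised by `check` when the limiter's backing store cannot be reached."""


@dataclass
class Decision:
    allow: bool
    status: Optional[int] = None  # HTTP status to return when rejecting
    headers: Dict[str, str] = field(default_factory=dict)


# check(cost) -> (allowed, tokens_available, refill_rate_per_sec)
CheckFn = Callable[[float], Tuple[bool, float, float]]


def decide(check: CheckFn, cost: float, fail_open: bool) -> Decision:
    try:
        allowed, available, rate = check(cost)
    except StoreUnavailable:
        if fail_open:
            return Decision(allow=True)  # admit traffic while the store is down
        return Decision(allow=False, status=503, headers={"Retry-After": "1"})
    if allowed:
        return Decision(allow=True)
    # Seconds until enough tokens have refilled, rounded up to whole seconds.
    wait = max(1, math.ceil((cost - available) / rate))
    return Decision(allow=False, status=429, headers={"Retry-After": str(wait)})


def store_down(cost: float) -> Tuple[bool, float, float]:
    raise StoreUnavailable()


print(decide(lambda c: (False, 50.0, 100.0), cost=350, fail_open=True))
print(decide(store_down, cost=1, fail_open=False))
```

The retry hint is only meaningful when the request's cost fits within the bucket's capacity; a request that costs more than the bucket can ever hold needs a different error.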

Follow-up Questions

How would you extend the design so a user's quota is enforced globally across multiple regions without adding cross-region latency to every request?
How would you support burst allowances — e.g., a user sustained at 100 tokens/sec who may briefly spike to 1,000 — without changing the algorithm?
A user is upgraded mid-flight from 100 to 10,000 tokens/sec. Trace exactly how and when the new quota takes effect in your design, and what happens to their in-flight bucket state.
How would you rate-limit on multiple dimensions at once (per user, per IP, and per endpoint), and in what order would you evaluate them?
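
For the burst and mid-flight upgrade follow-ups, one possible answer is to keep the bucket's capacity separate from its refill rate and to settle the bucket before swapping quotas. The sketch below assumes token-bucket state; `BucketState`, `refill`, and `apply_quota_change` are hypothetical names. Keeping already-earned tokens on a change, rather than instantly filling to the new ceiling, is a policy assumption that an interviewer may want to discuss.

```python
from dataclasses import dataclass


@dataclass
class BucketState:
    rate: float        # sustained refill rate, tokens per second
    capacity: float    # burst ceiling; may be larger than `rate`
    tokens: float      # tokens currently available
    updated_at: float  # timestamp of the last refill, in seconds


def refill(state: BucketState, now: float) -> None:
    # Credit tokens earned since the last update, capped at capacity.
    elapsed = max(0.0, now - state.updated_at)
    state.tokens = min(state.capacity, state.tokens + elapsed * state.rate)
    state.updated_at = max(state.updated_at, now)


def apply_quota_change(state: BucketState, new_rate: float,
                       new_capacity: float, now: float) -> None:
    # Settle what was earned under the old quota before switching.
    refill(state, now)
    state.rate = new_rate
    state.capacity = new_capacity
    # Keep earned tokens, but never more than the new ceiling (matters on downgrade).
    state.tokens = min(state.tokens, new_capacity)


# A user sustained at 100 tokens/sec with burst headroom up to 1,000.
user = BucketState(rate=100, capacity=1_000, tokens=0.0, updated_at=0.0)
apply_quota_change(user, new_rate=10_000, new_capacity=10_000, now=2.0)
refill(user, now=2.5)
print(user.tokens)
```

Burst support needs no algorithm change here: the sustained rate stays in `rate` while `capacity` sets how far a brief spike can go. After an upgrade the user simply accrues tokens at the new rate from the moment the change is applied.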
