PracHub
QuestionsCoachesLearningGuidesInterview Prep
|Home/System Design/xAI

Design a Rate Limiter with Per-User Token Quotas

Last updated: Jul 2, 2026

Quick Overview

This question evaluates a candidate's ability to design distributed, low-latency rate limiting with heterogeneous per-user quotas, covering competencies in state modeling, consistency, caching, and scalable architecture.

  • medium
  • xAI
  • System Design
  • Software Engineer

Design a Rate Limiter with Per-User Token Quotas

Company: xAI

Role: Software Engineer

Category: System Design

Difficulty: medium

Interview Round: Onsite

# Design a Rate Limiter with Per-User Token Quotas Design a distributed rate limiter for a high-traffic API platform, with one twist that drives the whole design: **every user can have a different quota**. For example, one premium user may be allowed to consume 10,000 tokens per second, while an ordinary user is limited to 100 tokens per second. The limiter must enforce each user's individual quota, decide allow/deny on every incoming request with minimal added latency, and keep working correctly when the API is served from many machines. Walk through your design end to end: the rate-limiting algorithm, where per-user quotas are stored and how they reach the enforcement path, the data model for live counter state, the request-time decision flow, and how the system scales and degrades. ```hint Algorithm choice A **token bucket with lazy refill** handles heterogeneous quotas naturally: capacity and refill rate are just *parameters of each user's bucket*. Different users get different numbers, not a different algorithm. Refill on read — `tokens = min(capacity, tokens + elapsed * refill_rate)` — so no background timers are needed. ``` ```hint Separate the two planes Split the problem into a **config plane** (the per-user quota definitions: tier defaults plus per-user overrides, changed rarely, cached aggressively) and a **data plane** (the live bucket counters, mutated on every request, kept in a fast shared store like Redis). Most candidates blur these together and get stuck. ``` ```hint Atomicity and hot keys The read-refill-check-decrement sequence must be **atomic** (e.g., a Redis Lua script or compare-and-swap), or concurrent requests will double-spend tokens. And a user allowed $10^4$ requests/second concentrates all that traffic on a single counter key — think about splitting one logical bucket into shards or letting gateways pre-claim token batches locally. ``` ### Constraints & Assumptions - Tens of millions of registered users; a small fraction (say, thousands) have elevated custom quotas — assume a tiered model (free / pro / enterprise) plus per-user overrides. - Aggregate peak on the order of hundreds of thousands to ~1M rate-limit checks per second across the fleet (assumed; confirm with the interviewer). - The rate-limit decision should add no more than ~1-2 ms to request latency. - Quotas are expressed as tokens per second; a single request may cost more than one token (e.g., cost proportional to work). - Quota changes (upgrades, manual overrides) should take effect within seconds to a minute — runtime-changeable, no redeploy. - The API is served by many stateless nodes behind a load balancer; a user's requests may land on any node. ### Clarifying Questions to Ask - Is the limit a hard guarantee (never exceed) or is small, bounded over-admission acceptable in exchange for lower latency? - What should happen when the limiter's backing store is unavailable — fail open (admit traffic) or fail closed (reject)? - Do users need burst headroom (short spikes above the sustained rate), or strict per-second smoothing? - At what layer is the limit enforced — a shared API gateway, a sidecar, or inside each service? - Is enforcement single-region, or must one user's quota be shared globally across regions? - What should a rejected caller receive — an HTTP 429 with retry guidance, queuing, or degraded service? ### What a Strong Answer Covers ```premium-lock What a Strong Answer Covers ``` ### Follow-up Questions - How would you extend the design so a user's quota is enforced globally across multiple regions without adding cross-region latency to every request? - How would you support burst allowances — e.g., a user sustained at 100 tokens/sec who may briefly spike to 1,000 — without changing the algorithm? - A user is upgraded mid-flight from 100 to 10,000 tokens/sec. Trace exactly how and when the new quota takes effect in your design, and what happens to their in-flight bucket state. - How would you rate-limit on multiple dimensions at once (per user, per IP, and per endpoint), and in what order would you evaluate them?

Quick Answer: This question evaluates a candidate's ability to design distributed, low-latency rate limiting with heterogeneous per-user quotas, covering competencies in state modeling, consistency, caching, and scalable architecture.

Related Interview Questions

  • Build a One-Pass Data Cleaning Pipeline - xAI (medium)
  • Design a backend for an online checkers game - xAI (medium)
  • Design a multi-level API rate limiter - xAI (easy)
  • Design a follower push-notification system - xAI (hard)
  • Design a schema for server engagement - xAI (medium)
|Home/System Design/xAI

Design a Rate Limiter with Per-User Token Quotas

xAI logo
xAI
Sep 23, 2025, 12:00 AM
mediumSoftware EngineerOnsiteSystem Design
0
0

Design a Rate Limiter with Per-User Token Quotas

Design a distributed rate limiter for a high-traffic API platform, with one twist that drives the whole design: every user can have a different quota. For example, one premium user may be allowed to consume 10,000 tokens per second, while an ordinary user is limited to 100 tokens per second. The limiter must enforce each user's individual quota, decide allow/deny on every incoming request with minimal added latency, and keep working correctly when the API is served from many machines.

Walk through your design end to end: the rate-limiting algorithm, where per-user quotas are stored and how they reach the enforcement path, the data model for live counter state, the request-time decision flow, and how the system scales and degrades.

Constraints & Assumptions

  • Tens of millions of registered users; a small fraction (say, thousands) have elevated custom quotas — assume a tiered model (free / pro / enterprise) plus per-user overrides.
  • Aggregate peak on the order of hundreds of thousands to ~1M rate-limit checks per second across the fleet (assumed; confirm with the interviewer).
  • The rate-limit decision should add no more than ~1-2 ms to request latency.
  • Quotas are expressed as tokens per second; a single request may cost more than one token (e.g., cost proportional to work).
  • Quota changes (upgrades, manual overrides) should take effect within seconds to a minute — runtime-changeable, no redeploy.
  • The API is served by many stateless nodes behind a load balancer; a user's requests may land on any node.

Clarifying Questions to Ask

  • Is the limit a hard guarantee (never exceed) or is small, bounded over-admission acceptable in exchange for lower latency?
  • What should happen when the limiter's backing store is unavailable — fail open (admit traffic) or fail closed (reject)?
  • Do users need burst headroom (short spikes above the sustained rate), or strict per-second smoothing?
  • At what layer is the limit enforced — a shared API gateway, a sidecar, or inside each service?
  • Is enforcement single-region, or must one user's quota be shared globally across regions?
  • What should a rejected caller receive — an HTTP 429 with retry guidance, queuing, or degraded service?

What a Strong Answer Covers Premium

Follow-up Questions

  • How would you extend the design so a user's quota is enforced globally across multiple regions without adding cross-region latency to every request?
  • How would you support burst allowances — e.g., a user sustained at 100 tokens/sec who may briefly spike to 1,000 — without changing the algorithm?
  • A user is upgraded mid-flight from 100 to 10,000 tokens/sec. Trace exactly how and when the new quota takes effect in your design, and what happens to their in-flight bucket state.
  • How would you rate-limit on multiple dimensions at once (per user, per IP, and per endpoint), and in what order would you evaluate them?

Submit Your Answer to Earn 20XP

Sign in to leave a comment

Loading comments...

Browse More Questions

More System Design•More xAI•More Software Engineer•xAI Software Engineer•xAI System Design•Software Engineer System Design

Your design canvas — auto-saved

PracHub

Master your tech interviews with 8,000+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • AI Coding Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.