PracHub

Design GPU credit allocator

Last updated: Apr 12, 2026

Quick Overview

This question evaluates system design competencies in distributed resource accounting, real-time billing, fault tolerance, rate limiting, and fair scheduling for multi-tenant GPU compute platforms.


Company: OpenAI

Role: Software Engineer

Category: System Design

Difficulty: hard

Interview Round: Technical Screen

##### Question

Design a GPU credits allocation system that tracks user GPU consumption, deducts credits in real time, supports credit top-ups, rate limiting, and fair usage enforcement across multiple GPU nodes.


Solution

# Overview and Key Concepts

This design uses a central, strongly consistent credit ledger with short-lived credit "leases" to enable real-time deduction and prevent double-spend across many GPU nodes. Nodes stream usage to a metering pipeline; a reconciler ensures the ledger matches actual usage. Rate limiting and fair usage are enforced at admission and during execution.

Key terms:

- Credit balance B(u): the user's available credits.
- Price p(type): price per GPU-second for a GPU type.
- Usage rate r = g × p(type), where g is the number of GPUs allocated to a job.
- Lease window L seconds: the system authorizes up to r × L credits for a job, deducted immediately; extensions renew as the job runs.
- Token bucket for rate limits: spend-rate and concurrency caps.

Small numeric example:

- p(H100) = 0.01 credits/GPU-second.
- A user runs 2 GPUs for 10 minutes: 2 × 600 × 0.01 = 12 credits.
- With L = 15 s, each extension deducts 2 × 15 × 0.01 = 0.3 credits.

# Architecture

Components:

1. Auth & Account Service: user identity, account state, limits.
2. Pricing Service: maps GPU type → price per GPU-second; applies effective discounts.
3. Credit Ledger Service (strong consistency): maintains balances, immutable ledger entries, and active holds/leases. Backed by a transactional DB (e.g., PostgreSQL/CockroachDB/Spanner). All monetary mutations are idempotent via request_id.
4. Rate Limiter & Quota Service: per-user token buckets (spend/sec), concurrency limits (max GPUs), and daily caps. Backed by Redis/Redis Cluster with Lua for atomic ops.
5. Scheduler/Admission Controller: decides whether to start or continue jobs based on rate limits and credit leases; integrates with the cluster scheduler.
6. Node Agent (on each GPU node): measures actual GPU usage (per process/pod), heartbeats usage every t seconds, requests lease extensions, and enforces throttling/termination if a lease cannot be extended.
7. Usage Pipeline: agents send usage events to Kafka/PubSub; an Aggregator normalizes and rolls up usage.
8. Reconciler: reconciles rolled-up usage with provisional charges; posts adjustments to the ledger if needed.
9. Top-up Service: processes payments and applies credits; publishes events for notifications.
10. Observability: metrics, logs, alerts; ledger invariants and drift monitors.

# Data Model (simplified)

- users(id, status, limits, created_at)
- balances(user_id, available_credits, updated_at) — single row per user; strong consistency
- ledger_entries(id, user_id, type: [topup|debit|refund|adjustment], amount, request_id, created_at)
- leases(id, user_id, job_id, gpu_type, gpu_count, rate_per_sec, expires_at, amount_reserved, status)
- usage_events(id, user_id, job_id, node_id, gpu_type, gpu_count, start_ts, end_ts, seconds, measured_utilization, seq)
- price_table(gpu_type, price_per_sec, effective_from)
- limits(user_id, max_concurrent_gpus, max_spend_per_sec, daily_cap, weight)

Notes:

- All ledger mutations carry a unique request_id for idempotency.
- balances.available_credits should be updated only by transactional operations that also insert a ledger_entry.

# Real-Time Deduction with Leases

Goal: allow jobs on many nodes to consume credits without double-spend and with bounded exposure during failures.

Mechanism:

1. Admission: for a new job with gpu_count g and gpu_type t, compute r = g × p(t). Check rate limits and balance.
2. Create a lease for window L seconds: atomically deduct r × L from the balance and record a lease row with expires_at = now + L.
   - If balance < r × L, reject or place the job in a waiting queue.
3. The node agent runs the job and sets a heartbeat timer every t (< L) seconds.
4. Extension: each heartbeat attempts to extend the lease by Δ = min(t, L) seconds:
   - Atomically: deduct r × Δ, push expires_at forward by Δ, and append a ledger_entry (debit) with request_id = (job_id, seq).
   - If deduction fails (insufficient funds or a rate-limit breach), the agent is instructed to throttle or gracefully terminate (configurable grace_g seconds, using remaining time in the lease).
5. Completion: when the job ends, close the lease and release any unused reserved time if the implementation pre-reserves more than is consumed.

Two implementation patterns:

- Strict pay-as-you-go: do not pre-reserve beyond the next Δ; deduct each extension immediately. Exposure = r × (grace_g + one extension).
- Reserve-then-burn: pre-reserve r × L upfront to reduce extension-path latency; burn down the reservation as usage arrives. On completion, refund the unused reserved portion. This lowers ledger write QPS but requires careful reconciliation.

Atomicity options:

- SQL row-level locks: UPDATE balances SET available = available - x WHERE user_id = ? AND available >= x; plus INSERT ledger_entry ... in the same transaction.
- Redis with a Lua script: atomic balance check, decrement, and append to a write-ahead queue for persistence.

# Rate Limiting

Use token buckets and counters, enforced at admission and on lease extension:

- Concurrency cap: current_gpus(user) + g ≤ max_concurrent_gpus.
- Spend-rate cap: per-user bucket with capacity C and fill rate F = max_spend_per_sec. Each extension consumes r × Δ tokens; if unavailable, throttle.
- Daily cap: track sum(ledger.debit) for the day; deny new leases once exceeded.

Data placement: use a Redis hash per user for counters; use TTL keys to track active jobs. Lua scripts ensure atomic increments/decrements.

# Fair Usage Across Users

When GPUs are scarce, ensure fairness via the scheduler:

- Per-user share: weight(u) defines a target fraction of GPUs (e.g., equal weights by default). The scheduler admits jobs so that active_gpus(u)/weight(u) stays roughly balanced across users.
- Dominant Resource Fairness (DRF) if CPU/memory are also constrained.
- Backpressure: if a user is at or above their fair share, queue or preempt low-priority jobs.
- Priority classes: paying tiers map to higher weights and may preempt lower tiers within policy.

Coordination:

- The scheduler periodically computes target shares using cluster state and user weights.
- The Admission Controller consults credits (lease), rate limits, and fair-share state before starting a pod.

# Failure Modes and Guardrails

- Node/agent crash: no new extensions occur; the lease expires at expires_at, bounding overspend to r × (remaining_lease + grace_g). The reconciler may refund any overcharge if the job actually stopped earlier.
- Ledger outage: allow short offline operation via a local budget cache per node (a small on-box escrow, e.g., 1–2 extension windows). On reconnect, reconcile and stop jobs if overspent.
- Network partitions: leases prevent double-spend; extensions fail when the ledger is unreachable beyond the escrow.
- Double charge/double spend: prevented via idempotent request_ids and atomic ops. Usage events carry monotonic seq numbers per job for deduplication.
- Clock skew: rely on server-side timestamps for leases; agents include both wall-clock and monotonic times for diagnostics, but the ledger is the source of truth.
- Price changes: price_table is versioned by effective_from; leases record the price at time of debit; the reconciler uses the recorded rate.

# Reconciliation Pipeline

- Agents send usage_events every t seconds and at job end, with measured seconds and GPU type/count.
- The Aggregator rolls up to per-job intervals.
- The Reconciler compares rolled-up actual cost vs. provisional debits: adjustment = actual_cost - sum(provisional_debits).
  - If positive: create a debit adjustment (rare if measurement undercounts).
  - If negative: create a refund credit to the user.
- Drift alerting fires if |adjustment| exceeds a threshold per job or per user.

# APIs (idempotent)

1. POST /topups {amount, request_id} → {new_balance}
2. GET /balance → {available, holds}
3. POST /leases {job_id, gpu_type, gpu_count, window_seconds, request_id} → {lease_id, rate_per_sec, expires_at}
4. POST /leases/{lease_id}/extend {delta_seconds, request_id} → {expires_at}
5. POST /leases/{lease_id}/close {request_id} → {finalized}
6. POST /usage {job_id, seq, gpu_type, gpu_count, start_ts, end_ts}
7. GET /limits → {concurrency, spend_rate, daily_cap}

All write APIs require a request_id for idempotency and return the same result if retried.

# Scalability

- Shard the ledger by user_id; keep one primary row per user to avoid hot-spotting.
- Use append-only ledger_entries and periodic compaction/snapshots for fast balance reads.
- Batch extensions: coalesce per-job debits up to a maximum Δ (e.g., 5–15 s) to reduce write QPS.
- Redis Cluster for the rate limiter; hash keys on user_id to ensure locality.
- Partition Kafka by user_id for usage events to maintain per-job order.

# Security and Integrity

- Signed lease tokens (JWT) returned to nodes include lease_id, rate, and expires_at; agents cannot mint tokens.
- mTLS between agents and the control plane.
- Least-privilege service accounts; WAF on public APIs.

# Example Walkthrough

Given: user balance B = 50 credits, p(H100) = 0.01, and a job using g = 4 GPUs:

- r = 4 × 0.01 = 0.04 credits/s.
- L = 15 s, Δ = 5 s heartbeats.

Flow:

1. Admission: concurrency and spend-rate checks pass. Create a lease and deduct 0.04 × 15 = 0.6 credits. New balance: 49.4.
2. After 5 s: extend by 5 s, deduct 0.2 credits → balance 49.2.
3. Repeat until the job ends. If the job runs 8 minutes (480 s), total cost = 0.04 × 480 = 19.2 credits. The ledger shows ~96 debits of 0.2 plus the initial 0.6, minus a refund of any unused reserved portion if applicable.
4. If the balance would drop below the required extension, the agent gets a NACK; it uses the remaining lease time and then stops. Max overspend is bounded by one Δ plus any grace_g.

# Testing and Validation

- Unit tests for atomic operations and idempotency under concurrent requests.
- Fault injection: kill ledger nodes, partition networks, kill agents; verify bounded exposure and reconciliation.
- Property tests for ledger invariants: sum(ledger) == balances, no negative balances, request_id uniqueness.
- Load tests: simulate 10k jobs with Δ = 5 s; ensure ledger write QPS and tail latencies meet SLOs.

# Alternatives and Trade-offs

- Postpaid billing: simpler runtime but requires credit-risk management; unsuitable if strict prepay is required.
- Longer leases reduce write QPS but increase exposure; choose L based on risk appetite and latency.
- Full central-scheduler fairness vs. per-node local fairness: central is more accurate; local is more resilient.

# Summary

Use a strongly consistent credit ledger with short, renewable leases to gate execution and deduct in near real time. Enforce per-user rate limits and fair share at admission and during execution. Stream usage to an immutable pipeline and reconcile to ensure correctness and auditability. Bound overspend via lease expiry and small on-node escrows, and harden with idempotent APIs, atomic mutations, and robust observability.

OpenAI · Aug 4, 2025, 10:55 AM · Software Engineer · Technical Screen · System Design

System Design: GPU Credits Allocation and Fair Usage

Context

You are designing a multi-tenant platform that provides access to GPU compute across many nodes. Users pre-purchase credits and are charged based on GPU usage (e.g., per GPU-second). The system must track consumption in near real time, prevent overspending, support credit top-ups, and enforce fair usage and rate limits across the fleet.
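The per-GPU-second charging model can be made concrete with a small sketch. Prices here are illustrative assumptions, and amounts are kept in integer millicredits so billing arithmetic avoids floating-point drift:

```python
# Prices in integer millicredits per GPU-second; values are illustrative
# assumptions, not real prices.
PRICE_MILLI = {"A100": 5, "H100": 10}

def job_cost_milli(gpu_type: str, gpu_count: int, seconds: int) -> int:
    """Total charge, in millicredits, for gpu_count GPUs of gpu_type over `seconds`."""
    return gpu_count * seconds * PRICE_MILLI[gpu_type]

# 2 H100s for 10 minutes: 2 x 600 x 10 = 12_000 millicredits = 12 credits.
print(job_cost_milli("H100", 2, 600) / 1000)  # 12.0
```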

Assume:

  • Multiple GPU types (A100, H100, etc.) with different prices.
  • Jobs can run on one or more GPUs and can migrate or be rescheduled.
  • The system must continue operating under node/agent failures and network partitions with bounded exposure.

Requirements

  1. Track per-user GPU consumption across nodes and time.
  2. Deduct credits in real time (seconds-level), preventing double-spend across nodes.
  3. Support credit top-ups (payments) and immediate balance visibility.
  4. Enforce rate limits: e.g., max concurrent GPUs, spend rate per second, daily caps.
  5. Enforce fair usage across users (no one user can starve others) when resources are scarce.
  6. Fault tolerance: handle node/agent/ledger outages; guarantee that any overspend is bounded.
  7. Auditable ledger: idempotent, immutable records; reconcile provisional vs final charges.
  8. APIs for balance, reserve/authorize, consume, top-up, and usage reporting.
  9. Scalability to thousands of nodes and tens of thousands of concurrent jobs.
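Requirement 4's spend-rate cap is typically implemented as a per-user token bucket. A minimal in-memory sketch (capacity and fill rate are assumptions; a production system would keep this state in Redis and update it atomically, e.g., via a Lua script):

```python
import time
from typing import Optional

class SpendBucket:
    """Per-user token bucket for a spend-rate cap; tokens are millicredits."""

    def __init__(self, capacity: int, fill_per_sec: int, now: Optional[float] = None):
        self.capacity = capacity
        self.fill_per_sec = fill_per_sec
        self.tokens = float(capacity)  # start full
        self.last = time.monotonic() if now is None else now

    def try_spend(self, cost: int, now: Optional[float] = None) -> bool:
        """Refill based on elapsed time, then consume `cost` tokens if available."""
        t = time.monotonic() if now is None else now
        self.tokens = min(self.capacity, self.tokens + (t - self.last) * self.fill_per_sec)
        self.last = t
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller should throttle the job

bucket = SpendBucket(capacity=500, fill_per_sec=100, now=0.0)
print(bucket.try_spend(400, now=0.0))  # True: the bucket starts full
print(bucket.try_spend(400, now=1.0))  # False: only 200 tokens after 1 s of refill
print(bucket.try_spend(400, now=3.0))  # True: refill reaches 400 by t = 3 s
```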

Deliverables

  • High-level architecture with key components and their responsibilities.
  • Data model for accounts, balances, holds/leases, usage events, and pricing.
  • Real-time deduction mechanism across multiple nodes (prevent double-spend).
  • Rate limiting and fair scheduling approach.
  • Failure handling and reconciliation strategy.
  • Small numeric example to illustrate charging and limits.
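For the fair-scheduling deliverable, one common building block is weighted fair share: among queued users, admit the one with the smallest normalized usage active_gpus/weight. A hedged sketch (user names and weights are assumptions; a real scheduler would layer on priority classes and DRF):

```python
from typing import Optional

def next_user(active_gpus: dict, weights: dict, queued: set) -> Optional[str]:
    """Pick the queued user with the lowest normalized share active_gpus/weight."""
    candidates = [u for u in queued if weights.get(u, 0) > 0]
    if not candidates:
        return None
    return min(candidates, key=lambda u: active_gpus.get(u, 0) / weights[u])

# alice already holds 4 GPUs, bob holds 1; with equal weights, bob goes next.
print(next_user({"alice": 4, "bob": 1}, {"alice": 1.0, "bob": 1.0}, {"alice", "bob"}))  # bob
```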

