PracHub

Design GPU credit allocator

Last updated: Jun 15, 2026

Quick Overview

A multi-tenant GPU credit allocation system design (OpenAI software-engineer system-design screen): track and deduct GPU credits in real time across many nodes, issue/transfer/spend credits via idempotent APIs, and enforce budgets, hierarchical quotas, rate limits, and fair-share with scheduler integration. Covers prepaid vs postpaid, an immutable double-entry ledger with reservations, overspend prevention, failure recovery, scaling, and observability.

  • hard
  • OpenAI
  • System Design
  • Software Engineer


Company: OpenAI

Role: Software Engineer

Category: System Design

Difficulty: hard

Interview Round: Technical Screen

##### Question

Design a GPU credit allocation system for a multi-tenant compute platform (e.g., OpenAI's GPU infrastructure). Organizations are granted GPU credits (for example, a monthly allowance), and users run jobs that consume those credits across many GPU nodes. Your design should cover:

1. Track real-time GPU consumption per job and deduct credits as work is performed, across multiple GPU nodes.
2. Define APIs to issue/grant credits, transfer credits between scopes (org -> project -> user), top up credits, query balances, and spend/settle credits for jobs.
3. Enforce budgets, per-user/per-project quotas, and rate limits in real time under high concurrency, preventing double-spend and overspend.
4. Integrate with a job scheduler / admission controller to admit, throttle, or reject workloads based on available credits and limits.
5. Support both prepaid and postpaid billing models.
6. Enforce fair usage when GPUs are scarce (fair-share across users/orgs).
7. Address idempotency, the consistency model, data model, failure recovery, scaling (partitioning, caching), audit logging/reporting, and observability/alerting.


Solution

## GPU Credit Allocator — Design

### 1. Problem framing, assumptions, and scale

We are building the **metering, budgeting, and admission** layer that sits between a multi-tenant GPU fleet and the jobs that run on it. Credits are an internal unit of account (1 credit = a fixed amount of GPU-time at a reference price). The system must deduct credits as work happens across thousands of nodes, never let a tenant overspend or double-spend, and tell the scheduler whether a job may start, continue, throttle, or stop.

**Assumptions I'll state up front (and would confirm with the interviewer):**

- Account hierarchy is **org → project → user**. Credits live at any scope; a job is attributed to a `(user, project, org)` triple.
- A *job* occupies `g` GPUs of type `t` for some duration. Billing granularity is GPU-seconds.
- Target scale (illustrative, to size the design): ~10⁵ concurrent jobs, ~10⁵–10⁶ GPUs, tens of thousands of orgs. Heartbeat interval 15–60 s.
- **Money correctness > availability for writes.** A reservation/spend must be strongly consistent. Balance *reads* for dashboards may be slightly stale. Metering may be eventually consistent and reconciled. The one explicit, bounded exception to write-strictness is the on-node escrow during a ledger outage (§12), disclosed there as a deliberate availability tradeoff, not a hidden one.

**Write-QPS back-of-envelope (justifies the batching/partitioning choices below):** 10⁵ jobs heartbeating every 30 s ≈ **3,300 settlement writes/s**. With per-org sharding and per-job coalescing this is easily absorbed; without batching, a naive per-second deduction would be ~10⁵/s and would hot-spot the ledger. This number is why we coalesce.

### 2. Core idea

1. **An immutable, append-only ledger is the single source of truth, and we apply double-entry bookkeeping to it.** Balances are a *derived* materialized view of the ledger. This gives auditability and makes drift detectable.
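Item 1 can be illustrated with a minimal Python sketch. Everything in it is an assumption for illustration: integer credit amounts, the `Entry` type, and account names such as `org:acme:available`. The ledger is a list of frozen entries, balances are recomputed by replaying it, and drift is any account whose cached balance disagrees with the replay. The sign convention follows this document: a debit decreases an account and a credit increases it.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """One immutable ledger line (hypothetical type): `amount` leaves one account, enters another."""
    entry_id: int
    debit_account: str
    credit_account: str
    amount: int  # integer credits keep the replay exact


def derive_balances(ledger):
    """Replay the append-only ledger into per-account balances (the derived view)."""
    balances = defaultdict(int)
    for e in ledger:
        balances[e.debit_account] -= e.amount   # debit decreases
        balances[e.credit_account] += e.amount  # credit increases
    return dict(balances)


def find_drift(ledger, cached):
    """Return {account: (cached, derived)} wherever a cached balance disagrees with the replay."""
    derived = derive_balances(ledger)
    return {
        acct: (cached.get(acct, 0), derived.get(acct, 0))
        for acct in sorted(set(derived) | set(cached))
        if cached.get(acct, 0) != derived.get(acct, 0)
    }


# Hypothetical accounts: an issuance, then an org -> project transfer.
ledger = [
    Entry(1, "system", "org:acme:available", 10_000),
    Entry(2, "org:acme:available", "proj:p:available", 6_000),
]
balances = derive_balances(ledger)
assert sum(balances.values()) == 0  # each entry balances, so all accounts sum to zero

stale = dict(balances)
stale["org:acme:available"] = 4_100  # a cached projection that missed an update
print(find_drift(ledger, stale))     # {'org:acme:available': (4100, 4000)}
```

Because the cached view can always be rebuilt from the entries, a mismatch is detectable rather than silent; at scale the replay would start from a snapshot rather than from the first entry.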
**What "double-entry" means precisely here** matters, because two different kinds of movement live in this system:

- *Genuine two-account moves* — `issue`, `topup`, `transfer` — debit one account and credit another (e.g. `system → org`, `org → project`). These are textbook double-entry: `debit_account_id ≠ credit_account_id`.
- *Intra-account state transitions* — `reserve`, `settle`, `release`, `refund` — move credit between the **available / reserved / spent buckets of a single account**. To keep these honest double-entry rather than ad-hoc column edits, we model the three buckets as **contra sub-accounts of the same account** (`acct:available`, `acct:reserved`, `acct:spent`). Then `reserve` is `debit acct:available, credit acct:reserved`; `settle` is `debit acct:reserved, credit acct:spent`. Each is a balanced pair, so `sum(debits) = sum(credits)` holds ledger-wide for *every* entry type, and the §12 audit `sum(ledger) == balances` is a true double-entry trial balance, not just a checksum. The `balances` row's `available / reserved / spent` columns are simply the cached projection of those three contra sub-accounts; the column mutation shown in §7 is the *fast path* that updates the projection, while the paired ledger entry is what makes it auditable.

2. **Short-lived reservations (leases/holds) gate execution.** Before a job runs we move credits `available → reserved`. As it runs we settle `reserved → spent` in small slices. Bounded lease windows cap the worst-case loss if a node or service crashes.
3. **Budgets, quotas, rate limits, and fair-share are separate controls** checked at admission and on every lease extension. They answer different questions:
   - *Budget* (ledger balance): do you have credits?
   - *Quota* (hierarchical caps): are you within your org/project/user allotment and concurrency cap?
   - *Rate limit* (token bucket): are you spending too fast / bursting?
   - *Fair-share* (scheduler): when GPUs are scarce, is it your turn?
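The four gates in item 3 can be sketched as independent predicates. This is a simplified illustration, not a scheduler: the `AdmissionRequest` fields are hypothetical, the rate-limit gate assumes the admission hold must be covered by bucket tokens, and fair-share is reduced to a single GPU allotment supplied by the scheduler.

```python
from dataclasses import dataclass


@dataclass
class AdmissionRequest:
    # Hypothetical fields; units are credits and GPUs.
    hold: int                 # r * L credits the job must reserve
    requested_gpus: int
    available: int            # budget: ledger balance headroom
    active_gpus: int          # quota: GPUs this tenant already holds
    max_concurrent_gpus: int
    bucket_tokens: int        # rate limit: tokens left in the tenant's bucket
    fair_share_gpus: int      # fair-share: scheduler's current allotment
    gpus_scarce: bool


def admission_decision(req: AdmissionRequest):
    """Check the four gates independently; admit only if none fails."""
    failed = []
    if req.available < req.hold:
        failed.append("budget")
    if req.active_gpus + req.requested_gpus > req.max_concurrent_gpus:
        failed.append("quota")
    if req.bucket_tokens < req.hold:
        failed.append("rate_limit")
    if req.gpus_scarce and req.active_gpus + req.requested_gpus > req.fair_share_gpus:
        failed.append("fair_share")
    return not failed, failed


# A credit-rich tenant already above its fair share while GPUs are scarce.
rich = AdmissionRequest(hold=30, requested_gpus=8, available=10_000, active_gpus=24,
                        max_concurrent_gpus=64, bucket_tokens=500,
                        fair_share_gpus=16, gpus_scarce=True)
# A tenant well within fair share but short of credits.
poor = AdmissionRequest(hold=30, requested_gpus=1, available=10, active_gpus=0,
                        max_concurrent_gpus=64, bucket_tokens=500,
                        fair_share_gpus=16, gpus_scarce=True)
print(admission_decision(rich))   # (False, ['fair_share'])
print(admission_decision(poor))   # (False, ['budget'])
```

Collecting every failing gate, rather than stopping at the first, keeps the denial reason available for metrics and for the scheduler's choice to queue, throttle, or reject.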
A job needs **all four** to pass. Critically, fair-share is *orthogonal* to credits: a tenant can have credits but be throttled for fairness, or be at fair-share but blocked for lack of credits.

**The invariant that makes overspend auditable** (per account per period), expressed over the three buckets:

$$\text{available} + \text{reserved} + \text{spent} = \text{issued} + \text{topups}$$

Here `issued` is net of transfers: for any account it counts grants plus transfers in minus transfers out, so internal transfers cancel at the org roll-up. The identity is the same for prepaid and postpaid; only the floor differs. Any reconciliation or audit checks that it holds and that `available ≥ 0` for prepaid (or `available ≥ -credit_limit` for postpaid). This makes overspend impossible to hide *on the authoritative ledger path*; the only way credit can be consumed without first clearing this check is the disclosed on-node escrow window (§12), which is itself bounded and reconciled.

### 3. Key quantities

- `p(t, region)` — price in credits per GPU-second for GPU type `t`. **Versioned** (`effective_from`/`effective_to`) and **pinned onto the reservation** so a mid-job price change can't retroactively alter a running job.
- Job spend rate: $r = g \times p(t)$ credits/second.
- Lease window `L` seconds; heartbeat interval `h`, with `h < L`. On admission we hold $r \times L$; each heartbeat settles the slice actually consumed and re-extends the lease.

*Example:* `p(H100) = 0.01` credits/GPU-s, 2 GPUs for 10 min ⇒ $2 \times 600 \times 0.01 = 12$ credits. With `L = 30 s`, each extension settles up to $2 \times 30 \times 0.01 = 0.6$ credits.

### 4. Architecture

```
                     ┌──────────────────────────┐
Admin / Billing ───▶ │ Issuance / Top-up Svc    │──┐
                     └──────────────────────────┘  │ ledger writes
                     ┌──────────────────────────┐  ▼
Scheduler /          │ Credit Ledger Service    │  (strong-consistent SQL,
Admission Ctrl ◀───▶ │ • balances (derived)     │   sharded by org_id)
      ▲              │ • ledger_entries (immut.)│
      │ admit/deny   │ • reservations (TTL)     │
      │              └──────────────────────────┘
      │              ┌──────────────────────────┐
      ├────────────▶ │ Quota & Rate-Limit Engine│  (Redis Cluster + Lua)
      │              └──────────────────────────┘
      │              ┌──────────────────────────┐
      └────────────▶ │ Pricing Service          │  (versioned prices)
                     └──────────────────────────┘

┌────────────┐ usage  ┌────────────────┐ rollup ┌────────────┐ adjust
│ Node Agent │──────▶ │ Usage Pipeline │──────▶ │ Reconciler │──────▶ ledger
│ (per node) │ events │ (Kafka + agg)  │        └────────────┘
└────────────┘        └────────────────┘
      ▲
      │ lease extend/deny
      └──────────────▶ Credit Ledger Service

┌─────────────┐
│ Event bus → │ → BI / invoices / dashboards
│  Analytics  │
└─────────────┘
```

- **Node agent** (one per node): measures real GPU usage per pod, heartbeats it, requests lease extensions, and **throttles or terminates the job** if an extension is denied. It is the local enforcement point.
- **Usage pipeline → reconciler**: agents emit measured usage to Kafka; an aggregator rolls it up per job; the reconciler compares *measured* cost to *provisionally settled* cost and posts adjustment/refund entries. This catches metering vs. billing drift.
- **Observability** spans all of it (see §14).

### 5. Data model

Strong-consistent SQL store (PostgreSQL / CockroachDB / Spanner), sharded by `org_id`.

```sql
accounts(account_id PK, scope_type ENUM(org|project|user), scope_ref,
         parent_account_id FK, status, created_at)

budgets(account_id, period_start, period_end, allowance,
        model ENUM(prepaid|postpaid), credit_limit, rollover_policy)

-- ONE derived row per (account, period); version = OCC token.
-- available/reserved/spent are the cached projection of the three
-- contra sub-accounts (acct:available / acct:reserved / acct:spent).
balances(account_id, period, available, reserved, spent, version)

-- immutable, append-only, double-entry.
-- For reserve/settle/release/refund the debit/credit accounts are the
-- contra sub-accounts of ONE business account (see §2); for
-- issue/topup/transfer they are two distinct business accounts.
ledger_entries(entry_id PK, ts, debit_account_id, credit_account_id, amount,
               type ENUM(issue|topup|transfer|reserve|release|settle|refund|expire|adjust),
               job_id, reservation_id, idempotency_key UNIQUE, metadata jsonb)

reservations(reservation_id PK, account_id, job_id, gpu_type, gpu_count,
             rate_per_sec, amount_held, amount_settled,
             state ENUM(active|settled|canceled|expired), expires_at,
             idempotency_key UNIQUE)

usage_events(id PK, account_id, job_id, node_id, gpu_type, gpu_count,
             start_ts, end_ts, seconds, measured_util, seq)  -- seq dedupes

prices(gpu_type, region, credits_per_sec, effective_from, effective_to)

quotas(account_id, dimension,
       limit_type ENUM(concurrent_gpus|concurrent_jobs|credits_per_period),
       limit_value, window)

-- token-bucket state for the §9 rate limiter (hot copy lives in Redis;
-- this is the durable backing / source of truth for params + recovery)
rate_limits(scope_key PK, capacity, refill_per_sec, tokens, last_refill_ts, version)

audit_logs(audit_id PK, ts, actor, action, resource, old_value, new_value, request_id)
```

`balances` is a materialized projection of `ledger_entries`; it exists purely so admission doesn't have to sum the ledger on the hot path. The ledger remains authoritative. The `rate_limits` table persists each bucket's capacity `B`, refill `ρ`, and last-known token count so a Redis flush or node loss can rehydrate the limiter rather than reset every tenant's burst budget to full.

### 6. APIs (every write is idempotent via `Idempotency-Key`)

| Endpoint | Purpose |
|---|---|
| `POST /v1/credits/issue` | Monthly issuance / admin grant (`system → account`) |
| `POST /v1/topups` | Paid top-up |
| `POST /v1/credits/transfer` | Move credits org→project→user (double-entry, see §8) |
| `POST /v1/reservations` | Admission hold; returns `reservation_id`, `rate_per_sec`, `expires_at` |
| `POST /v1/reservations/{id}:extend` | Heartbeat: settle the elapsed slice + re-extend the lease |
| `POST /v1/reservations/{id}:settle` | Final settlement on job end; refund unused hold |
| `POST /v1/reservations/{id}:cancel` | Release the hold (job never started) |
| `POST /v1/usage` | Metering ingest (async, reconciled) |
| `GET /v1/accounts/{id}/balance` | `{available, reserved, spent}` (may be replica-stale) |
| `GET /v1/accounts/{id}/usage` / `…/limits` | Reporting |
| `PUT /v1/accounts/{id}/budgets` / `…/quotas` | Policy management |
| `GET /v1/audit` | Audit log query |

Duplicate `Idempotency-Key` returns the original result without re-applying the effect.

### 7. Real-time deduction via reservations + streaming settlement

This is the heart of the design.
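The per-job protocol starts from the §3 quantities, which can be reproduced in a few lines of Python. The function names are illustrative, and using `Decimal` rather than floats is an assumption (not stated in the design) that suits money arithmetic.

```python
from decimal import Decimal


def spend_rate(gpu_count: int, credits_per_gpu_sec: Decimal) -> Decimal:
    """r = g * p(t): credits per second for the whole job, at the pinned price."""
    return gpu_count * credits_per_gpu_sec


def job_cost(rate: Decimal, seconds: int) -> Decimal:
    """Total credits for a run of `seconds` at rate r."""
    return rate * seconds


def admission_hold(rate: Decimal, lease_s: int) -> Decimal:
    """Credits moved available -> reserved at admission: r * L."""
    return rate * lease_s


r = spend_rate(2, Decimal("0.01"))   # 2 x H100 at 0.01 credits/GPU-s
print(job_cost(r, 10 * 60))          # 12.00 (10-minute job)
print(admission_hold(r, 30))         # 0.60  (L = 30 s)
```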
The protocol per job:

```
ADMIT(job):
    r = gpu_count * price(gpu_type, region)       # pin price now
    if not (quota_ok and rate_limit_ok and fair_share_ok):
        reject
    hold = r * L
    reservation = RESERVE(account, hold)          # available -= hold; reserved += hold
    if reservation == INSUFFICIENT:
        reject (or queue / grant partial)
    return signed_lease(reservation_id, r, expires_at = now + L)

HEARTBEAT(reservation, elapsed_since_last):       # every heartbeat (interval < L)
    slice = r * elapsed_since_last
    SETTLE(reservation, slice)                    # reserved -= slice; spent += slice
                                                  # idempotency_key = (job_id, seq)
    if budget/quota/rate still OK:
        RESERVE(account, slice)                   # top the hold back up to r*L
        EXTEND expires_at = now + L
    else:
        deny extension -> agent throttles or drains within remaining lease + grace

END(job):
    SETTLE final actual slice
    REFUND unused hold (reserved -> available)    # close reservation
```

**Why there's no double-spend and no double-count:**

- `RESERVE` moves `available → reserved` (i.e. `debit acct:available, credit acct:reserved`). It does **not** decrement total credits; it earmarks them.
- `SETTLE` moves `reserved → spent` (`debit acct:reserved, credit acct:spent`). Unspent credit (`available + reserved`) only ever decreases once, at settlement. The upfront hold and the per-slice settle act on different buckets, so reserving `r·L` and then settling slices does **not** charge twice (a subtle bug the naive "burn on reserve *and* on heartbeat" design hits).
- Each settlement carries `idempotency_key = (job_id, seq)`, so a retried heartbeat is a no-op.
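The bucket arithmetic in these bullets can be checked with a toy, single-process model. It is a sketch under stated assumptions: integer credits, a prepaid floor of zero, an in-memory `Account` standing in for the `balances` row, and hypothetical method names; no database, locking, or lease expiry is modeled. It shows that reserve and settle touch different buckets, that `available + reserved + spent` never changes, and that a retried heartbeat with the same `seq` is a no-op.

```python
from dataclasses import dataclass, field


class InsufficientCredits(Exception):
    """Raised when `available` cannot cover a requested hold (prepaid floor of 0)."""


@dataclass
class Account:
    # In-memory projection of the three contra sub-accounts (integer credits).
    available: int
    reserved: int = 0
    spent: int = 0


@dataclass
class Reservation:
    account: Account
    rate: int      # r: credits per second, pinned at admission
    lease_s: int   # L: lease window in seconds
    held: int = 0  # this job's share of account.reserved
    applied: set = field(default_factory=set)  # heartbeat seqs already settled

    def _reserve(self, amount: int) -> None:
        # available -> reserved (debit acct:available, credit acct:reserved)
        if self.account.available < amount:
            raise InsufficientCredits(amount)
        self.account.available -= amount
        self.account.reserved += amount
        self.held += amount

    def _settle(self, elapsed_s: int) -> None:
        # reserved -> spent (debit acct:reserved, credit acct:spent), never beyond the hold
        amount = min(self.rate * elapsed_s, self.held)
        self.account.reserved -= amount
        self.account.spent += amount
        self.held -= amount

    def admit(self) -> None:
        self._reserve(self.rate * self.lease_s)  # hold r * L up front

    def heartbeat(self, seq: int, elapsed_s: int) -> str:
        if seq in self.applied:
            return "duplicate"  # retried heartbeat: nothing is applied twice
        self._settle(elapsed_s)
        self.applied.add(seq)
        try:
            self._reserve(self.rate * self.lease_s - self.held)  # top back up to r * L
            return "extended"
        except InsufficientCredits:
            return "denied"  # the agent drains within the remaining lease

    def end(self, elapsed_s: int) -> None:
        self._settle(elapsed_s)
        # refund the unused hold: reserved -> available
        self.account.reserved -= self.held
        self.account.available += self.held
        self.held = 0


acct = Account(available=100)
job = Reservation(acct, rate=1, lease_s=30)
job.admit()                    # available 70, reserved 30
print(job.heartbeat(1, 20))    # extended
print(job.heartbeat(1, 20))    # duplicate
job.end(10)
print(acct)                    # Account(available=70, reserved=0, spent=30)
```

The job ran 30 seconds at 1 credit/s and was charged exactly 30 credits even though a 30-credit hold was placed, topped back up, and one heartbeat was retried.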
**Atomic reserve (SQL, OCC + guard) — fast-path projection plus the paired ledger entry:**

```sql
-- :entry_id, :res_id, :job_id and :lease are illustrative bind parameters.
BEGIN;
UPDATE balances
   SET available = available - :hold,
       reserved  = reserved  + :hold,
       version   = version + 1
 WHERE account_id = :acct AND period = :p
   AND available >= :hold   -- prepaid; postpaid: available + credit_limit >= :hold
   AND version = :v;        -- 0 rows ⇒ ROLLBACK, then retry (stale version) or reject
-- same transaction (the auditable double-entry pair backing the projection above):
INSERT INTO ledger_entries (entry_id, ts, type, debit_account_id, credit_account_id,
                            amount, job_id, reservation_id, idempotency_key)
VALUES (:entry_id, now(), 'reserve', :acct || ':available', :acct || ':reserved',
        :hold, :job_id, :res_id, :k);
INSERT INTO reservations (reservation_id, account_id, job_id, amount_held,
                          amount_settled, state, expires_at, idempotency_key)
VALUES (:res_id, :acct, :job_id, :hold, 0, 'active', now() + :lease, :k);
COMMIT;
```

The `UPDATE` keeps the projection cheap; the `INSERT` keeps it honest. Under contention, `SELECT ... FOR UPDATE` on the single balance row serializes writers within the shard. For the lowest-latency path, the hot rate-limit/quota check can run in Redis with a Lua script and a durable write-ahead, while the authoritative reserve always lands in SQL.

**Two operating points:**

- *Strict pay-as-you-go* — hold only the next slice. Worst-case exposure ≈ `r × (one slice + grace)`. More ledger writes.
- *Reserve-then-burn* — hold `r·L` upfront, settle down as usage arrives, refund the remainder. Fewer extension round-trips and lower latency; the trade is more in-flight reserved credit and more care in reconciliation.

### 8. Transfers (org → project → user)

A transfer is genuine two-account double-entry: debit the source account, credit the destination, one transaction. Within a shard (same org) it's a local ACID write. Cross-shard transfers (rare) need a distributed protocol, and the two options are *not* the same thing:

- **Saga with an escrow account (preferred here):** a sequence of independently-committed, compensatable local transactions — `debit source → credit escrow` (shard A), then `debit escrow → credit destination` (shard B).
  Each step is idempotent; if step 2 fails we run the compensating `debit escrow → credit source`. No global lock is held; the escrow account makes any partial state a *valid balanced ledger state* rather than lost credit. That is the whole flow: two forward steps, plus one compensating step on failure.
- **2PC (the blocking alternative):** a coordinator runs `prepare` on both shards, then `commit`/`abort`. It is atomic without compensation but holds locks across the prepare window and stalls if the coordinator dies, which is generally undesirable on the money path at this scale. We mention it only as the contrast; we'd choose the saga.

We keep org sub-accounts co-located on one shard precisely so the common case stays single-shard and neither distributed protocol is needed.

### 9. Quotas, rate limits, hierarchical budgets

Checked at admission **and** on every extension, top-down org → project → user:

- **Concurrency caps:** `current_gpus(account) + g ≤ max_concurrent_gpus`; also max concurrent jobs. Counters live in Redis with per-job TTL keys so a crashed agent's slot self-releases.
- **Period/spend caps:** a `credits_per_period` quota independent of the raw balance (e.g., "this user may spend ≤ 200/day even though the org has 10k").
- **Rate limit (burst guard):** a token bucket per scope, capacity `B`, refill `ρ` tokens/s; an extension must draw `slice` tokens. Bucket params and live token count are persisted in `rate_limits` (§5) and served hot from Redis. **This is deliberately separate from the budget**: its job is to damp bursts and protect the ledger/scheduler from a single tenant hammering, *not* to encode the monthly allowance. Tune `B` and `ρ` for burst tolerance and infra protection; the monthly allowance is enforced by the ledger balance, not by the bucket's refill rate. (Conflating the two, e.g. setting `ρ = allowance/period`, makes the bucket double as a budget and breaks both jobs.)

Split hot org-level buckets from per-user buckets, and shard Redis by `account_id`, to avoid hot keys.

### 10. Prepaid vs. postpaid

- **Prepaid:** `available = issued + topups − reserved − spent`. Reject any reserve that would push `available < 0` (optionally a tiny overdraft buffer to allow a clean shutdown slice).
- **Postpaid:** same identity, but the floor drops by the credit limit: `available` may go down to `−credit_limit`, so the spendable headroom is `available + credit_limit` (the postpaid guard in §7). Risk controls: per-job cap, daily cap, anomaly detection, and auto-pause on delinquency.

The §2 invariant holds for both; only the floor changes.

### 11. Fair-share when GPUs are scarce

When demand exceeds supply, having credits is necessary but not sufficient; the scheduler decides *whose* job runs:

- **Weighted max-min fair share.** Each account has weight `w(u)`; the scheduler admits/preempts so that `active_gpus(u) / w(u)` stays balanced across competing tenants. Equal weights ⇒ equal shares; paid tiers map to higher weights.
- **DRF (Dominant Resource Fairness)** is the generalization when jobs contend on multiple resources (GPU + CPU + memory): fair-share is computed on each tenant's *dominant* resource.
- **Backpressure & preemption:** queue or preempt low-priority jobs from tenants above their fair share; respect priority classes within policy.
- The admission controller consults **credits AND quota AND rate limit AND fair-share** before binding a pod: fairness can throttle a credit-rich tenant, and a credit-poor tenant never runs regardless of fairness.

### 12. Overspend guardrails, idempotency, consistency, failure recovery

**Guardrails:** short lease TTLs (`2–5 min`) with rolling heartbeats; small-slice streaming settlement (so a crash loses at most one slice + grace); per-job `max_amount` and per-scope daily caps; node-agent termination on denied extension; price pinned on the reservation.

**Idempotency:** `UNIQUE idempotency_key` on `ledger_entries` and `reservations` gives exactly-once under client retries (duplicate key ⇒ return prior result).
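The "duplicate key ⇒ return prior result" rule can be illustrated with an in-memory store. This is a sketch, not the production mechanism: the design relies on a `UNIQUE idempotency_key` column written in the same transaction as the effect, whereas here a dict plus a lock plays that role, and the key format `job-42:seq-1` is a made-up rendering of a `(job_id, seq)` pair.

```python
import threading
from typing import Any, Callable, Dict


class IdempotencyStore:
    """In-memory stand-in for the UNIQUE idempotency_key constraint (hypothetical helper).

    The lock is held while the effect runs, mimicking "record the key and apply the
    effect in one transaction"; if the effect raises, no key is recorded.
    """

    def __init__(self) -> None:
        self._results: Dict[str, Any] = {}
        self._lock = threading.Lock()

    def execute(self, key: str, effect: Callable[[], Any]) -> Any:
        with self._lock:
            if key in self._results:
                return self._results[key]  # duplicate key: return the prior result
            result = effect()              # first time: apply the effect once
            self._results[key] = result
            return result


spent = {"job-42": 0}


def settle_slice():
    spent["job-42"] += 5
    return {"spent": spent["job-42"]}


store = IdempotencyStore()
first = store.execute("job-42:seq-1", settle_slice)
retry = store.execute("job-42:seq-1", settle_slice)  # client retry with the same key
print(first, retry, spent)  # {'spent': 5} {'spent': 5} {'job-42': 5}
```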
Heartbeats key on `(job_id, seq)`; final settlement on `job_id:final`. Usage events carry a monotonic `seq` to dedupe.

**Consistency model:** strong/serializable for `reserve | settle | transfer | issue`; per-org sharding gives single-writer semantics within a shard. Reporting and metering are eventually consistent via the event bus; admission **never** reads a stale replica.

**Failure recovery:**

| Failure | Behavior |
|---|---|
| Client retry | Safe: idempotent. |
| Reservation expiry | Sweeper cancels expired holds, posts a compensating `release` entry; loss bounded to `r × (remaining_lease + grace)`. |
| Node / agent crash | No heartbeats ⇒ lease expires ⇒ job preempted; reconciler refunds any overcharge. |
| Service crash mid-write | Ledger entry + balance update are one transaction: both commit or neither. |
| Ledger outage | Brief offline operation against a small on-node escrow (1–2 lease windows); reconcile and stop jobs on reconnect if overspent. |
| Network partition | Leases cap exposure; extensions fail past the escrow. |
| Clock skew | Server timestamps are authoritative for TTL; client times rejected. |

**The on-node escrow is an explicit, bounded availability-over-strictness tradeoff.** During a ledger/partition outage a node may keep a job alive against ≤ 1–2 lease windows of pre-granted escrow. This *can* spend prepaid credit the node cannot re-confirm is still available, so it is a **bounded, disclosed overspend window**, not zero; it is the one place the otherwise-strict "no spend without an authoritative check" rule is relaxed for liveness. The exposure is capped at `r × L × escrow_windows` credits (rate × lease length × number of windows), the escrow is pre-deducted from `available` when granted (so it can't be double-counted), and on reconnect the reconciler trues up and pauses any job whose escrow exceeded its remaining balance. Teams that want hard strictness can set `escrow_windows = 0`, trading liveness during an outage for zero overspend.
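The escrow bound can be made concrete with a small node-side model. This is a sketch under assumptions: the escrow has already been pre-deducted from `available`, `OfflineEscrow` and its methods are hypothetical, and exposure is measured in credits as `r × L × escrow_windows`.

```python
from dataclasses import dataclass


@dataclass
class OfflineEscrow:
    """Node-side escrow used only while the ledger is unreachable (hypothetical helper).

    Assumes the escrow was pre-deducted from `available` when granted, so the node
    only tracks how much of it has been consumed.
    """
    rate: int      # r: credits per second, pinned on the reservation
    lease_s: int   # L: lease window in seconds
    windows: int   # escrow_windows; 0 means no offline spending at all
    used: int = 0

    @property
    def cap(self) -> int:
        return self.rate * self.lease_s * self.windows  # worst-case offline exposure

    def tick(self, elapsed_s: int) -> bool:
        """Charge elapsed time against the escrow; False means stop the job now."""
        self.used += min(self.rate * elapsed_s, self.cap - self.used)
        return self.used < self.cap

    def reconcile(self) -> tuple:
        """On reconnect: (credits to settle, unused escrow to return to available)."""
        return self.used, self.cap - self.used


esc = OfflineEscrow(rate=2, lease_s=30, windows=2)           # cap = 2 * 30 * 2 = 120
print(esc.tick(45), esc.tick(45), esc.reconcile())           # True False (120, 0)
print(OfflineEscrow(rate=2, lease_s=30, windows=0).tick(1))  # False: hard strictness
```

However long the outage lasts, the node can never charge more than `cap`, and `windows=0` stops the job on the first offline tick.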
**Reconciliation:** the reconciler computes `adjustment = measured_cost − provisionally_settled` per job using the reservation's pinned price; posts `adjust`/`refund` entries; alarms when `|drift|` exceeds a threshold. A periodic audit asserts the §2 invariant and the double-entry trial balance `sum(debits) == sum(credits)` and `sum(ledger projection) == balances`.

### 13. Scaling, partitioning, caching

- **Shard ledger/balances by `org_id`**, co-locating all of an org's sub-accounts so transfers and contention stay local. One primary balance row per account avoids hot-spotting.
- **Coalesce heartbeats** (one settlement per 30 s window rather than per-second): this is what turns the ~10⁵/s naive write rate from §1 into a few thousand/s.
- **Append-only ledger** with periodic snapshots/compaction for fast balance reconstruction.
- **Read-through cache for balances, for dashboards only and explicitly *not* a correctness mechanism.** Admission **always** bypasses the cache and reads the authoritative row under OCC; only stale-tolerant reads (dashboards, reporting) hit the cache. We store the balance's `version` *inside* the cached value (not as the lookup key) and TTL the entry, so a reader sees how fresh the snapshot is and the write path can bump/evict on commit. Putting `version` in the *key* would be self-defeating: you'd have to read the authoritative row to learn the current version before you could form the key, which defeats the cache. So the cache is keyed by `(account_id, period)` and carries `version` as data for staleness display and invalidation, never as a guard the read path relies on for correctness.
- **Redis Cluster** for quotas/rate limits, key-hashed on `account_id`; **Kafka** partitioned by `org_id`/`job_id` to preserve per-job ordering.

### 14. Audit, observability, alerting, security

- **Audit:** immutable `ledger_entries` + `audit_logs` give a complete, replayable history; every monetary event is reconstructable as a balanced double-entry pair.
  Event bus feeds invoices, BI, and per-org/project/user reports.
- **Metrics:** reservations granted/denied (with reason), spend rate, refunds/adjustments, token utilization & throttle rate, API P50/P95, idempotency-conflict rate, sweeper backlog, reconciliation drift.
- **Tracing:** scheduler → credit service → DB; structured logs carry `request_id`, `idempotency_key`, `account_id`, `job_id`, decision reason.
- **Alerts:** denial-rate spikes, ledger lock-wait / write-latency elevation, sweeper backlog growth, accounts near depletion or over `credit_limit`, reconciliation drift over threshold, and any audit run where `sum(debits) ≠ sum(credits)` (a ledger integrity breach).
- **Security/integrity:** signed lease tokens (carry `reservation_id`, `rate`, `expires_at`; agents can't mint them); mTLS agent↔control-plane; least-privilege service accounts.

### 15. Worked end-to-end example

Setup: Org allowance 10,000 credits/month (prepaid); Project P cap 6,000; User U cap 200/day, concurrent_jobs ≤ 2; A100 = 1 credit/min.

1. Issuance: `+10,000` to Org; transfer `+6,000` Org→Project P (two-account double-entry).
2. U submits two 1-GPU jobs (expected 60 min). Day cap 200 ⇒ both admitted; each reserves a rolling `r·L` hold (`r = 1 credit/min`, posted as `acct:available → acct:reserved`).
3. Heartbeats settle 1 credit/min each (`acct:reserved → acct:spent`).
4. Job A runs 50 min → settles 50, refunds the unused hold. Job B runs 70 min → at min 60 it requests an extension; user remaining = `200 − 50 − 60 = 90 ≥ 0` and Project P has budget ⇒ granted; settles 70 total.
5. Final: Project P `spent = 120`, and its `available` falls from 6,000 to 5,880 once all holds are released; the Org's own account holds 4,000 after the transfer, and its roll-up across sub-accounts is down 120. The audit shows two reservations, streamed settlements, one extension, refunds, and a balanced trial balance. Invariant `available + reserved + spent = issued + topups` holds throughout, counting the 6,000 transfer as Project P's issuance.

### 16. Testing and validation

Correctness here is a property of the *ledger*, so we test it as one:

- **Property / invariant tests.** For randomized sequences of issue/topup/transfer/reserve/settle/refund/expire, assert after every step: (a) the double-entry trial balance `sum(debits) == sum(credits)`; (b) the §2 bucket invariant `available + reserved + spent == issued + topups`; (c) `available ≥ 0` (prepaid) / `≥ −credit_limit` (postpaid); (d) `sum(ledger projection) == balances`; (e) every `idempotency_key` is unique and a replayed request is a no-op.
- **Concurrency tests.** Hammer a single account with parallel reserve/extend/settle under the OCC + `FOR UPDATE` path and assert no lost update, no double-spend, no double-count (each heartbeat net-decrements `available` by exactly one slice).
- **Fault injection.** Kill the ledger mid-write (assert all-or-nothing), kill agents (assert lease expiry + sweeper release + bounded `r × (remaining_lease + grace)` exposure), and partition the ledger to exercise the on-node escrow (assert exposure ≤ `r × L × escrow_windows` and that reconnect reconciliation trues up and pauses overspent jobs).
- **Load tests against the §1 write-QPS SLO.** Drive ~10⁵ heartbeating jobs with 30 s coalescing and verify settlement write throughput and tail latency stay within budget, and that admission P95 does not regress under contention.
- **Reconciliation tests.** Feed measured usage that diverges from provisional settlement and assert the reconciler posts the correct `adjust`/`refund` entries against the *pinned* price and raises a drift alarm past threshold.

### 17. Trade-offs and pitfalls

- **Lease length:** longer `L` cuts write QPS but raises crash exposure; pick by risk appetite vs. latency.
- **Postpaid** is simpler at runtime but needs credit-risk management; **prepaid** is stricter but harder UX on exhaustion.
- **On-node escrow** buys liveness through a ledger outage at the cost of a bounded overspend window; `escrow_windows = 0` flips that tradeoff back to hard strictness.
- **Central** scheduler fairness is more accurate; **per-node** local fairness is more resilient under control-plane outage.
- **Pitfalls:** long-lived holds that tie up budget and balloon crash exposure; treating `version` as a cache *key* (self-defeating) instead of cached data; cross-shard transfers without saga/escrow guarantees; relying on budgets alone with no rate limit (bursty overspend); retroactive price changes without a pinning policy; the double-count trap of charging on both reserve and settle; and writing reserve/settle as bare column edits without the paired contra-account ledger entry (which silently breaks the audit trail).

### Summary

A strongly consistent, immutable, append-only **double-entry ledger** is the source of truth: genuine two-account moves for issue/topup/transfer, and contra-sub-account pairs (`available`/`reserved`/`spent`) for the reservation lifecycle, so every entry balances and the audit is a real trial balance. **Short renewable leases with streaming settlement** deduct credits across many nodes with bounded crash exposure and no double-spend. **Budgets (ledger), quotas (hierarchical caps), rate limits (persisted token buckets), and fair-share (scheduler)** are four independent gates, all checked at admission and on every extension. Prepaid and postpaid differ only in the balance floor. Idempotent APIs, atomic OCC mutations, lease-expiry sweeping, a bounded-and-disclosed on-node escrow, a reconciliation pipeline, and an invariant-driven test suite keep it correct, auditable, and scalable under failure.

Explanation

Rubric: the strongest answers anchor on a strongly consistent, immutable double-entry ledger as the source of truth and use short-lived reservations/holds with streaming settlement to deduct credits in real time without double-spend or overspend. Look for: idempotent APIs (issue/transfer/top-up/reserve/extend/spend) with unique keys; hierarchical quotas (org -> project -> user) and token-bucket rate limits enforced at admission and during execution; scheduler/admission integration; prepaid vs postpaid models; fair-share scheduling (weights/DRF) when GPUs are scarce; concrete failure handling (lease expiry sweeper, on-node escrow, reconciliation, bounded exposure); and scaling via org-sharding, caching with versioning, and partitioned usage streams, plus audit logging and observability.

Aug 4, 2025, 10:55 AM

Design a GPU Credit Allocation System

Design a GPU credit allocation system for a multi-tenant compute platform (for example, OpenAI's GPU infrastructure).

Scenario. Organizations are granted GPU credits (e.g., a monthly allowance). Users submit jobs that run across many GPU nodes, and those jobs consume credits as they execute. Your system is the metering, budgeting, and admission layer that tracks consumption, enforces limits, and decides whether each workload may run.

Walk through your architecture, data model, key APIs, and the consistency and failure-handling decisions behind it.

Clarifying Questions to Ask

Before designing, scope the problem with the interviewer:

  • Account hierarchy. How are tenants structured — is it org → project → user, and can credits live at any scope, or only at the top? How is a job attributed to an account?
  • Unit of account. What is a "credit" — a fixed amount of GPU-time at a reference price, or a currency-pegged unit? Does pricing differ by GPU type (e.g., H100 vs. A100) and region?
  • Scale. Roughly how many concurrent jobs, GPUs, and orgs? What heartbeat/metering granularity is expected (per-second, per-minute)?
  • Correctness vs. availability. Is it acceptable for balance reads (dashboards) to be slightly stale, while spends must be strongly consistent? How much overspend, if any, is tolerable during an outage?
  • Billing semantics. Do we need both prepaid (hard stop at zero) and postpaid (credit-limit / overdraft)? What happens to unused credits at period end (rollover/expiry)?
  • Fairness expectations. When GPUs are scarce, should having credits guarantee a job runs, or can a credit-rich tenant still be throttled for fairness?

Constraints & Assumptions

Anchor the design with explicit numbers (state your own if the interviewer doesn't pin them):

  • Account scopes: org → project → user; a job is attributed to a (user, project, org) triple.
  • Billing granularity is GPU-seconds; a job occupies $g$ GPUs of type $t$ for some duration.
  • Illustrative target scale: $\sim 10^5$ concurrent jobs, $\sim 10^5$–$10^6$ GPUs, tens of thousands of orgs.
  • Metering heartbeat interval on the order of 15–60 s.
  • Spends must be strongly consistent; balance reads for dashboards may be slightly stale.
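A rough capacity estimate from these illustrative numbers shows why usage should be batched per account before it reaches the ledger. It assumes one usage report per job per heartbeat interval, and per GPU at the 15 s interval for the upper bound; $N_{\text{active accounts}}$ and $T_{\text{batch}}$ are symbols introduced here for the batching window, not values from the prompt.

```latex
\begin{aligned}
\text{job reports/s} &= \frac{N_{\text{jobs}}}{T_{\text{hb}}}
  \in \left[\frac{10^5}{60\,\text{s}},\ \frac{10^5}{15\,\text{s}}\right]
  \approx \left[1.7\times10^{3},\ 6.7\times10^{3}\right] \\
\text{per-GPU reports/s (upper bound)} &\approx \frac{10^6}{15\,\text{s}} \approx 6.7\times10^{4} \\
\text{ledger postings/s with per-account batching} &\le \frac{N_{\text{active accounts}}}{T_{\text{batch}}}
\end{aligned}
```

With batching, ledger write volume is capped by the number of active accounts per window rather than by the number of jobs or GPUs.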

What your design must cover

  1. Real-time metering and deduction. Track GPU consumption per job and deduct credits as work is performed, across multiple GPU nodes.
  2. Credit APIs. Define APIs to:
    • Issue/grant credits.
    • Transfer credits between scopes (org → project → user).
    • Top up credits.
    • Query balances.
    • Spend/settle credits for jobs.
  3. Limits under high concurrency. Enforce budgets, per-user and per-project quotas, and rate limits in real time, preventing double-spend and overspend.
  4. Scheduler integration. Integrate with a job scheduler / admission controller to admit, throttle, or reject workloads based on available credits and limits.
  5. Billing models. Support both prepaid and postpaid billing.
  6. Fair usage under scarcity. When GPUs are scarce, enforce fair usage (fair-share across users and orgs).
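For fair usage under scarcity, one standard policy for a single scarce resource is weighted max-min fairness computed by water-filling: tenants asking for less than their weighted share get their full demand, and the leftover is split among the rest in proportion to weight. DRF generalizes max-min fairness to multiple resource types; with GPUs as the only resource it reduces to this case. The sketch below is illustrative: the function name and inputs are hypothetical, weights are assumed positive, and allocations are fractional (read them as GPU share over a scheduling window).

```python
def weighted_max_min(capacity: float, demand: dict[str, float],
                     weight: dict[str, float]) -> dict[str, float]:
    """Water-filling: split `capacity` by weight, never giving a tenant more than it asked for."""
    alloc = {t: 0.0 for t in demand}
    active = {t for t, d in demand.items() if d > 0}
    remaining = capacity
    while active and remaining > 1e-9:
        share = remaining / sum(weight[t] for t in active)  # capacity per unit of weight
        satisfied = {t for t in active if demand[t] <= share * weight[t]}
        if not satisfied:
            # Every remaining tenant wants more than its weighted share: split what is left.
            for t in active:
                alloc[t] = share * weight[t]
            break
        for t in satisfied:
            alloc[t] = demand[t]  # small demands are met in full...
            remaining -= demand[t]
        active -= satisfied       # ...and the surplus is redistributed next round
    return alloc
```

Fair-share acts as a separate gate from credits: a credit-rich tenant can still be capped at its weighted share while GPUs are scarce.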

Cross-cutting concerns to address

For each of the above, be ready to discuss:

  • Idempotency and the consistency model.
  • Data model.
  • Failure recovery.
  • Scaling (partitioning, caching).
  • Audit logging / reporting.
  • Observability / alerting.
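For idempotency and failure recovery in streaming settlement, one common approach is to have nodes report cumulative GPU-seconds per job, so a duplicated or reordered heartbeat charges nothing, and to tie each job's hold to a lease that a sweeper expires when heartbeats stop. The sketch assumes that protocol. `Meter`, `Lease`, the `CONTINUE`/`STOP`/`DUPLICATE` replies, and the TTL are hypothetical, and in a full design each charge would be posted as a settle against the ledger rather than kept in memory. Pinning `rate_per_gpu_s` at admission is one way to keep a mid-job price change from re-pricing a running job.

```python
from dataclasses import dataclass


@dataclass
class Lease:
    rate_per_gpu_s: int   # price pinned at admission; later price changes don't apply
    hold_remaining: int   # credits still reserved for this job
    expires_at: float     # deadline; each accepted heartbeat pushes it out by the TTL
    metered_gpu_s: float = 0.0  # highest cumulative GPU-seconds already charged
    charged: int = 0


class Meter:
    """Hypothetical in-memory metering service; real charges would be ledger settles."""

    def __init__(self, lease_ttl_s: float) -> None:
        self.ttl = lease_ttl_s
        self.leases: dict[str, Lease] = {}

    def admit(self, job_id: str, rate_per_gpu_s: int, hold: int, now: float) -> None:
        self.leases[job_id] = Lease(rate_per_gpu_s, hold, now + self.ttl)

    def heartbeat(self, job_id: str, cumulative_gpu_s: float, now: float) -> str:
        lease = self.leases.get(job_id)
        if lease is None:
            return "STOP"  # lease was swept; the node is expected to kill the job
        if cumulative_gpu_s <= lease.metered_gpu_s:
            return "DUPLICATE"  # replayed, reordered, or no new usage: charge nothing
        # int() truncates fractional milli-credits in the tenant's favor (a simplification).
        cost = int((cumulative_gpu_s - lease.metered_gpu_s) * lease.rate_per_gpu_s)
        charge = min(cost, lease.hold_remaining)  # never charge past the hold
        lease.hold_remaining -= charge
        lease.charged += charge
        lease.metered_gpu_s = cumulative_gpu_s
        lease.expires_at = now + self.ttl
        return "CONTINUE" if lease.hold_remaining > 0 else "STOP"

    def sweep(self, now: float) -> list[tuple[str, int]]:
        # Expire leases whose heartbeats stopped; return unspent holds for release.
        expired = [job for job, lease in self.leases.items() if lease.expires_at <= now]
        return [(job, self.leases.pop(job).hold_remaining) for job in expired]
```

In this sketch a job that goes silent keeps its hold for at most one TTL before the sweeper returns it; actually stopping a partitioned node's job in that window is the role of the on-node escrow the rubric mentions.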

What a Strong Answer Covers

The interviewer is looking for these dimensions (not the answers themselves):

  • A clear source of truth and consistency story — what is authoritative, what is derived, and where strong consistency is required vs. where eventual consistency is acceptable.
  • A real-time deduction mechanism that works across many nodes without charging twice or losing credit, with bounded crash exposure.
  • A coherent data model spanning accounts/hierarchy, balances, the ledger, reservations, usage, pricing, quotas, and rate limits.
  • A complete, idempotent API set covering issuance, transfer, top-up, balance query, and the spend/settle lifecycle.
  • Four controls kept separate (budget, quota, rate limit, and fair-share), and where each is checked.
  • Prepaid vs. postpaid handled with minimal divergence and explicit credit-risk controls.
  • Failure handling — retries, lease expiry, node/service crash, partition, clock skew — with the tradeoffs stated rather than hidden.
  • Scaling choices — partitioning/sharding key, caching (and why the cache is not a correctness mechanism), and stream partitioning to preserve ordering.
  • Audit, observability, and alerting — what is logged, what metrics matter, and what alerts on integrity breaches.
  • A capacity estimate that justifies the batching/partitioning decisions, and explicit discussion of trade-offs and pitfalls.
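Keeping the controls separate can be sketched as a single admission check that runs the rate limit, hierarchical quota, and budget gates in that order and reports which one fired. Fair-share is deliberately absent because it belongs to the scheduler's ordering. `TokenBucket`, `Scope`, `admit`, and the THROTTLE/REJECT mapping are assumptions for illustration. In particular, the budget gate compares against a passed-in `available_credits`, whereas in a real system it would be the ledger's atomic reserve call rather than a possibly stale read.

```python
from dataclasses import dataclass


class TokenBucket:
    """Token bucket: refills `rate` tokens per second, holding at most `capacity`."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0) -> None:
        self.rate, self.capacity = rate, capacity
        self.tokens, self.last = capacity, now

    def try_take(self, n: float, now: float) -> bool:
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False


@dataclass
class Scope:
    gpu_quota: int        # max concurrent GPUs for this org, project, or user
    gpus_in_use: int = 0


def admit(job_gpus: int, initial_hold: int, available_credits: int,
          scopes: list[Scope], bucket: TokenBucket, now: float) -> str:
    # 1. Rate limit: cheapest check, shields quota and ledger from bursts.
    #    A token is spent even if a later gate rejects, so retries still count.
    if not bucket.try_take(1, now):
        return "THROTTLE: rate limit"
    # 2. Hierarchical quota: the job must fit at user, project, and org level.
    if any(s.gpus_in_use + job_gpus > s.gpu_quota for s in scopes):
        return "THROTTLE: quota"
    # 3. Budget: stand-in for the ledger's atomic reserve of the initial hold.
    if available_credits < initial_hold:
        return "REJECT: budget"
    for s in scopes:
        s.gpus_in_use += job_gpus
    return "ADMIT"
```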

Follow-up Questions

Be ready for deeper probes after your main design:

  • How does the system behave during a ledger outage or network partition — do jobs keep running, and exactly how much overspend is possible? Is that tradeoff bounded and disclosed, or hidden?
  • A tenant bursts with thousands of jobs at once. What breaks first — the ledger, the rate limiter, the scheduler — and how do you protect each?
  • How do you guarantee a mid-job price change can't retroactively re-charge a running job?
  • Walk through a cross-shard transfer (org sub-accounts on different shards). Why might you prefer a saga with an escrow account over two-phase commit on the money path?
  • How would you test that overspend and double-spend are impossible — what invariants would a property-based test assert after every operation?
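For the cross-shard transfer probe, an escrow saga can be sketched as three local, idempotent steps: debit the source into escrow, credit the destination, then clear the escrow, with a compensating refund if the destination step definitively fails. One argument for this over two-phase commit is that no shard holds locks while waiting on a coordinator; in-flight money sits in an explicit escrow balance that reconciliation can see. The shard model, the `escrow` and `inter_shard` account names, and the `dst_up` failure-injection flag are hypothetical. Because every step balances to zero, a test can assert after any operation that the sum of all balances across shards is unchanged.

```python
class ShardUnavailable(Exception):
    pass


class Shard:
    """One ledger shard; `apply` is a local atomic, idempotent transaction."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.balances: dict[str, int] = {"escrow": 0, "inter_shard": 0}
        self.applied: set[str] = set()

    def apply(self, step_id: str, deltas: dict[str, int]) -> None:
        assert sum(deltas.values()) == 0, "every step must balance"
        if step_id in self.applied:
            return  # replayed step: already applied, do nothing
        for account, delta in deltas.items():
            self.balances[account] = self.balances.get(account, 0) + delta
        self.applied.add(step_id)


def transfer(tid: str, src_shard: Shard, src: str, dst_shard: Shard, dst: str,
             amount: int, dst_up: bool = True) -> str:
    # Toy check; a real shard would check and debit in the same transaction.
    if src_shard.balances.get(src, 0) < amount:
        return "REJECTED"
    # Step 1 (source shard): move funds into escrow, where they are visibly in flight.
    src_shard.apply(f"{tid}:hold", {src: -amount, "escrow": amount})
    try:
        if not dst_up:  # hypothetical failure-injection flag for the example
            raise ShardUnavailable(dst_shard.name)
        # Step 2 (destination shard): credit dst against its inter-shard account.
        dst_shard.apply(f"{tid}:credit", {dst: amount, "inter_shard": -amount})
    except ShardUnavailable:
        # Compensate only on a definitive failure; on a timeout, retry step 2
        # with the same step id instead, since it may already have applied.
        src_shard.apply(f"{tid}:refund", {"escrow": -amount, src: amount})
        return "COMPENSATED"
    # Step 3 (source shard): clear escrow; inter-shard accounts net to zero globally.
    src_shard.apply(f"{tid}:clear", {"escrow": -amount, "inter_shard": amount})
    return "DONE"
```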

© 2026 PracHub. All rights reserved.