##### Question
Design a GPU credits allocation system that tracks user GPU consumption, deducts credits in real-time, supports credit top-ups, rate limiting, and fair usage enforcement across multiple GPU nodes.
Quick Answer: This question evaluates system design competencies in distributed resource accounting, real-time billing, fault tolerance, rate limiting, and fair scheduling for multi-tenant GPU compute platforms; it falls in the System Design domain.
Solution
# Overview and Key Concepts
This design uses a central, strongly consistent credit ledger with short-lived credit "leases" to enable real-time deduction and prevent double-spend across many GPU nodes. Nodes stream usage to a metering pipeline; a reconciler ensures the ledger matches actual usage. Rate limiting and fair usage are enforced at admission and during execution.
Key terms:
- Credit balance B(u): user's available credits.
- Price p(type): price per GPU-second for a GPU type.
- Usage rate r = g × p(type), where g is number of GPUs allocated to a job.
- Lease window L seconds: the system authorizes up to r × L credits for a job, deducted immediately; extensions renew as the job runs.
- Token bucket for rate limits: spend rate and concurrency caps.
Small numeric example:
- p(H100) = 0.01 credits/GPU-second.
- User runs 2 GPUs for 10 minutes = 2 × 600 × 0.01 = 12 credits.
- With L = 15 s, each extension deducts 2 × 15 × 0.01 = 0.3 credits.
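The pricing arithmetic above can be sketched in a few lines; `job_cost` and `extension_charge` are illustrative names, not part of the design's API.

```python
# Illustrative helpers for the cost model: r = gpu_count * price_per_sec.

def job_cost(gpu_count: int, price_per_sec: float, seconds: float) -> float:
    """Total credits consumed over the job's lifetime: r * seconds."""
    return gpu_count * price_per_sec * seconds

def extension_charge(gpu_count: int, price_per_sec: float, delta: float) -> float:
    """Credits deducted for a single lease extension of `delta` seconds."""
    return gpu_count * price_per_sec * delta

print(round(job_cost(2, 0.01, 600), 2))         # 2 GPUs x 10 min -> 12.0 credits
print(round(extension_charge(2, 0.01, 15), 2))  # one 15 s extension -> 0.3 credits
```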
# Architecture
Components:
1. Auth & Account Service: user identity, account state, limits.
2. Pricing Service: GPU type → price per GPU-second; effective discounts.
3. Credit Ledger Service (strong consistency): maintains balances, immutable ledger entries, and active holds/leases. Backed by a transactional DB (e.g., PostgreSQL/CockroachDB/Spanner). All monetary mutations are idempotent with request_id.
4. Rate Limiter & Quota Service: per-user token buckets (spend/sec), concurrency limits (max GPUs), and daily caps. Backed by Redis/Redis Cluster with Lua for atomic ops.
5. Scheduler/Admission Controller: decides whether to start/continue jobs based on rate limits and credit leases; integrates with cluster scheduler.
6. Node Agent (on each GPU node): measures actual GPU usage (per process/pod), heartbeats usage every t seconds, requests lease extensions, and enforces throttling/termination if lease cannot be extended.
7. Usage Pipeline: agents send usage events to Kafka/PubSub; an Aggregator normalizes and rolls up usage.
8. Reconciler: reconciles rolled-up usage with provisional charges; posts adjustments to the ledger if needed.
9. Top-up Service: processes payments and applies credits; publishes events for notifications.
10. Observability: metrics, logs, alerts; ledger invariants and drift monitors.
# Data Model (simplified)
- users(id, status, limits, created_at)
- balances(user_id, available_credits, updated_at) — single row per user; strong consistency
- ledger_entries(id, user_id, type: [topup|debit|refund|adjustment], amount, request_id, created_at)
- leases(id, user_id, job_id, gpu_type, gpu_count, rate_per_sec, expires_at, amount_reserved, status)
- usage_events(id, user_id, job_id, node_id, gpu_type, gpu_count, start_ts, end_ts, seconds, measured_utilization, seq)
- price_table(gpu_type, price_per_sec, effective_from)
- limits(user_id, max_concurrent_gpus, max_spend_per_sec, daily_cap, weight)
Notes:
- All ledger mutations carry a unique request_id for idempotency.
- balances available_credits should be updated only by transactional operations that also insert a ledger_entry.
# Real-Time Deduction with Leases
Goal: allow jobs on many nodes to consume credits without double-spend and with bounded exposure during failures.
Mechanism:
1. Admission: for a new job with gpu_count g and gpu_type t, compute r = g × p(t). Check rate limits and balance.
2. Create lease for window L seconds: atomically deduct r × L from balance and record a lease row with expires_at = now + L.
- If balance < r × L, reject or place the job in a waiting queue.
3. Node agent runs the job and sets a heartbeat timer every t (< L) seconds.
4. Extension: each heartbeat attempts to extend the lease by Δ = min(t, L) seconds:
- Atomically: deduct r × Δ, push expires_at by Δ, and append a ledger_entry (debit) with request_id = (job_id, seq).
- If deduction fails (insufficient funds or rate-limit breach), the agent is instructed to throttle or gracefully terminate, with a configurable grace period of grace_g seconds drawn from the remaining lease time.
5. Completion: when job ends, close the lease and release any unused reserved time if the implementation pre-reserves more than consumed.
Two implementation patterns:
- Strict pay-as-you-go: do not pre-reserve beyond the next Δ; deduct each extension immediately. Exposure is bounded by r × (grace_g + Δ).
- Reserve-then-burn: pre-reserve r × L upfront to reduce extension path latency; burn down the reservation as usage arrives. On completion, refund unused reserved portion. This lowers ledger write QPS but requires careful reconciliation.
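The strict pay-as-you-go loop (step 4 above) can be sketched as follows. `LedgerStub` and `run_job` are hypothetical stand-ins for the Credit Ledger API and the node agent's heartbeat loop, not the real interfaces.

```python
# Sketch of the pay-as-you-go extension loop; request_id = (job_id, seq) for idempotency.

class LedgerStub:
    """In-memory stand-in for the Credit Ledger Service."""
    def __init__(self, balance: float):
        self.balance = balance
        self.seen = set()  # request_ids already applied (idempotency)

    def debit(self, request_id, amount) -> bool:
        if request_id in self.seen:   # retried request: same result, no double charge
            return True
        if self.balance < amount:
            return False
        self.balance -= amount
        self.seen.add(request_id)
        return True

def run_job(ledger, job_id, rate, delta, total_seconds):
    """Extend the lease every `delta` seconds; stop when a debit is NACKed."""
    elapsed, seq = 0, 0
    while elapsed < total_seconds:
        if not ledger.debit((job_id, seq), rate * delta):
            return elapsed  # throttle/terminate within remaining lease time
        elapsed += delta
        seq += 1
    return elapsed

ledger = LedgerStub(balance=1.0)
# r = 0.04 credits/s, delta = 5 s -> each extension costs 0.2; 1.0 credit lasts 25 s
print(run_job(ledger, "job-1", rate=0.04, delta=5, total_seconds=60))  # 25
```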
Atomicity options:
- SQL row-level locks: UPDATE balances SET available = available - x WHERE user_id = ? AND available >= x; INSERT ledger_entry ... in the same transaction.
- Redis with Lua script: atomic balance check, decrement, and append to a write-ahead queue for persistence.
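The SQL row-lock pattern can be demonstrated with Python's built-in sqlite3 for illustration; a production ledger would use PostgreSQL/CockroachDB/Spanner, and a real implementation would return the cached prior result on a retried request_id rather than rejecting it.

```python
# Minimal sketch: balance deduction and ledger append in one transaction.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE balances (user_id TEXT PRIMARY KEY, available REAL);
CREATE TABLE ledger_entries (request_id TEXT PRIMARY KEY, user_id TEXT,
                             type TEXT, amount REAL);
INSERT INTO balances VALUES ('u1', 50.0);
""")

def debit(conn, user_id, amount, request_id) -> bool:
    """Atomically deduct credits and append a ledger entry, keyed by request_id."""
    try:
        with conn:  # one transaction: both statements commit, or neither does
            cur = conn.execute(
                "UPDATE balances SET available = available - ? "
                "WHERE user_id = ? AND available >= ?",
                (amount, user_id, amount))
            if cur.rowcount == 0:
                raise sqlite3.OperationalError("insufficient funds")
            conn.execute("INSERT INTO ledger_entries VALUES (?, ?, 'debit', ?)",
                         (request_id, user_id, amount))
        return True
    except sqlite3.Error:  # insufficient funds, or duplicate request_id (IntegrityError)
        return False

print(debit(conn, "u1", 0.6, "req-1"))  # True; balance is now 49.4
```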
# Rate Limiting
Use token bucket and counters, enforced at admission and on lease extension:
- Concurrency cap: current_gpus(user) + g ≤ max_concurrent_gpus.
- Spend rate cap: per-user bucket with capacity C and fill rate F = max_spend_per_sec. Each extension consumes r × Δ tokens; if unavailable, throttle.
- Daily cap: track sum(ledger.debit) for the day; deny new leases once exceeded.
Data placement: use Redis hash per user for counters; use TTL keys to track active jobs. Lua scripts ensure atomic increments/decrements.
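The per-user spend-rate bucket (capacity C, fill rate F) can be sketched as below; in production this logic would run as an atomic Redis Lua script, and the class/parameter names here are illustrative.

```python
# Sketch of a token bucket where tokens are credits: each lease extension
# consumes r * delta tokens; refill rate F = max_spend_per_sec.

class TokenBucket:
    def __init__(self, capacity: float, fill_rate: float):
        self.capacity, self.fill_rate = capacity, fill_rate
        self.tokens, self.last = capacity, 0.0  # start full

    def try_consume(self, amount: float, now: float) -> bool:
        """Refill for elapsed time, then consume `amount` tokens if available."""
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.fill_rate)
        self.last = now
        if self.tokens >= amount:
            self.tokens -= amount
            return True
        return False

bucket = TokenBucket(capacity=1.0, fill_rate=0.05)  # max_spend_per_sec = 0.05
print(bucket.try_consume(0.2, now=0))   # True: bucket starts full
print(bucket.try_consume(0.9, now=1))   # False: only ~0.85 tokens after refill
```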
# Fair Usage Across Users
When GPUs are scarce, ensure fairness via the scheduler:
- Per-user share: weight(u) defines a target fraction of GPUs (e.g., equal weights by default). The scheduler admits jobs so that active_gpus(u)/weight(u) stays roughly balanced across users.
- Use Dominant Resource Fairness (DRF) if CPU/memory are also constrained.
- Backpressure: if a user is at or above their fair share, queue or preempt low-priority jobs.
- Priority classes: paying tiers map to higher weights and possibly preempt lower tiers within policy.
Coordination:
- Scheduler periodically computes target shares using cluster state and user weights.
- Admission Controller consults both: credits (lease), rate limits, and fair-share state before starting a pod.
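The weighted fair-share check can be sketched as a simple admission-order rule; `pick_next_user` is an illustrative name, and real schedulers layer preemption and priority classes on top of this.

```python
# Sketch: admit the pending user whose active_gpus(u)/weight(u) is lowest,
# i.e., the user furthest below their fair share.

def pick_next_user(active_gpus: dict, weights: dict, pending: set) -> str:
    """Among users with pending jobs, choose the one furthest below fair share."""
    return min(pending, key=lambda u: active_gpus.get(u, 0) / weights[u])

active = {"alice": 8, "bob": 2}
weights = {"alice": 1.0, "bob": 1.0, "carol": 2.0}
# carol has 0 GPUs and the highest weight, so her normalized share (0/2.0) is lowest
print(pick_next_user(active, weights, {"alice", "bob", "carol"}))  # carol
```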
# Failure Modes and Guardrails
- Node/Agent crash: no new extensions occur; lease expires at expires_at, bounding overspend to r × (remaining_lease + grace_g). Reconciler may refund any overcharge if job actually stopped earlier.
- Ledger outage: allow short offline operation via local budget cache per node (small on-box escrow, e.g., 1–2 extension windows). On reconnect, reconcile and stop jobs if overspent.
- Network partitions: leases prevent double-spend; extensions fail when the ledger is unreachable beyond escrow.
- Double charge/double spend: prevented via idempotent request_ids and atomic ops. Usage events carry monotonic seq numbers per job to dedupe.
- Clock skew: rely on server-side timestamps for leases; agents include both wall and monotonic times for diagnostics but ledger is source of truth.
- Price changes: price_table is versioned by effective_from; leases record price at time of debit; reconciler uses the recorded rate.
# Reconciliation Pipeline
- Agents send usage_events every t seconds and at job end, with measured seconds and GPU type/count.
- Aggregator rolls up to per-job intervals.
- Reconciler compares rolled-up actual cost vs provisional debits:
adjustment = actual_cost - sum(provisional_debits)
- If positive: create a debit adjustment (rare; occurs when provisional debits undercount actual usage).
- If negative: create a refund credit to user.
- Drift alerting if |adjustment| exceeds a threshold per job or per user.
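The adjustment arithmetic above reduces to a short computation; `reconcile` is an illustrative name for the Reconciler's core step.

```python
# Sketch: compare actual metered cost against the sum of provisional debits.

def reconcile(actual_seconds, gpu_count, price_per_sec, provisional_debits):
    """Return (adjustment, entry_type): positive -> extra debit, negative -> refund."""
    actual_cost = actual_seconds * gpu_count * price_per_sec
    adjustment = actual_cost - sum(provisional_debits)
    return adjustment, ("debit" if adjustment > 0 else "refund")

# Job billed 0.6 upfront plus three 0.2 extensions, but actually ran only 25 s
# at r = 4 * 0.01 = 0.04 credits/s, so actual cost is 1.0 and 0.2 is refunded:
adj, kind = reconcile(25, 4, 0.01, [0.6, 0.2, 0.2, 0.2])
print(round(adj, 2), kind)  # -0.2 refund
```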
# APIs (idempotent)
1. POST /topups {amount, request_id} → {new_balance}
2. GET /balance → {available, holds}
3. POST /leases {job_id, gpu_type, gpu_count, window_seconds, request_id} → {lease_id, rate_per_sec, expires_at}
4. POST /leases/{lease_id}/extend {delta_seconds, request_id} → {expires_at}
5. POST /leases/{lease_id}/close {request_id} → {finalized}
6. POST /usage {job_id, seq, gpu_type, gpu_count, start_ts, end_ts}
7. GET /limits → {concurrency, spend_rate, daily_cap}
All write APIs require request_id for idempotency; return the same result if retried.
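The request_id contract can be sketched with a caching decorator; the in-memory dict here is illustrative only, since a real service must persist the cached result transactionally alongside the ledger entry.

```python
# Sketch of request_id idempotency: a retried write returns the original result
# without re-applying the mutation.

_results: dict = {}  # request_id -> cached response (illustrative storage)

def idempotent(handler):
    def wrapper(request_id, *args, **kwargs):
        if request_id in _results:          # retry: return the original result
            return _results[request_id]
        result = handler(request_id, *args, **kwargs)
        _results[request_id] = result
        return result
    return wrapper

balance = {"u1": 50.0}

@idempotent
def topup(request_id, user_id, amount):
    balance[user_id] += amount
    return {"new_balance": balance[user_id]}

print(topup("req-1", "u1", 10))  # {'new_balance': 60.0}
print(topup("req-1", "u1", 10))  # retried: same result, no double credit
```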
# Scalability
- Shard ledger by user_id; keep one primary row per user to avoid hot-spotting.
- Use append-only ledger_entries and periodic compaction/snapshots for fast balance reads.
- Batch extensions: coalesce per-job debits up to a maximum Δ (e.g., 5–15 s) to reduce write QPS.
- Redis Cluster for rate limiter; key hashing on user_id to ensure locality.
- Kafka partitions by user_id for usage events to maintain order per job.
# Security and Integrity
- Signed lease tokens (JWT) returned to nodes include lease_id, rate, expires_at; agents cannot mint tokens.
- mTLS between agents and control plane.
- Least-privilege service accounts; WAF on public APIs.
# Example Walkthrough
Given: User balance B = 50 credits, p(H100)=0.01, job uses g=4 GPUs.
- r = 4 × 0.01 = 0.04 credits/s.
- L = 15 s, Δ = 5 s heartbeats.
Flow:
1. Admission: concurrency and spend_rate allow. Create lease, deduct 0.04 × 15 = 0.6 credits. New balance: 49.4.
2. After 5 s: extend by 5 s, deduct 0.2 credits → balance 49.2.
3. Repeat until the job ends. If the job runs 8 minutes (480 s): total cost = 0.04 × 480 = 19.2 credits. The ledger shows the initial 0.6 debit plus ~93 extension debits of 0.2 (covering the 465 s beyond the initial window), minus a refund of any unused reserved portion if applicable.
4. If the balance would drop below the required extension amount, the agent gets a NACK; it uses the remaining lease time and then stops. Maximum overspend is bounded by r × (Δ + grace_g).
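The walkthrough's arithmetic can be checked directly from the stated parameters:

```python
# Sketch verifying the walkthrough: r = 0.04 credits/s, L = 15 s lease, 5 s heartbeats.

r, L, delta, runtime = 0.04, 15, 5, 480
initial = r * L                      # 0.6 credits reserved upfront (covers first 15 s)
extensions = (runtime - L) // delta  # 5 s extensions needed beyond the initial window
total = initial + extensions * r * delta
print(extensions, round(total, 2))   # 93 19.2
```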
# Testing and Validation
- Unit tests for atomic operations and idempotency with concurrent requests.
- Fault injection: kill ledger nodes, partition networks, kill agents; verify bounded exposure and reconciliation.
- Property tests for ledger invariants: sum(ledger) == balances, no negative balances, request_id uniqueness.
- Load tests: simulate 10k jobs with Δ=5 s; ensure ledger write QPS and tail latencies meet SLOs.
# Alternatives and Trade-offs
- Postpaid billing: simpler runtime but requires credit risk management; unsuitable if strict prepay is required.
- Longer leases reduce write QPS but increase exposure; choose L based on risk appetite and latency.
- Full central scheduler fairness vs. per-node local fairness: central is more accurate; local is more resilient.
# Summary
Use a strongly consistent credit ledger with short, renewable leases to gate execution and deduct in near real time. Enforce per-user rate limits and fair-share at admission and during execution. Stream usage to an immutable pipeline and reconcile to ensure correctness and auditability. Bound overspend via lease expiry and small on-node escrows, and harden with idempotent APIs, atomic mutations, and robust observability.