Design a GPU Credit Accounting and Scheduling Service (Technical Screen)
You are designing a backend service for an ML platform that runs training and inference jobs on heterogeneous GPUs (e.g., A100, H100). Users and teams purchase credits and consume them while their jobs run. Design the system end to end: the credit ledger, the reservation/metering flow, and the scheduler that places jobs on GPUs.
The system is multi-tenant, multi-project, and multi-region, and must:
- Prevent double-spend under concurrency, retries, and races.
- Schedule fairly across users and teams.
- Handle preemption and failures with correct partial refunds.
Assumptions
- GPU pricing is per GPU-hour and differs by GPU type.
- Jobs specify resource requirements: GPU-type preferences, GPU count, and memory.
- Jobs may be preempted according to policy.
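To make the per-GPU-hour pricing assumption concrete, here is a minimal sketch of how a charge might be computed. The price table, the millicredit unit, and the round-up rule are illustrative assumptions, not part of the problem statement.

```python
# Hypothetical price table: millicredits per GPU-hour, keyed by GPU type.
# Integer units avoid floating-point drift in the ledger.
PRICE_MILLICREDITS_PER_GPU_HOUR = {"A100": 2_000, "H100": 4_500}


def usage_cost(gpu_type: str, gpu_count: int, seconds: int) -> int:
    """Return the cost in millicredits for running `gpu_count` GPUs for `seconds`.

    Assumption: partial millicredits are rounded up so the platform never
    undercharges; a real design would state its rounding rule explicitly.
    """
    if gpu_count < 0 or seconds < 0:
        raise ValueError("gpu_count and seconds must be non-negative")
    rate = PRICE_MILLICREDITS_PER_GPU_HOUR[gpu_type]
    total = rate * gpu_count * seconds  # millicredit-seconds per hour
    return -(-total // 3600)            # ceiling division by seconds per hour
```

The same function could size the initial hold (using the job's maximum runtime) and the final charge (using metered runtime), which keeps the two numbers comparable at refund time.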
Functional Requirements
1. Credit lifecycle
- Issuance (purchases, grants, promotions) and expiration.
- Balance queries with a breakdown (promotional vs. paid, upcoming expirations).
- Spend ordering across credit buckets (e.g., earliest-expiring first).
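The earliest-expiring-first spend ordering mentioned above could be sketched as follows. The `Bucket` fields, the promo-before-paid tie-break, and the `plan_spend` name are hypothetical choices made for illustration; other orderings are equally defensible.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Bucket:
    bucket_id: str
    kind: str        # "promo" or "paid" (hypothetical labels)
    remaining: int   # credits left in this bucket
    expires_at: int  # expiry as epoch seconds


def plan_spend(buckets: List[Bucket], amount: int, now: int) -> List[Tuple[str, int]]:
    """Return (bucket_id, debit) pairs that cover `amount`.

    Expired and empty buckets are skipped. Buckets are drained
    earliest-expiring first; on equal expiry, promo is used before paid
    (an assumption), then bucket_id breaks ties deterministically.
    """
    live = [b for b in buckets if b.remaining > 0 and b.expires_at > now]
    live.sort(key=lambda b: (b.expires_at, b.kind != "promo", b.bucket_id))

    plan: List[Tuple[str, int]] = []
    left = amount
    for bucket in live:
        if left == 0:
            break
        take = min(bucket.remaining, left)
        plan.append((bucket.bucket_id, take))
        left -= take
    if left > 0:
        raise ValueError("insufficient credits")
    return plan
```

Returning a plan rather than mutating the buckets lets the caller apply all debits in one transaction, so a failure partway through cannot leave some buckets debited and others not.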
2. Reservation and metering
- Idempotent reservation at job submission that checks budgets and quotas.
- Metered consumption while a job runs: commit actual usage, and partially refund the unused hold on completion, preemption, or failure.
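One possible shape for the reserve/commit/refund flow is shown below as a single-process sketch. The `Ledger` class, its method names, and the use of an in-memory lock are assumptions for illustration; a real service would more likely rely on a database transaction or conditional write keyed by the idempotency key.

```python
import threading
from typing import Dict, Tuple


class Ledger:
    """In-memory stand-in for one account's credit ledger (hypothetical)."""

    def __init__(self, balance: int) -> None:
        self._lock = threading.Lock()
        self.available = balance
        self.holds: Dict[str, int] = {}                 # key -> open hold amount
        self.settled: Dict[str, Tuple[int, int]] = {}   # key -> (charged, refunded)

    def reserve(self, key: str, amount: int) -> bool:
        """Place a hold. Replaying a key that already succeeded is a no-op."""
        with self._lock:
            if key in self.holds or key in self.settled:
                return True
            if amount > self.available:
                return False
            self.available -= amount
            self.holds[key] = amount
            return True

    def settle(self, key: str, actual: int) -> Tuple[int, int]:
        """Commit actual usage and refund the rest of the hold.

        Raises KeyError if no hold exists for `key`. The charge is capped
        at the hold (an assumption); replays return the first result.
        """
        with self._lock:
            if key in self.settled:
                return self.settled[key]
            held = self.holds.pop(key)
            charged = min(actual, held)
            refunded = held - charged
            self.available += refunded
            self.settled[key] = (charged, refunded)
            return self.settled[key]
```

Because both operations check the idempotency key before touching the balance, a retried submission or a duplicated completion event leaves `available` unchanged, which is the property the double-spend requirement asks for.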
3. Budgets and quotas
- Per-user and per-project budgets, with hierarchical limits (team/org → project → user).
- Promotional credits with separate policies and expiration.
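A hierarchical limit check along the team/org → project → user path might look like the sketch below. The node naming scheme (`"org:acme"` and so on) and the `check_and_charge` function are hypothetical; the point is that every level is validated before any level is charged.

```python
from typing import Dict, List, Optional


def check_and_charge(
    limits: Dict[str, int],
    spent: Dict[str, int],
    path: List[str],
    amount: int,
) -> Optional[str]:
    """Charge `amount` against every node on `path`, or charge nothing.

    `path` runs from the top of the hierarchy to the user, for example
    ["org:acme", "project:vision", "user:ana"] (hypothetical names).
    Returns the first node whose limit would be exceeded, or None on success.
    """
    # Pass 1: validate all levels before mutating anything.
    for node in path:
        if spent.get(node, 0) + amount > limits[node]:
            return node
    # Pass 2: every level has headroom, so apply the charge everywhere.
    for node in path:
        spent[node] = spent.get(node, 0) + amount
    return None
```

In a concurrent service the two passes would need to run inside one transaction or under one lock; otherwise two requests could both pass validation and jointly exceed a limit.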
4. Scheduling
- Place jobs on heterogeneous GPUs based on their requirements and available quota/credits.
- Fairness across users/teams, with support for weights/priority classes and preemption.
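As one illustration of weighted fairness on heterogeneous GPUs, the sketch below picks the next job from the tenant with the lowest usage-to-weight ratio among jobs that currently fit. This is a simple weighted fair-share heuristic chosen for brevity, not a prescribed answer; the `Job` fields and `pick_next` name are hypothetical, and preemption is deliberately left out.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Job:
    job_id: str
    tenant: str
    gpu_types: List[str]  # acceptable GPU types, most preferred first
    gpu_count: int


def pick_next(
    pending: List[Job],
    free: Dict[str, int],
    usage: Dict[str, float],
    weights: Dict[str, float],
) -> Optional[Tuple[Job, str]]:
    """Return (job, gpu_type) for the next placement, or None if nothing fits.

    `usage` is each tenant's consumed GPU-hours and `weights` its fair-share
    weight (default 1.0). Lower usage/weight wins; queue order breaks ties.
    """
    best = None
    for position, job in enumerate(pending):
        # First preferred GPU type with enough free devices, if any.
        gpu_type = next(
            (t for t in job.gpu_types if free.get(t, 0) >= job.gpu_count), None
        )
        if gpu_type is None:
            continue
        share = usage.get(job.tenant, 0.0) / weights.get(job.tenant, 1.0)
        key = (share, position)
        if best is None or key < best[0]:
            best = (key, job, gpu_type)
    return None if best is None else (best[1], best[2])
```

A fuller answer would also consult the credit hold and quota checks before placement, and would define when a lower-priority running job may be preempted to make room.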
5. Audit and observability
- An immutable audit trail for all credit and scheduling decisions.
- Metrics, logs, and traces for SLOs and debugging.
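One common way to approximate an immutable audit trail is an append-only log in which each entry carries a hash of its predecessor. The sketch below assumes that approach; the entry layout and function names are hypothetical, and a hash chain makes tampering detectable rather than impossible.

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, event: Dict[str, Any]) -> str:
    # Canonical JSON so the same event always hashes to the same value.
    body = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(log: List[Dict[str, Any]], event: Dict[str, Any]) -> None:
    """Append `event`, linking it to the hash of the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"prev": prev_hash, "event": event, "hash": _digest(prev_hash, event)})


def verify(log: List[Dict[str, Any]]) -> bool:
    """Recompute the chain; any edited, removed, or reordered entry breaks it."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash or entry["hash"] != _digest(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True
```

Both credit movements (reserve, commit, refund) and scheduling decisions (place, preempt) could be recorded as events in the same chain, or in separate chains per account.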
Non-Functional Requirements
- APIs must be idempotent and concurrency-safe, with rate limits.
- Protect against double-spend under races and retries.
- State your consistency choices explicitly (strong vs. eventual) and handle clock skew.
- Describe sharding/scaling strategies for high throughput.
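For the sharding requirement, a simple starting point is to route every ledger operation for an account to a shard chosen by a stable hash of the account ID, so that all of one account's balance updates are serialized in one place. The function name and the choice of account ID as the shard key are assumptions for illustration.

```python
import hashlib


def shard_for(account_id: str, num_shards: int) -> int:
    """Map an account ID to a shard index in [0, num_shards).

    Uses SHA-256 rather than Python's built-in hash(), which is salted per
    process for strings and therefore not stable across service instances.
    """
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    digest = hashlib.sha256(account_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

Plain modulo placement remaps most keys when `num_shards` changes; consistent hashing or a lookup table of key ranges are the usual alternatives when resharding must be cheap.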
Deliverables
Address each of the following:
- Architecture overview: components and data flow.
- Data schemas and key data structures.
- API design and idempotency model.
- Scheduling algorithm and preemption policies.
- Consistency model and concurrency control, including double-spend protection and clock-skew handling.
- Sharding and scaling strategy.
- Observability plan.
- A test plan that exercises edge cases and surfaces unspecified requirements.