This question evaluates system design, distributed systems, and resource-accounting skills focused on concurrency control, idempotent APIs, billing/credit models, and scheduler design for heterogeneous GPUs in multi-tenant ML platforms.
Design a GPU credit accounting and scheduling service for an ML platform. Users purchase credits, submit training/inference jobs, and consume credits while jobs run. Requirements: credit issuance, balance queries, reservation at submission, metered consumption during execution, partial refunds on preemption/failure, expiration and promotional credits, per-user and per-project budgets, and audit trails. The API must be idempotent and concurrency-safe, with rate limits and protection against double-spend under races. The scheduler should place jobs on heterogeneous GPUs (e.g., A100/H100) based on resource requirements and available quota, supporting fairness across users/teams and preemption policies. Describe schemas and data structures, consistency choices (strong vs. eventual), handling of clock skew, sharding and scaling strategies, and observability. Outline a test plan that captures edge cases and uncovers unspecified requirements.
hard | Software Engineer | Technical Screen | ML System Design
Design a GPU Credit Accounting and Scheduling Service (Technical Screen)
You are designing a backend service for an ML platform that runs training and inference jobs on heterogeneous GPUs (e.g., A100, H100). Users and teams purchase credits and consume them while their jobs run. Design the system end to end: the credit ledger, the reservation/metering flow, and the scheduler that places jobs on GPUs.
The system is multi-tenant, multi-project, and multi-region, and must:
Prevent double-spend under concurrency, retries, and races.
Schedule fairly across users and teams.
Handle preemption and failures with correct partial refunds.
Constraints & Assumptions
Anchor the design to these. Where a number is not given, state the assumption you make and design to it.
GPU pricing is per GPU-hour and differs by GPU type.
Jobs specify resource requirements: GPU-type preferences (ordered), GPU count, and a memory floor.
Jobs may be preempted according to policy; some jobs are non-preemptible.
Suggested sizing to design against (adjust and justify if you prefer different numbers): tens of thousands of accounts, low-thousands of concurrently running jobs, and per-job metering heartbeats on the order of every 30–60 s. The metering write path is therefore the highest-QPS mutation, while reservation and settlement are lower QPS but must be strictly correct.
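To make the sizing concrete, the heartbeat write rate can be estimated with simple arithmetic. The sketch below assumes 3,000 concurrently running jobs as one reading of "low-thousands"; that figure and the function name are illustrative assumptions, not numbers given by the question.

```python
def heartbeat_writes_per_second(running_jobs: int, interval_s: float) -> float:
    """Steady-state metering writes per second, one heartbeat per job per interval."""
    return running_jobs / interval_s


# Assumed load: 3,000 running jobs (an illustrative reading of "low-thousands").
slow = heartbeat_writes_per_second(3_000, 60)  # heartbeat every 60 s
fast = heartbeat_writes_per_second(3_000, 30)  # heartbeat every 30 s
print(slow, fast)
```

Under these assumptions the metering path sees roughly 50 to 100 writes per second, which is modest in absolute terms but continuous, and it grows linearly with the number of running jobs.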
Functional Requirements
1. Credit lifecycle
Issuance (purchases, grants, promotions) and expiration.
Balance queries with a breakdown (promotional vs. paid, upcoming expirations).
Spend ordering across credit buckets (e.g., earliest-expiring first).
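One possible shape for the spend-ordering rule is sketched below. It assumes credits are held in buckets with an integer balance and an expiry timestamp; the `Bucket` fields and the `plan_spend` helper are hypothetical names chosen for illustration, and earliest-expiring-first is only the example policy the requirement mentions.

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    bucket_id: str
    kind: str        # "promo" or "paid" (illustrative labels)
    remaining: int   # credits in integer minor units, to avoid float drift
    expires_at: int  # epoch seconds; a far-future value models "never expires"


def plan_spend(buckets, amount, now):
    """Return [(bucket_id, take)] drawing earliest-expiring first, or None if short."""
    live = [b for b in buckets if b.expires_at > now and b.remaining > 0]
    # Tie-break on bucket_id so the plan is deterministic across retries.
    live.sort(key=lambda b: (b.expires_at, b.bucket_id))
    plan, need = [], amount
    for b in live:
        if need == 0:
            break
        take = min(b.remaining, need)
        plan.append((b.bucket_id, take))
        need -= take
    return plan if need == 0 else None


buckets = [
    Bucket("paid-1", "paid", 100, 10_000),
    Bucket("promo-1", "promo", 30, 500),
    Bucket("promo-old", "promo", 50, 50),  # already expired at now=100
]
plan = plan_spend(buckets, 60, now=100)
```

Recording the per-bucket amounts in the plan is what later allows a refund to be returned to the buckets it was drawn from, if that is the policy the interviewer confirms.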
2. Reservation and metering
Idempotent reservation at job submission that checks budgets and quotas.
Metered consumption while a job runs: commit actual usage, and partially refund the unused hold on completion, preemption, or failure.
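The reserve, commit, and refund flow can be sketched as a small in-memory ledger. This is a minimal sketch under stated assumptions: a single process, one lock standing in for a database transaction, and caller-supplied idempotency keys. The `Ledger` class and its method names are hypothetical, not an API the question defines.

```python
import threading


class Ledger:
    """Single-account ledger; the lock stands in for a serializable transaction."""

    def __init__(self, balance):
        self._lock = threading.Lock()
        self.available = balance
        self.holds = {}   # job_id -> credits still held
        self.spent = 0
        self._seen = {}   # idempotency key -> result of the first attempt

    def _once(self, key, fn):
        # Replay the stored result for a repeated key instead of re-applying it.
        with self._lock:
            if key not in self._seen:
                self._seen[key] = fn()
            return self._seen[key]

    def reserve(self, key, job_id, amount):
        def do():
            if amount > self.available:
                return "rejected"
            self.available -= amount
            self.holds[job_id] = amount
            return "held"
        return self._once(key, do)

    def commit_usage(self, key, job_id, used):
        def do():
            held = self.holds.get(job_id, 0)
            take = min(used, held)  # never commit more than is held
            if job_id in self.holds:
                self.holds[job_id] = held - take
            self.spent += take
            return take
        return self._once(key, do)

    def release(self, key, job_id):
        def do():
            refund = self.holds.pop(job_id, 0)
            self.available += refund
            return refund
        return self._once(key, do)


ledger = Ledger(100)
first = ledger.reserve("submit-1", "job-a", 80)
retry = ledger.reserve("submit-1", "job-a", 80)  # client retry, same key
```

Because the check and the debit happen under the same lock, two concurrent reservations cannot both pass the balance check, and a retried call with the same key returns the original outcome without moving credits twice. Note that this sketch also caches a rejection, so a client that tops up its balance would need a new key to retry.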
3. Budgets and quotas
Per-user and per-project budgets, with hierarchical limits (team/org → project → user).
Promotional credits with separate policies and expiration.
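A hierarchical limit check might look like the sketch below. It assumes hard limits (reject on breach) and an all-or-nothing update across the org, project, and user levels; the node naming scheme and the `try_charge` helper are illustrative assumptions. In a real system the check and the increments would need to run inside one transaction.

```python
def try_charge(limits, used, path, amount):
    """Charge `amount` at every level of `path`, or charge nothing.

    Returns None on success, or the first node whose limit would be breached.
    """
    for node in path:
        if used.get(node, 0) + amount > limits[node]:
            return node
    for node in path:
        used[node] = used.get(node, 0) + amount
    return None


limits = {"org:acme": 1000, "project:nlp": 400, "user:ana": 150}
used = {}
path = ["org:acme", "project:nlp", "user:ana"]

ok = try_charge(limits, used, path, 100)       # fits at every level
blocked = try_charge(limits, used, path, 100)  # user limit of 150 would be exceeded
```

Checking every level before mutating any of them keeps the counters consistent: a rejected charge leaves the org and project totals untouched.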
4. Scheduling
Place jobs on heterogeneous GPUs based on their requirements and available quota/credits.
Fairness across users/teams, with support for weights/priority classes and preemption.
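The two scheduling decisions above, which tenant goes next and which GPU type a job lands on, can be sketched separately. The code below assumes weighted fair share (lowest usage divided by weight goes first) as the fairness policy and a first-fit walk over the job's ordered GPU preferences; both are one option among several. The function names, the tenant names, and the memory figures passed in are illustrative inputs, not values from the question.

```python
def pick_gpu_type(prefs, gpu_count, min_mem_gb, free, mem_gb):
    """First GPU type in the job's ordered preferences that fits, else None."""
    for gpu_type in prefs:
        if mem_gb[gpu_type] >= min_mem_gb and free.get(gpu_type, 0) >= gpu_count:
            return gpu_type
    return None


def next_tenant(usage, weights, pending):
    """Tenant with pending jobs and the lowest weighted usage (usage / weight)."""
    candidates = [t for t, jobs in pending.items() if jobs]
    if not candidates:
        return None
    # Tie-break on tenant name so the choice is deterministic.
    return min(candidates, key=lambda t: (usage.get(t, 0.0) / weights[t], t))


# Illustrative inventory: per-GPU memory is supplied by the caller.
mem_gb = {"A100": 40, "H100": 80}
free = {"H100": 2, "A100": 8}

placed = pick_gpu_type(["H100", "A100"], 4, 40, free, mem_gb)
unplaceable = pick_gpu_type(["H100", "A100"], 4, 80, free, mem_gb)

usage = {"team-a": 10.0, "team-b": 10.0}   # GPU-hours consumed so far
weights = {"team-a": 1.0, "team-b": 2.0}
pending = {"team-a": ["job-1"], "team-b": ["job-2"]}
chosen = next_tenant(usage, weights, pending)
```

Here the job that prefers H100 falls back to A100 because only two H100s are free, and `team-b` is served first because its weight halves its effective usage. Preemption would sit on top of this: when `pick_gpu_type` returns None for a high-priority job, the scheduler would look for preemptible victims rather than queueing.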
5. Audit and observability
An immutable audit trail for all credit and scheduling decisions.
Metrics, logs, and traces for SLOs and debugging.
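One way to make an append-only audit trail tamper-evident is to chain entries by hash, so that editing any past event invalidates every later hash. The sketch below uses SHA-256 from the Python standard library; the entry layout and the `append_entry` and `verify` helpers are assumptions for illustration, and hash chaining is a technique chosen here rather than one the question prescribes.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash, event):
    # Canonical JSON so the same event always hashes to the same bytes.
    body = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(log, event):
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"prev": prev, "event": event, "hash": _digest(prev, event)})


def verify(log):
    prev = GENESIS
    for entry in log:
        if entry["prev"] != prev or entry["hash"] != _digest(prev, entry["event"]):
            return False
        prev = entry["hash"]
    return True


log = []
append_entry(log, {"type": "reserve", "job": "job-a", "amount": 80})
append_entry(log, {"type": "refund", "job": "job-a", "amount": 50})
intact = verify(log)

log[0]["event"]["amount"] = 1  # simulate tampering with history
tampered_ok = verify(log)
```

The chain only detects tampering; preventing it still depends on storage controls such as write-once storage or restricted write access.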
Non-Functional Requirements
APIs must be idempotent and concurrency-safe, with rate limits.
Protect against double-spend under races and retries.
State your consistency choices explicitly (strong vs. eventual) and handle clock skew.
Describe sharding/scaling strategies for high throughput.
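For clock skew, one common approach is to bill from the metering service's own receive time rather than from timestamps reported by workers, and to order heartbeats by a per-job sequence number. The sketch below assumes that approach, plus a cap on how much time a single heartbeat may bill; the `Meter` class and the 120-second cap are hypothetical choices for illustration.

```python
class Meter:
    """Per-job meter driven by server receive time and heartbeat sequence numbers."""

    def __init__(self, start_server_ts, max_gap_s=120.0):
        self.last_seq = 0
        self.last_ts = start_server_ts
        self.max_gap_s = max_gap_s  # assumed cap on billable time per heartbeat
        self.billed_s = 0.0

    def heartbeat(self, seq, server_ts):
        """Return the seconds billed by this heartbeat."""
        if seq <= self.last_seq:
            return 0.0  # duplicate or reordered delivery: bill nothing
        elapsed = max(0.0, server_ts - self.last_ts)  # tolerate a backward step
        elapsed = min(elapsed, self.max_gap_s)        # do not bill long silences
        self.last_seq = seq
        self.last_ts = max(self.last_ts, server_ts)
        self.billed_s += elapsed
        return elapsed


meter = Meter(start_server_ts=1000.0)
a = meter.heartbeat(1, 1030.0)  # normal 30 s interval
b = meter.heartbeat(1, 1031.0)  # retry of the same heartbeat
c = meter.heartbeat(2, 1025.0)  # receive time stepped backward
d = meter.heartbeat(3, 1400.0)  # long gap, capped
```

The worker's clock never enters the calculation, so skew between workers cannot inflate or deflate a bill. Whether a long silent gap should be billed in full, capped, or treated as a failure is a policy question to raise with the interviewer.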
Clarifying Questions to Ask
A strong candidate scopes the problem before designing. Good questions to raise with the interviewer:
What is the read:write split, and which path is hottest — balance reads, reservations, or metering heartbeats? (This decides what to optimize and where eventual consistency is acceptable.)
How strict are the hierarchical budgets: are org/project/user limits hard (reject on breach) or soft (alert only), and may a brief overshoot be tolerated for the largest orgs?
When a job is placed on a non-preferred GPU type (e.g., an H100-preferring job lands on an A100), which type's price applies, and is the price fixed at start or allowed to change mid-run?
On completion, preemption, or failure, where does the unused hold go: back to the exact buckets it was drawn from (preserving source and expiry), or into a fresh balance?
What is the preemption contract — is there a checkpoint grace period before a job is killed, and are non-preemptible jobs ever reclaimed for capacity (vs. only stopped when out of credits)?
Where does an account's money "live" relative to where its jobs run — single home region, or can any region both spend and place?
Deliverables
Address each of the following. Treat each as a part of the design; the hints below point toward an approach without giving the answer.
1. Architecture overview — components and data flow.
2. Data schemas and key data structures.
3. API design and idempotency model.
4. Scheduling algorithm and preemption policies.
5. Consistency model and concurrency control — including double-spend protection and clock-skew handling.
6. Sharding and scaling strategy.
7. Observability plan — metrics, logs, traces, and the audit trail.
8. Test plan that exercises edge cases and surfaces unspecified requirements.
Follow-up Questions
Be ready for deeper probes after the main design:
How does the design change at 100× scale (e.g., 1M+ accounts, tens of thousands of concurrent jobs)? What breaks first: the metering path, the org-budget counter, or the scheduler leader?
Walk through the exact sequence of ledger/state changes for one job that is preempted at 60% of its hold, and show that the refund and the audit trail stay consistent.
A finish call is lost (caller crashes after the job stops). How are the held credits eventually returned, and what stops them from being stranded or double-refunded?
A promotional credit bucket expires while a job is mid-run against it. What happens to the in-flight hold, and to any refund that lands after the expiry instant?
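As a starting point for the preemption walk-through, the ledger movements can be written as paired entries that always sum to zero. The sketch below is one possible representation, assuming three logical accounts (available, held, spent) and integer credit units; the account names and the `preempt_at_fraction` helper are hypothetical.

```python
def preempt_at_fraction(balance, hold, used_pct):
    """Ledger entries for reserve, commit of used_pct% of the hold, then refund."""
    used = hold * used_pct // 100
    refund = hold - used
    entries = [
        ("reserve", "available", -hold),
        ("reserve", "held", +hold),
        ("commit", "held", -used),
        ("commit", "spent", +used),
        ("refund", "held", -refund),
        ("refund", "available", +refund),
    ]
    totals = {"available": balance, "held": 0, "spent": 0}
    for _step, account, delta in entries:
        totals[account] += delta
    return entries, totals


entries, totals = preempt_at_fraction(balance=1000, hold=500, used_pct=60)
net = sum(delta for _step, _account, delta in entries)
```

With a balance of 1,000 and a hold of 500, preemption at 60% leaves 300 spent, 700 available, and nothing held. Because every step is a balanced pair, the entries can double as the audit records, and the sum of all deltas gives a cheap invariant to assert in tests.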