Design a GPU Credit Accounting and Scheduling Service (Technical Screen)
Context
You are designing a backend service for an ML platform that runs training and inference on heterogeneous GPUs (e.g., A100, H100). Users and teams purchase credits, which are consumed as their jobs run. The platform must prevent double-spend under concurrency, schedule fairly across users and teams, and handle preemption and failures with partial refunds.
Assume GPU pricing is per GPU-hour and differs by GPU type. Jobs specify resource requirements (GPU type preferences, count, memory) and may be preempted according to policy. The system is multi-tenant, multi-project, and multi-region.
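For concreteness, a submission might look like the sketch below; the field names, priority classes, and prices are illustrative assumptions, not part of the problem statement.

```python
# Hypothetical per-type prices, in credits per GPU-hour.
PRICE_PER_GPU_HOUR = {"A100": 2.0, "H100": 4.5}

job_request = {
    "job_id": "job-123",             # client-generated; doubles as an idempotency key
    "team": "vision",
    "project": "detr",
    "user": "alice",
    "gpu_types": ["H100", "A100"],   # ordered preference
    "gpu_count": 8,
    "min_gpu_memory_gb": 80,
    "max_runtime_hours": 12,         # bounds the credit hold at submission
    "priority_class": "batch",
}

# Worst-case hold: priciest acceptable GPU type x count x max runtime.
max_hold = (max(PRICE_PER_GPU_HOUR[t] for t in job_request["gpu_types"])
            * job_request["gpu_count"] * job_request["max_runtime_hours"])
# 4.5 * 8 * 12 = 432 credits held until the job settles.
```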
Functional Requirements
- Credit lifecycle
  - Issuance (purchases, grants, promotions) and expiration.
  - Balance queries with a breakdown (promotional vs. paid, upcoming expirations).
  - Spend ordering across buckets (e.g., earliest-expiring first; see the bucket sketch after this list).
- Reservation and metering
  - Idempotent reservation at job submission that checks budgets and quotas.
  - Metered consumption while jobs run; commit actual usage and partially refund unused holds on completion, preemption, or failure (see the reservation sketch after this list).
- Budgets and quotas
  - Per-user and per-project budgets; hierarchical limits (team/org → project → user; see the budget-check sketch after this list).
  - Promotional credits with separate policies and expiration.
- Scheduling
  - Place jobs on heterogeneous GPUs based on their requirements and on available quota/credits.
  - Fairness across users and teams; support weights, priority classes, and preemption (see the fair-share sketch after this list).
- Audit and observability
  - Immutable audit trail for all credit and scheduling decisions.
  - Metrics, logs, and traces for SLOs and debugging.
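To make the spend-ordering item concrete, here is a minimal bucket sketch of earliest-expiring-first consumption. The types and names are assumptions; a real service would run this inside a transaction and append each consumed slice to an immutable ledger, which also serves the audit-trail requirement.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class CreditBucket:
    bucket_id: str
    kind: str              # "paid" or "promotional"
    remaining: float       # real ledgers use integer minor units, not floats
    expires_at: datetime

def debit(buckets: list[CreditBucket], amount: float, now: datetime) -> list[tuple[str, float]]:
    """Consume `amount` across unexpired buckets, earliest-expiring first.

    Returns (bucket_id, amount_taken) slices for the audit ledger;
    raises if live credits are insufficient.
    """
    live = sorted((b for b in buckets if b.expires_at > now and b.remaining > 0),
                  key=lambda b: b.expires_at)
    if sum(b.remaining for b in live) < amount:
        raise ValueError("insufficient credits")
    slices = []
    for b in live:
        take = min(b.remaining, amount)
        b.remaining -= take
        amount -= take
        slices.append((b.bucket_id, take))
        if amount <= 0:
            break
    return slices
```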
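The reservation-and-metering item implies a hold/commit/refund lifecycle keyed by an idempotency token. The in-memory reservation sketch below shows only the intended semantics; persistence, metering, and quota checks are elided, and all names are illustrative.

```python
import threading

class ReservationLedger:
    """In-memory sketch of hold -> commit/refund with idempotent retries."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._holds: dict[str, float] = {}   # idempotency key -> held credits
        self._settled: set[str] = set()

    def reserve(self, key: str, max_cost: float, available: float) -> bool:
        """Place a hold at job submission; retries with the same key are no-ops."""
        with self._lock:
            if key in self._holds or key in self._settled:
                return True                   # duplicate request: already applied
            if available < max_cost:
                return False                  # fails the budget/quota check
            self._holds[key] = max_cost
            return True

    def settle(self, key: str, actual_cost: float) -> float:
        """Commit metered usage on completion, preemption, or failure;
        return the partial refund for the unused part of the hold."""
        with self._lock:
            if key in self._settled:
                return 0.0                    # idempotent re-settlement
            held = self._holds.pop(key)
            self._settled.add(key)
            return max(held - actual_cost, 0.0)
```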
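Hierarchical limits can be enforced by walking scopes from outermost to innermost and admitting a spend only if every level stays under its cap. A minimal budget-check sketch, with hypothetical scope keys:

```python
def within_budgets(cost: float, limits: dict[str, float],
                   spent: dict[str, float], scopes: tuple[str, ...]) -> bool:
    """Admit a spend only if every enclosing scope stays under its cap."""
    return all(spent.get(s, 0.0) + cost <= limits.get(s, float("inf"))
               for s in scopes)

limits = {"org:acme": 10_000, "project:detr": 2_000, "user:alice": 500}
spent  = {"org:acme": 9_600,  "project:detr": 300,   "user:alice": 50}
# Rejected by the org-level cap even though project and user have headroom:
within_budgets(450, limits, spent, ("org:acme", "project:detr", "user:alice"))  # False
```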
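For the fairness item, one simple baseline, shown in the fair-share sketch below, orders pending jobs by weighted recent usage so underserved teams go first; priority classes could form an outer sort key. This ignores bin-packing across GPU types and preemption, and the names are illustrative.

```python
def fair_order(pending: list[dict], usage: dict[str, float],
               weights: dict[str, float]) -> list[dict]:
    """Order pending jobs so the team with the lowest weighted usage runs first.

    `usage[team]` is recent GPU-hour consumption and `weights[team]` its fair
    share; higher-weight teams absorb more usage before yielding their turn.
    """
    return sorted(pending, key=lambda job: usage.get(job["team"], 0.0)
                                           / weights.get(job["team"], 1.0))
```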
Non-Functional Requirements
- APIs must be idempotent and concurrency-safe, with rate limits.
- Protect against double-spend under races and retries (see the conditional-debit sketch after this list).
- Clearly state consistency choices (strong vs. eventual) and handle clock skew.
- Sharding and scaling strategies for high throughput.
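A common pattern for the idempotency and double-spend items is to pair an idempotency-keyed ledger insert with a conditional balance update in a single transaction. The conditional-debit sketch below assumes a SQLite-style driver (qmark placeholders, SQLite >= 3.24 for ON CONFLICT ... DO NOTHING) and hypothetical `accounts`/`ledger` tables; it is one option, not the prescribed design. Clock skew is a separate concern, mainly for metering and expiry, where server-side timestamps are usually treated as authoritative.

```python
def debit_once(conn, account_id: str, amount: float, idempotency_key: str) -> None:
    """Apply a debit exactly once, even under concurrent retries."""
    with conn:  # sqlite3: commits on success, rolls back on any exception
        inserted = conn.execute(
            "INSERT INTO ledger (idempotency_key, account_id, amount) "
            "VALUES (?, ?, ?) ON CONFLICT (idempotency_key) DO NOTHING",
            (idempotency_key, account_id, amount))
        if inserted.rowcount == 0:
            return  # retry of an already-applied debit: no-op
        updated = conn.execute(
            "UPDATE accounts SET balance = balance - ? "
            "WHERE account_id = ? AND balance >= ?",
            (amount, account_id, amount))
        if updated.rowcount == 0:
            raise ValueError("insufficient balance")  # whole txn rolls back
```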
Deliverables
Provide:
- Architecture overview (components and data flow).
- Data schemas and key data structures.
- API design and idempotency model.
- Scheduling algorithm and preemption policies.
- Consistency model and concurrency control (including double-spend protection and clock-skew handling).
- Sharding and scaling strategy.
- Observability plan.
- A test plan that exercises edge cases and surfaces unspecified requirements.