Design a GPU credit system and scheduler
Company: OpenAI
Role: Software Engineer
Category: ML System Design
Difficulty: hard
Interview Round: Technical Screen
Design a GPU credit accounting and scheduling service for an ML platform. Users purchase credits, submit training/inference jobs, and consume credits while jobs run. Requirements: credit issuance, balance queries, reservation at submission, metered consumption during execution, partial refunds on preemption/failure, expiration and promotional credits, per-user and per-project budgets, and audit trails. The API must be idempotent and concurrency-safe, with rate limits and protection against double-spend under races. The scheduler should place jobs on heterogeneous GPUs (e.g., A100/H100) based on resource requirements and available quota, supporting fairness across users/teams and preemption policies. Describe schemas and data structures, consistency choices (strong vs. eventual), handling of clock skew, sharding and scaling strategies, and observability. Outline a test plan that captures edge cases and uncovers unspecified requirements.
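The accounting half of the question (reservation at submission, metered settlement with partial refund, and idempotent retries that cannot double-spend) can be sketched as follows. This is a minimal single-process illustration: the class and method names (`CreditLedger`, `reserve`, `settle`) are invented for this sketch, and the process-wide lock stands in for what a production system would do with a transactional database (serializable transactions or conditional writes).

```python
import threading
import uuid
from dataclasses import dataclass


@dataclass
class Account:
    balance: int       # total credits owned
    reserved: int = 0  # credits held by in-flight reservations


class CreditLedger:
    """Sketch of an idempotent, concurrency-safe credit ledger.
    A single lock models the atomicity a real system would get from
    DB transactions; idempotency keys make retries safe."""

    def __init__(self):
        self._lock = threading.Lock()
        self._accounts = {}      # user -> Account
        self._reservations = {}  # reservation_id -> (user, amount held)
        self._seen_keys = {}     # idempotency_key -> reservation_id

    def deposit(self, user, amount):
        with self._lock:
            acct = self._accounts.setdefault(user, Account(0))
            acct.balance += amount

    def reserve(self, user, amount, idempotency_key):
        """Hold credits at job submission; retrying with the same key
        returns the original reservation instead of spending twice."""
        with self._lock:
            if idempotency_key in self._seen_keys:
                return self._seen_keys[idempotency_key]
            acct = self._accounts.get(user)
            if acct is None or acct.balance - acct.reserved < amount:
                raise ValueError("insufficient credits")
            acct.reserved += amount
            rid = str(uuid.uuid4())
            self._reservations[rid] = (user, amount)
            self._seen_keys[idempotency_key] = rid
            return rid

    def settle(self, reservation_id, consumed):
        """Charge metered usage against the hold; the unused remainder
        is released automatically (partial refund on preemption)."""
        with self._lock:
            user, held = self._reservations.pop(reservation_id)
            consumed = min(consumed, held)
            acct = self._accounts[user]
            acct.reserved -= held
            acct.balance -= consumed
```

For example, a user with 100 credits who reserves 60 for a job cannot reserve another 60 concurrently, and a job preempted after consuming only 40 settles for 40, returning the remaining 20 of the hold to the available balance.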
Quick Answer: This question evaluates system design, distributed systems, and resource-accounting skills, with a focus on concurrency control, idempotent APIs, billing/credit models, and scheduler design for heterogeneous GPUs in multi-tenant ML platforms.
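The scheduler half can be sketched as a fair-share ordering over a heterogeneous GPU pool: pending jobs are considered in order of their owner's accumulated usage, so lighter users go first when capacity frees up. All names here (`FairScheduler`, `submit`, `release`, `schedule`) are invented for the sketch; real schedulers would decay accumulated usage over time and add preemption, which this omits.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Job:
    job_id: str
    user: str
    gpu_type: str  # e.g. "A100" or "H100"
    num_gpus: int


class FairScheduler:
    """Sketch of least-accumulated-usage-first scheduling across a
    heterogeneous GPU pool. Only the fair-share ordering is shown;
    usage decay, priorities, and preemption are left out."""

    def __init__(self, capacity):
        self.free = dict(capacity)     # gpu_type -> free GPU count
        self.usage = defaultdict(int)  # user -> GPUs granted so far
        self.pending = []

    def submit(self, job):
        self.pending.append(job)

    def release(self, gpu_type, num_gpus):
        """Return GPUs to the pool when a job finishes or is preempted."""
        self.free[gpu_type] += num_gpus

    def schedule(self):
        """Greedy placement pass: users with the least accumulated
        usage are considered first; unplaceable jobs stay pending."""
        self.pending.sort(key=lambda j: self.usage[j.user])
        placed, remaining = [], []
        for job in self.pending:
            if self.free.get(job.gpu_type, 0) >= job.num_gpus:
                self.free[job.gpu_type] -= job.num_gpus
                self.usage[job.user] += job.num_gpus
                placed.append(job.job_id)
            else:
                remaining.append(job)
        self.pending = remaining
        return placed
```

With two A100s and two users, once the first user's job has been granted GPUs, a freed slot goes to the other user's waiting job ahead of the first user's second job, because the other user has less accumulated usage.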