Design a GPU-Efficient Video Service
Company: OpenAI
Role: Software Engineer
Category: ML System Design
Difficulty: medium
Interview Round: Technical Screen
Design a text-to-video generation platform similar to a modern generative video product. Treat the actual model inference on GPUs as a black box: a job enters a GPU worker and eventually produces a video.
Focus on the serving platform rather than model internals. The main requirements are:
- Users submit prompts and generation parameters, receive a job ID, and later fetch the result.
- GPU capacity is fixed or slow to scale, so the system cannot rely on instant autoscaling.
- Traffic is bursty.
- The system should maximize GPU utilization while still providing a predictable user experience.
- The design should cover queueing, scheduling, admission control, prioritization, storage, failure handling, and observability.
Explain how you would design the APIs, control plane, worker architecture, scheduling strategy, and overload behavior.
Quick Answer: This question evaluates your ability to design a GPU-constrained, production-grade ML serving platform, with emphasis on resource management, job scheduling and prioritization, admission control, durable result storage, failure handling, and observability in a distributed system.