Design a Scalable Job Scheduler
Company: Airbnb
Role: Software Engineer
Category: System Design
Interview Round: Onsite
Design an internal job scheduling platform for a large company.
The platform should allow internal services and engineers to submit jobs that run at a specified time or on a recurring schedule. The system should scale to roughly 10 million scheduled jobs, execute jobs reliably, and give users and on-call engineers operational visibility into job state.
Address the following:
- Functional requirements: create, update, cancel, and inspect jobs; support one-time and recurring jobs; dispatch ready jobs to workers; retry failed jobs.
- Non-functional requirements: high availability, horizontal scalability, low scheduling latency, durability, idempotency, and observability.
- API design, data model, scheduler architecture, worker/executor design, failure handling, and monitoring.
- Explain how the system avoids double execution while still recovering from crashes.
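One common way to approach the double-execution point is a lease with a fencing token: a worker claims a due job with a conditional update, and a crashed worker's claim becomes reclaimable once its lease expires. The sketch below is illustrative only. The table layout, the function names (`make_store`, `claim_due_job`, `complete_job`), and the 30-second lease are assumptions, and SQLite stands in for whatever durable store the real design would use.

```python
import sqlite3

def make_store():
    """Create an in-memory job table (hypothetical schema for illustration)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE jobs (
               id TEXT PRIMARY KEY,
               run_at REAL NOT NULL,
               status TEXT NOT NULL DEFAULT 'pending',
               lease_owner TEXT,
               lease_expires_at REAL,
               attempt INTEGER NOT NULL DEFAULT 0
           )"""
    )
    return conn

# A job is claimable if it is pending, or running under an expired lease.
CLAIMABLE = "(status = 'pending' OR (status = 'running' AND lease_expires_at <= ?))"

def claim_due_job(conn, worker_id, now, lease_seconds=30.0):
    """Try to lease one due job. Returns (job_id, attempt) or None."""
    row = conn.execute(
        f"SELECT id, attempt FROM jobs WHERE run_at <= ? AND {CLAIMABLE} "
        "ORDER BY run_at LIMIT 1",
        (now, now),
    ).fetchone()
    if row is None:
        return None
    job_id, attempt = row
    # Conditional update: succeeds only if nobody else claimed the job
    # between the SELECT and this statement (attempt acts as a version).
    cur = conn.execute(
        "UPDATE jobs SET status = 'running', lease_owner = ?, "
        "lease_expires_at = ?, attempt = attempt + 1 "
        f"WHERE id = ? AND attempt = ? AND {CLAIMABLE}",
        (worker_id, now + lease_seconds, job_id, attempt, now),
    )
    conn.commit()
    return (job_id, attempt + 1) if cur.rowcount == 1 else None

def complete_job(conn, job_id, worker_id, attempt):
    """Mark success only if the caller still holds the current attempt."""
    cur = conn.execute(
        "UPDATE jobs SET status = 'succeeded', lease_owner = NULL, "
        "lease_expires_at = NULL "
        "WHERE id = ? AND lease_owner = ? AND attempt = ? AND status = 'running'",
        (job_id, worker_id, attempt),
    )
    conn.commit()
    return cur.rowcount == 1
```

Under these assumptions the scheduler gives at-least-once execution: a slow worker whose lease expired may still have run the job body, but its late completion is rejected because its attempt number is stale. Handlers would still need to be idempotent for the overall effect to be exactly-once.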
Quick Answer: This question evaluates system design and distributed systems competencies, focusing on scheduling semantics, fault tolerance, idempotency, API and data modeling, horizontal scalability, and operational observability for large-scale job execution.
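For the idempotency and recurring-schedule aspects, one possible sketch is to derive a deterministic key from the job ID and the scheduled fire time, so that a scheduler that restarts and re-scans the same window cannot enqueue the same occurrence twice. Everything here is hypothetical: the names `run_key`, `due_occurrences`, and `RunLedger` are invented for illustration, the schedule is a simple fixed interval rather than cron, timestamps are assumed to be timezone-aware, and the in-memory ledger stands in for a unique constraint in a durable store.

```python
import hashlib
from datetime import datetime, timedelta, timezone

def run_key(job_id, scheduled_for):
    """Deterministic key for one occurrence of a job (aware datetimes only)."""
    instant = scheduled_for.astimezone(timezone.utc).isoformat()
    return hashlib.sha256(f"{job_id}|{instant}".encode()).hexdigest()

def due_occurrences(first_run, interval, after, until):
    """Fire times t with after < t <= until for a fixed-interval schedule."""
    # Index of the first occurrence strictly after `after` (never below 0).
    k = max(0, (after - first_run) // interval + 1)
    times = []
    t = first_run + k * interval
    while t <= until:
        times.append(t)
        t += interval
    return times

class RunLedger:
    """In-memory stand-in for a table with a UNIQUE(run_key) constraint."""

    def __init__(self):
        self._runs = {}

    def record(self, job_id, scheduled_for):
        """Return True if this occurrence is new, False if already recorded."""
        key = run_key(job_id, scheduled_for)
        if key in self._runs:
            return False
        self._runs[key] = (job_id, scheduled_for)
        return True

def materialize(ledger, job_id, first_run, interval, after, until):
    """Record every due occurrence once; return how many were newly added."""
    return sum(
        ledger.record(job_id, t)
        for t in due_occurrences(first_run, interval, after, until)
    )
```

Calling `materialize` twice over the same window adds occurrences only the first time, which is the property a recovering scheduler would rely on after a crash.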