Design async job orchestration and notification service
Company: Amazon
Role: Machine Learning Engineer
Category: System Design
Difficulty: hard
Interview Round: Technical Screen
Design a service that accepts a list of job parameters, submits one job per entry to an external asynchronous cluster API, and notifies the user once every job in the list has reached a terminal state. The cluster API exposes submit_job(params) -> job_uuid and check_job_status(job_uuid), which returns one of 'SUBMITTED', 'RUNNING', 'SUCCEEDED', or 'FAILED'. Specify the architecture and components (ingest API, scheduler, worker pool, status tracker, persistent store, notification service). Detail the execution flow: fan-out submission, concurrency and rate limiting, retry/backoff and idempotency for submissions and status checks, persistence for crash recovery, handling of partial failures and retries for FAILED jobs, timeouts and detection of stuck jobs, deduplication, and cancellation. Explain how you would scale to millions of jobs and how you would choose between polling and event-driven callbacks, if the cluster offers them. Define the client- and server-side APIs, the notification semantics (exactly-once vs at-least-once), and the monitoring, alerting, and observability you would put in place.
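As a starting point for discussion, the fan-out, status-tracking, and notify-once flow described above might be sketched as follows. This is a minimal single-process illustration, not a production design: FakeCluster, Batch, and Orchestrator are hypothetical names, the in-memory dict stands in for the persistent store, and the fake cluster simply reports RUNNING twice before each job finishes.

```python
import uuid
from dataclasses import dataclass, field

TERMINAL = {"SUCCEEDED", "FAILED"}


class FakeCluster:
    """Hypothetical in-memory stand-in for the external cluster API."""

    def __init__(self):
        self._state = {}  # job_uuid -> (polls left before finishing, should_fail)

    def submit_job(self, params):
        job_uuid = str(uuid.uuid4())
        self._state[job_uuid] = (2, bool(params.get("fail", False)))
        return job_uuid

    def check_job_status(self, job_uuid):
        polls_left, should_fail = self._state[job_uuid]
        if polls_left > 0:
            self._state[job_uuid] = (polls_left - 1, should_fail)
            return "RUNNING"
        return "FAILED" if should_fail else "SUCCEEDED"


@dataclass
class Batch:
    batch_id: str
    jobs: dict = field(default_factory=dict)  # job_uuid -> last known status
    notified: bool = False


class Orchestrator:
    def __init__(self, cluster, notify):
        self.cluster = cluster
        self.notify = notify
        self.batches = {}  # assumption: a durable store in a real system

    def submit_batch(self, batch_id, params_list):
        # Deduplication: a repeated batch_id returns the existing batch.
        if batch_id in self.batches:
            return self.batches[batch_id]
        batch = Batch(batch_id)
        self.batches[batch_id] = batch
        for params in params_list:  # fan-out: one cluster job per entry
            batch.jobs[self.cluster.submit_job(params)] = "SUBMITTED"
        return batch

    def poll_once(self, batch_id):
        """Refresh non-terminal jobs; notify once when all are terminal."""
        batch = self.batches[batch_id]
        for job_uuid, status in list(batch.jobs.items()):
            if status not in TERMINAL:
                batch.jobs[job_uuid] = self.cluster.check_job_status(job_uuid)
        done = all(s in TERMINAL for s in batch.jobs.values())
        if done and not batch.notified:
            batch.notified = True  # guard so later polls do not re-notify
            self.notify(batch_id, dict(batch.jobs))
        return done


# Usage: two jobs, one of which is configured to fail.
events = []
orch = Orchestrator(FakeCluster(), lambda bid, jobs: events.append((bid, jobs)))
orch.submit_batch("batch-1", [{"x": 1}, {"x": 2, "fail": True}])
while not orch.poll_once("batch-1"):
    pass
```

The notified flag only prevents duplicate notifications within one process. Across crashes, a real service would have to persist the flag and the send together, or accept at-least-once delivery and give the client a notification ID to deduplicate on.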
Quick Answer: This question tests the distributed-systems design skills needed to build a scalable, reliable asynchronous job orchestration and notification service. It covers idempotency, eventual consistency, rate limiting, retries/backoff, failure recovery, deduplication, and API/notification semantics.
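Two of the concepts listed above, retries with backoff and idempotent submission, could be illustrated with a short Python sketch. The names backoff_delays and IdempotentSubmitter, the idempotency-key scheme, and the choice of ConnectionError as the transient error are all assumptions made for illustration. The question's submit_job(params) takes no idempotency key, so the deduplication here lives entirely on the service side.

```python
import random


def backoff_delays(base=0.5, cap=30.0, attempts=5, rng=random.random):
    """Full-jitter exponential backoff.

    Delay n is drawn uniformly from [0, min(cap, base * 2**n)).
    """
    return [rng() * min(cap, base * 2 ** n) for n in range(attempts)]


class IdempotentSubmitter:
    def __init__(self, submit_job, sleep=lambda seconds: None):
        self.submit_job = submit_job
        self.sleep = sleep    # injectable so tests do not actually wait
        self.submitted = {}   # idempotency_key -> job_uuid (durable in practice)

    def submit(self, idempotency_key, params, attempts=5):
        # A key that was already submitted returns the recorded job_uuid.
        if idempotency_key in self.submitted:
            return self.submitted[idempotency_key]
        last_error = None
        for delay in backoff_delays(attempts=attempts):
            try:
                job_uuid = self.submit_job(params)
            except ConnectionError as exc:  # assumed transient failure type
                last_error = exc
                self.sleep(delay)
                continue
            self.submitted[idempotency_key] = job_uuid
            return job_uuid
        raise RuntimeError("submission failed after retries") from last_error


# Usage: a hypothetical submit function that fails twice, then succeeds.
calls = {"n": 0}


def flaky_submit(params):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return f"job-{calls['n']}"


submitter = IdempotentSubmitter(flaky_submit)
first = submitter.submit("batch-1:0", {"x": 1})
second = submitter.submit("batch-1:0", {"x": 1})  # served from the key map
```

One gap is worth raising in the interview: if the cluster accepts a job but the response is lost, the retry creates a duplicate job that this key map cannot detect. Closing that gap needs either an idempotency token supported by the cluster or a reconciliation pass over the jobs the cluster reports.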