Design async job orchestration and notification service

Last updated: Mar 29, 2026

Quick Overview

This question evaluates system design and distributed-systems competencies needed for designing scalable, reliable asynchronous job orchestration and notification services, covering concepts such as idempotency, eventual consistency, rate limiting, retries/backoff, failure recovery, deduplication, and API/notification semantics.

Company: Amazon

Role: Machine Learning Engineer

Category: System Design

Difficulty: hard

Interview Round: Technical Screen

Design a service that accepts a list of parameters, runs the corresponding jobs through an external asynchronous cluster API, and notifies the user when all jobs in the list have completed. The cluster API provides submit_job(params) -> job_uuid and check_job_status(job_uuid) -> one of ['SUBMITTED', 'RUNNING', 'SUCCEEDED', 'FAILED'].

Specify the architecture and components (ingest API, scheduler, worker pool, status tracker, persistent store, notification service). Detail the execution flow for fan-out submission, concurrency and rate limiting, retry/backoff and idempotency for submissions and status checks, persistence for crash recovery, handling of partial failures and retries for FAILED jobs, timeouts and detection of stuck jobs, deduplication, and cancellation.

Explain how you would scale to millions of jobs and how you would choose between polling and event-driven callbacks, if callbacks are available. Define client- and server-side APIs, notification semantics (exactly-once vs at-least-once), and monitoring, alerting, and observability.

Jul 15, 2025, 12:00 AM
System Design: Batch Orchestration Over an External Asynchronous Cluster

Context

You are designing a backend service that accepts a list of job parameters from clients, submits each job to an external asynchronous compute cluster, and notifies the client when all jobs complete. The external cluster exposes two APIs:

  • submit_job(params) -> job_uuid
  • check_job_status(job_uuid) -> one of ['SUBMITTED', 'RUNNING', 'SUCCEEDED', 'FAILED']

Assume the external API is eventually consistent, has rate limits, may transiently fail, and does not guarantee idempotency unless you add your own keys. The cluster may optionally support callbacks/webhooks on job completion; otherwise, you must poll.
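The assumptions above (transient failures, no built-in idempotency) suggest wrapping submit_job in a retry loop with jittered backoff and a service-side idempotency key. The following is a minimal Python sketch under stated assumptions, not a definitive implementation: TransientError, idempotency_key, backoff_delay, submit_once, and the in-memory store dict are all hypothetical names introduced for illustration, and the store stands in for a durable key -> job_uuid table.

```python
import hashlib
import json
import random
import time


class TransientError(Exception):
    """Hypothetical error type for retryable failures (timeouts, throttling)."""


def idempotency_key(batch_id: str, params: dict) -> str:
    """Derive a stable key from the batch ID and canonicalized parameters."""
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{batch_id}:{canonical}".encode()).hexdigest()


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Full jitter: a uniform delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def submit_once(submit_job, store: dict, batch_id: str, params: dict,
                max_attempts: int = 5, sleep=time.sleep) -> str:
    """Submit params at most once per idempotency key, retrying transient errors.

    `submit_job` is the external call; `store` is an assumed stand-in for a
    durable table mapping idempotency key -> job_uuid.
    """
    key = idempotency_key(batch_id, params)
    if key in store:
        # Already submitted (client re-send or internal retry): reuse the job.
        return store[key]
    for attempt in range(max_attempts):
        try:
            job_uuid = submit_job(params)
            store[key] = job_uuid
            return job_uuid
        except TransientError:
            if attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt))
    raise RuntimeError("max_attempts must be at least 1")
```

One gap this sketch does not close: if the cluster accepts a submission but the response is lost, a retry can still create a duplicate job. A fuller design would likely record a submission intent before calling the cluster and reconcile orphaned jobs afterwards.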

Requirements

Design the service with the following components and details:

  1. Architecture and components
    • Ingest API
    • Scheduler
    • Worker pool (submitters and status pollers)
    • Status tracker and aggregator
    • Persistent store
    • Notification service
  2. Execution flow
    • Fan-out submission from a list of parameters
    • Concurrency control and rate limiting (per-tenant and global)
    • Retry, backoff with jitter, and idempotency for both submissions and status checks
    • Persistence for crash recovery and exactly-once-orchestrator semantics
    • Handling partial failures, including automatic retries for FAILED jobs based on policy
    • Timeouts and detection of stuck jobs (SUBMITTED/RUNNING too long)
    • Deduplication (same input re-sent by client or internally retried)
    • Cancellation (single job or entire batch)
  3. Scale and strategy
    • How to scale to millions of jobs and high QPS
    • Polling vs event-driven callbacks (if available): trade-offs and hybrid approach
  4. APIs and semantics
    • Define client-facing APIs and server-to-server APIs if applicable
    • Notification semantics (exactly-once vs at-least-once); integrity and dedup
  5. Operations
    • Monitoring, alerting, and observability (metrics, logs, traces; SLOs)
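For the per-tenant and global rate-limiting requirement in the list above, one common option is a two-level token bucket. The sketch below is an illustration under assumptions, not a prescribed answer: TokenBucket and TwoLevelLimiter are hypothetical names, the state is in-process only (a multi-node deployment would need shared or partitioned limiter state), and the clock is injectable so the behavior can be tested deterministically.

```python
import time


class TokenBucket:
    """Token bucket that refills `rate` tokens per second up to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.updated = clock()

    def _refill(self) -> None:
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now

    def try_acquire(self, n: float = 1.0) -> bool:
        """Take n tokens if available; never blocks."""
        self._refill()
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False

    def refund(self, n: float = 1.0) -> None:
        """Return tokens that were taken but not used."""
        self.tokens = min(self.capacity, self.tokens + n)


class TwoLevelLimiter:
    """Admit a call only if both the tenant bucket and the global bucket allow it."""

    def __init__(self, global_bucket: TokenBucket, tenant_rate: float,
                 tenant_capacity: float, clock=time.monotonic):
        self.global_bucket = global_bucket
        self.tenant_rate = tenant_rate
        self.tenant_capacity = tenant_capacity
        self.clock = clock
        self.tenants = {}  # tenant_id -> TokenBucket, created lazily

    def allow(self, tenant_id: str) -> bool:
        bucket = self.tenants.get(tenant_id)
        if bucket is None:
            bucket = TokenBucket(self.tenant_rate, self.tenant_capacity, self.clock)
            self.tenants[tenant_id] = bucket
        if not bucket.try_acquire():
            return False
        if not self.global_bucket.try_acquire():
            # Global limit hit: give the tenant its token back.
            bucket.refund()
            return False
        return True
```

Checking the tenant bucket first means a single noisy tenant is rejected before it consumes any of the global budget shared with other tenants.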

Keep the design practical and production-ready, with clear assumptions and trade-offs.
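As a starting point for the status tracker and notification semantics asked for above, here is a minimal Python sketch of batch aggregation with at-least-once notification. It is an illustration under assumptions rather than a reference solution: Batch, maybe_notify, the notification_id format, and the send callback are hypothetical, state is held in memory instead of a persistent store, and retry policy for FAILED jobs and cancellation are left out.

```python
from collections import Counter
from dataclasses import dataclass, field

# Terminal states of the external cluster API described in the question.
TERMINAL = {"SUCCEEDED", "FAILED"}


@dataclass
class Batch:
    batch_id: str
    statuses: dict = field(default_factory=dict)  # job_uuid -> last accepted status
    notified: bool = False

    def record(self, job_uuid: str, status: str) -> None:
        """Apply a status observation, ignoring stale reads after a terminal state.

        Because the external API is eventually consistent, a later poll may
        still report RUNNING for a job already seen as SUCCEEDED or FAILED.
        """
        if self.statuses.get(job_uuid) in TERMINAL:
            return
        self.statuses[job_uuid] = status

    def is_complete(self) -> bool:
        """True once every tracked job has reached a terminal state."""
        return bool(self.statuses) and all(
            s in TERMINAL for s in self.statuses.values()
        )


def maybe_notify(batch: Batch, send) -> bool:
    """Send the completion notification if it is due; return True if sent.

    The flag is set only after `send` returns, so a crash in between leads to
    a duplicate send on recovery (at-least-once). The stable notification_id
    lets the receiver discard duplicates.
    """
    if batch.notified or not batch.is_complete():
        return False
    counts = Counter(batch.statuses.values())
    send({
        "notification_id": f"{batch.batch_id}:completed",
        "batch_id": batch.batch_id,
        "succeeded": counts["SUCCEEDED"],
        "failed": counts["FAILED"],
    })
    batch.notified = True
    return True
```

Pairing at-least-once delivery with a deterministic notification_id is one way to approximate exactly-once behavior from the client's point of view without requiring a distributed transaction between the store and the notification channel.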
