Design async job orchestration and notification service

Q: Design async job orchestration and notification service

This question tests the system-design and distributed-systems skills needed to build a scalable, reliable asynchronous job orchestration and notification service. It covers idempotency, eventual consistency, rate limiting, retries with backoff, failure recovery, deduplication, and API and notification semantics.

Question

System Design: Batch Orchestration Over an External Asynchronous Cluster

Context

You are designing a backend service that accepts a list of job parameters from clients, submits each job to an external asynchronous compute cluster, and notifies the client when all jobs complete. The external cluster exposes two APIs:

submit_job(params) -> job_uuid
check_job_status(job_uuid) -> one of ['SUBMITTED', 'RUNNING', 'SUCCEEDED', 'FAILED']

Assume the external API is eventually consistent, has rate limits, may transiently fail, and does not guarantee idempotency unless you add your own keys. The cluster may optionally support callbacks/webhooks on job completion; otherwise, you must poll.
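
Because submit_job is not idempotent on its own, one possible way to add your own keys is sketched below. This is an illustration under stated assumptions, not part of the cluster's API: the names idempotency_key, SubmissionLedger, and submit_once are hypothetical, and the in-memory dict stands in for a persistent table.

```python
import hashlib
import json


def idempotency_key(tenant_id: str, batch_id: str, params: dict) -> str:
    """Derive a stable key from the tenant, batch, and canonicalized params."""
    # sort_keys and fixed separators make the JSON byte-for-byte stable,
    # so the same logical params always hash to the same key.
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    raw = f"{tenant_id}|{batch_id}|{canonical}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


class SubmissionLedger:
    """In-memory stand-in (assumption) for a persistent table keyed by idempotency key."""

    def __init__(self, submit_job):
        self._submit_job = submit_job  # the cluster's submit_job(params) call
        self._job_uuid_by_key = {}

    def submit_once(self, tenant_id: str, batch_id: str, params: dict) -> str:
        key = idempotency_key(tenant_id, batch_id, params)
        if key in self._job_uuid_by_key:
            # Duplicate request: reuse the job that was already submitted.
            return self._job_uuid_by_key[key]
        job_uuid = self._submit_job(params)
        self._job_uuid_by_key[key] = job_uuid
        return job_uuid
```

Note that this sketch does not close the window where the process crashes after submit_job returns but before the key is recorded. A durable design would likely persist an intent record before calling the cluster and reconcile on restart.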

Requirements

Design the service with the following components and details:

Architecture and components

Ingest API
Scheduler
Worker pool (submitters and status pollers)
Status tracker and aggregator
Persistent store
Notification service

Execution flow

Fan-out submission from a list of parameters
Concurrency control and rate limiting (per-tenant and global)
Retry, backoff with jitter, and idempotency for both submissions and status checks
Persistence for crash recovery and exactly-once orchestration semantics
Handling partial failures, including automatic retries for FAILED jobs based on policy
Timeouts and detection of stuck jobs (SUBMITTED/RUNNING too long)
Deduplication (same input re-sent by client or internally retried)
Cancellation (single job or entire batch)
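
The "retry, backoff with jitter" item in the list above can be sketched as follows. This is a minimal example, assuming a hypothetical TransientError that marks retryable failures (timeouts, rate-limit responses, transient server errors); the base, cap, and attempt count are illustrative defaults, not recommended values.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for retryable failures (assumption, not a cluster type)."""


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random):
    """Full jitter: a uniform delay in [0, min(cap, base * 2**attempt)] seconds."""
    return rng.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retry(fn, max_attempts=5, sleep=time.sleep, rng=random):
    """Call fn(); retry on TransientError until max_attempts is reached."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted; the caller decides what happens next
            sleep(backoff_delay(attempt, rng=rng))
```

Randomizing the delay spreads retries from many workers over time, which helps avoid synchronized bursts against a rate-limited API. Retrying a submission this way is only safe when combined with an idempotency key, since a timed-out submit_job call may still have created a job.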

Scale and strategy

How to scale to millions of jobs and high QPS
Polling vs event-driven callbacks (if available): trade-offs and hybrid approach

APIs and semantics

Define client-facing APIs and server-to-server APIs if applicable
Notification semantics (exactly-once vs at-least-once); integrity and dedup
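
One common way to combine at-least-once delivery with integrity and dedup is to sign each notification and let the receiver drop redeliveries by event ID. The sketch below assumes a shared secret between the service and the client and a hypothetical event_id field in the payload; neither is specified by the question.

```python
import hashlib
import hmac
import json


def sign_notification(secret: bytes, payload: dict) -> str:
    """Return a hex HMAC-SHA256 over the canonical JSON form of the payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


class NotificationReceiver:
    """Client-side handler that verifies integrity and drops redeliveries."""

    def __init__(self, secret: bytes):
        self._secret = secret
        self._seen_event_ids = set()  # would be a durable store in a real deployment

    def handle(self, payload: dict, signature: str) -> str:
        expected = sign_notification(self._secret, payload)
        # compare_digest avoids leaking match position through timing.
        if not hmac.compare_digest(expected, signature):
            return "rejected"
        if payload["event_id"] in self._seen_event_ids:
            return "duplicate"
        self._seen_event_ids.add(payload["event_id"])
        return "processed"
```

With this split, the sender can retry freely until it receives an acknowledgement, and the receiver's dedup makes the overall effect look like exactly-once processing from the client's point of view.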

Operations

Monitoring, alerting, and observability (metrics, logs, traces; SLOs)

Keep the design practical and production-ready, with clear assumptions and trade-offs.
