This question evaluates the system-design and distributed-systems skills needed to build scalable, reliable asynchronous job orchestration and notification services. It covers idempotency, eventual consistency, rate limiting, retries with backoff, failure recovery, deduplication, and API/notification semantics.
You are designing a backend service that accepts a list of job parameters from clients, submits each job to an external asynchronous compute cluster, and notifies the client when all jobs complete. The external cluster exposes two APIs:
Assume the external API is eventually consistent, has rate limits, may transiently fail, and does not guarantee idempotency unless you add your own keys. The cluster may optionally support callbacks/webhooks on job completion; otherwise, you must poll.
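To make the idempotency and transient-failure assumptions concrete, here is a minimal Python sketch of a client-generated idempotency key combined with capped exponential backoff and full jitter. All names (`TransientError`, `idempotency_key`, `backoff_delay`, `submit_with_retry`, and the `submit` callable) are hypothetical and not part of the cluster's API; the sketch also assumes the cluster, or your own dedup table, honors the key you send.

```python
import hashlib
import json
import random
import time


class TransientError(Exception):
    """Hypothetical marker for retryable failures (timeouts, HTTP 429/5xx)."""


def idempotency_key(batch_id: str, index: int, params: dict) -> str:
    """Derive a stable key so resubmitting the same job yields the same key."""
    # Canonical JSON: key order in the input dict must not change the hash.
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(f"{batch_id}:{index}:{canonical}".encode()).hexdigest()
    return digest[:32]


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Full jitter: a uniform delay in [0, min(cap, base * 2**attempt)] seconds."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def submit_with_retry(submit, key, params, max_attempts=5, sleep=time.sleep):
    """Call submit(key, params), retrying only on TransientError.

    `submit` is an assumed callable wrapping the cluster's submit API.
    The same key is reused on every attempt so a retry after an ambiguous
    failure cannot create a second job on the cluster side.
    """
    for attempt in range(max_attempts):
        try:
            return submit(key, params)
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # retries exhausted; let the caller park the job
            sleep(backoff_delay(attempt))
```

Deriving the key from the batch ID, the job's position, and its canonicalized parameters means a crashed worker can recompute it after restart without any extra stored state. Jitter spreads retries out so that many workers backing off together do not hit the rate limit in lockstep.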
Design the service with the following components and details:
Keep the design practical and production-ready, with clear assumptions and trade-offs.
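One possible starting point for the "notify when all jobs complete" requirement is a per-batch tracker that tolerates duplicate or late status updates and fires the notification exactly once. The Python sketch below is in-memory only and rests on assumptions: the status names (`PENDING`, `SUCCEEDED`, `FAILED`) and the `Batch`/`new_batch` names are hypothetical, and a production version would keep this state in a durable store and set the `notified` flag with a conditional (compare-and-set) write.

```python
from dataclasses import dataclass, field

# Hypothetical terminal statuses; the real cluster's status names may differ.
TERMINAL = {"SUCCEEDED", "FAILED"}


@dataclass
class Batch:
    batch_id: str
    statuses: dict = field(default_factory=dict)  # job key -> latest status
    notified: bool = False

    def record(self, job_key: str, status: str) -> bool:
        """Apply a status update from a poll or webhook.

        Returns True exactly once: on the update that makes every job in
        the batch terminal. Duplicates, late updates, and unknown keys
        return False, so the caller can notify the client on True only.
        """
        if job_key not in self.statuses:
            return False  # update for a job this batch never submitted
        if self.statuses[job_key] in TERMINAL:
            return False  # duplicate or out-of-order update; keep first result
        self.statuses[job_key] = status
        all_done = all(s in TERMINAL for s in self.statuses.values())
        if all_done and not self.notified:
            self.notified = True
            return True
        return False


def new_batch(batch_id: str, job_keys: list) -> Batch:
    """Register every job up front so 'all complete' is well defined."""
    return Batch(batch_id, {key: "PENDING" for key in job_keys})
```

For example, with `b = new_batch("b1", ["j1", "j2"])`, calling `b.record("j1", "SUCCEEDED")` returns `False`, and a later `b.record("j2", "FAILED")` returns `True`; replaying either update afterwards returns `False`, so a redelivered webhook does not trigger a second client notification.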