This question evaluates understanding of distributed systems architecture, parallel job scheduling, fault tolerance, coordination and concurrency control, API and data model design, idempotency, observability, and capacity planning for multi-tenant, heterogeneous workloads.
You are asked to design a highly available, horizontally scalable service that executes many independent tasks (embarrassingly parallel jobs) across a compute cluster. Assume multi-tenant usage, heterogeneous task durations (milliseconds to hours), and the need for strong operational visibility and robust fault tolerance.
Specify the following, with clear APIs, data models, and rationale:
State any minimal assumptions you need to make the design concrete.
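To show the level of concreteness expected for the API, data model, and idempotency parts of an answer, here is a minimal single-process sketch. Every name in it (`Task`, `TaskState`, `InMemoryTaskStore`, `submit`, `idempotency_key`) is a hypothetical choice made for illustration and is not part of the question. A real design would replace the in-memory dictionaries with a replicated, durable store.

```python
import uuid
from dataclasses import dataclass
from enum import Enum


class TaskState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


@dataclass
class Task:
    # Hypothetical task record; field names are illustrative assumptions.
    task_id: str
    tenant_id: str
    idempotency_key: str
    payload: dict
    state: TaskState = TaskState.PENDING
    attempts: int = 0


class InMemoryTaskStore:
    """Toy stand-in for a durable task table (assumption: single process)."""

    def __init__(self) -> None:
        self._by_id: dict[str, Task] = {}
        # (tenant_id, idempotency_key) -> task_id, scoped per tenant so that
        # two tenants can reuse the same key without colliding.
        self._by_key: dict[tuple[str, str], str] = {}

    def submit(self, tenant_id: str, idempotency_key: str, payload: dict) -> Task:
        """Create a task, or return the existing one if this key was seen."""
        key = (tenant_id, idempotency_key)
        existing_id = self._by_key.get(key)
        if existing_id is not None:
            # Retried submission: return the original task, create nothing.
            return self._by_id[existing_id]
        task = Task(
            task_id=str(uuid.uuid4()),
            tenant_id=tenant_id,
            idempotency_key=idempotency_key,
            payload=payload,
        )
        self._by_id[task.task_id] = task
        self._by_key[key] = task.task_id
        return task


store = InMemoryTaskStore()
first = store.submit("tenant-a", "order-42", {"op": "resize"})
retry = store.submit("tenant-a", "order-42", {"op": "resize"})
other = store.submit("tenant-b", "order-42", {"op": "resize"})
```

In this sketch a client retry with the same tenant and key returns the original task rather than enqueuing a duplicate, while the same key under a different tenant creates a separate task. In a distributed setting the lookup and insert would need to be atomic, for example through a unique constraint on the key pair.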