Design a Distributed System for Parallel Job Execution
Context
You are asked to design a highly available, horizontally scalable service that executes many independent tasks (embarrassingly parallel jobs) across a compute cluster. Assume multi-tenant usage, heterogeneous task durations (milliseconds to hours), and the need for strong operational visibility and robust fault tolerance.
Requirements
Specify the following, with clear APIs, data models, and rationale:
- Job model and external APIs
- Task partitioning and sharding strategy
- Scheduler and worker architecture
- Coordination and concurrency control (leases, heartbeats, visibility timeouts)
- Fault tolerance, retries, backoff, and dead-letter handling
- Idempotency and deduplication approach
- Progress tracking and result aggregation
- Scaling and resource management (autoscaling, quotas, bin-packing)
- Backpressure and prioritization
- Ordering guarantees and trade-offs
- Consistency semantics (at-least-once vs exactly-once) and trade-offs
- Monitoring, alerting, and observability
- Capacity planning assumptions and calculations
State any assumptions you make, keeping them to the minimum needed to make the design concrete.