Design distributed parallel job processing

This is a System Design interview question from LinkedIn for Software Engineer roles.

Question

Design a Distributed System for Parallel Job Execution

Context

You are asked to design a highly available, horizontally scalable service that executes many independent tasks (embarrassingly parallel jobs) across a compute cluster. Assume multi-tenant usage, heterogeneous task durations (milliseconds to hours), and the need for strong operational visibility and robust fault tolerance.

Requirements

Specify the following, with clear APIs, data models, and rationale:

Job model and external APIs
Task partitioning and sharding strategy
Scheduler and worker architecture
Coordination and concurrency control (leases, heartbeats, visibility timeouts)
Fault tolerance, retries, backoff, and dead-letter handling
Idempotency and deduplication approach
Progress tracking and result aggregation
Scaling and resource management (autoscaling, quotas, bin-packing)
Backpressure and prioritization
Ordering guarantees and trade-offs
Consistency semantics (at-least-once vs exactly-once) and trade-offs
Monitoring, alerting, and observability
Capacity planning assumptions and calculations

State any minimal assumptions you need to make the design concrete.
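
To make the coordination item above concrete, here is a minimal single-process sketch of lease-based task claiming with a visibility timeout, heartbeats, bounded retries, and a dead-letter set. It is an illustration under stated assumptions, not the reference solution: `LeaseQueue`, its method names, and the in-memory dictionaries are hypothetical stand-ins for whatever durable store a real design would use, and the injectable `clock` exists only so the behavior can be exercised without sleeping.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Task:
    task_id: str
    payload: str
    attempts: int = 0
    lease_owner: Optional[str] = None
    lease_expires_at: float = 0.0  # claimable once the clock reaches this value
    done: bool = False


class LeaseQueue:
    """Hypothetical in-memory stand-in for a durable task store."""

    def __init__(self, visibility_timeout: float, max_attempts: int = 3,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.visibility_timeout = visibility_timeout
        self.max_attempts = max_attempts
        self.clock = clock
        self.tasks: Dict[str, Task] = {}
        self.dead_letter: Dict[str, Task] = {}

    def submit(self, task_id: str, payload: str) -> bool:
        # Deduplicate on task_id so a retried submit is a no-op.
        if task_id in self.tasks or task_id in self.dead_letter:
            return False
        self.tasks[task_id] = Task(task_id, payload)
        return True

    def claim(self, worker_id: str) -> Optional[Task]:
        now = self.clock()
        for task in list(self.tasks.values()):
            if task.done or task.lease_expires_at > now:
                continue  # finished, or still leased to another worker
            if task.attempts >= self.max_attempts:
                # Retries exhausted: park the task for manual inspection.
                self.dead_letter[task.task_id] = self.tasks.pop(task.task_id)
                continue
            task.attempts += 1
            task.lease_owner = worker_id
            task.lease_expires_at = now + self.visibility_timeout
            return task
        return None

    def heartbeat(self, task_id: str, worker_id: str) -> bool:
        # Extend the lease only for the current owner of an unexpired lease.
        task = self.tasks.get(task_id)
        now = self.clock()
        if (task is None or task.done or task.lease_owner != worker_id
                or task.lease_expires_at <= now):
            return False
        task.lease_expires_at = now + self.visibility_timeout
        return True

    def complete(self, task_id: str, worker_id: str) -> bool:
        # Reject completions from a worker that has since lost the lease.
        task = self.tasks.get(task_id)
        if task is None or task.done or task.lease_owner != worker_id:
            return False
        task.done = True
        return True
```

Because an expired lease makes the task claimable again, the same task can run more than once. The sketch therefore gives at-least-once execution, which is why the idempotency and deduplication item in the list matters: task handlers must tolerate a repeat run.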
