Design a Distributed Task Scheduling Infrastructure
Context
Design a distributed task scheduling and orchestration system that runs at scale, supports task dependencies (DAGs), persists state, and handles both short- and long-running tasks.
Requirements
- Data models
  - Define schemas/models for Tasks and DAGs, including runs/attempts and state transitions.
- APIs
  - Submit a task or DAG, cancel, and query status/history.
- Execution engine
  - How tasks are dispatched, leased, executed, and acknowledged.
- Fault tolerance
  - Retries with backoff, idempotency, exactly-once vs at-least-once semantics.
- Backpressure
  - Prevent overload and provide fairness.
- Worker heartbeats
  - Registration, lease extensions, and liveness detection.
- Stuck/straggler handling
  - Detection and mitigation (e.g., timeouts, speculative execution).
- Scaling & multi-tenancy
  - Horizontal scale, isolation, quotas, fairness, and preemption.
- Observability
  - Metrics, logs, tracing, audits, and a minimal UI model.
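To make the data-model requirement concrete, here is a minimal Python sketch of a task run with attempts and an explicit state machine. The state names, the `TRANSITIONS` table, and the `TaskRun` and `Attempt` fields are illustrative assumptions, not part of the prompt; a real design would persist these records in the metadata store.

```python
from dataclasses import dataclass, field
from enum import Enum


class TaskState(Enum):
    # Hypothetical state set; the prompt does not prescribe one.
    PENDING = "pending"      # waiting on upstream dependencies
    QUEUED = "queued"        # dependencies met, ready for dispatch
    RUNNING = "running"      # currently leased by a worker
    SUCCEEDED = "succeeded"
    FAILED = "failed"        # retries exhausted
    CANCELLED = "cancelled"


# Allowed transitions. Terminal states have no outgoing edges.
# RUNNING -> QUEUED models a retry after a failed or expired attempt.
TRANSITIONS = {
    TaskState.PENDING: {TaskState.QUEUED, TaskState.CANCELLED},
    TaskState.QUEUED: {TaskState.RUNNING, TaskState.CANCELLED},
    TaskState.RUNNING: {
        TaskState.SUCCEEDED,
        TaskState.FAILED,
        TaskState.QUEUED,
        TaskState.CANCELLED,
    },
    TaskState.SUCCEEDED: set(),
    TaskState.FAILED: set(),
    TaskState.CANCELLED: set(),
}


@dataclass
class Attempt:
    number: int
    worker_id: str


@dataclass
class TaskRun:
    task_id: str
    dag_run_id: str
    depends_on: list[str] = field(default_factory=list)  # upstream task ids
    state: TaskState = TaskState.PENDING
    max_attempts: int = 3
    attempts: list[Attempt] = field(default_factory=list)

    def transition(self, new_state: TaskState) -> None:
        """Move to new_state, rejecting edges not listed in TRANSITIONS."""
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(
                f"illegal transition {self.state.name} -> {new_state.name}"
            )
        self.state = new_state
```

Keeping the transition table as data makes it easy to audit and to enforce the same rules in the store (for example, with a conditional update on the current state).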
Assume a heterogeneous worker fleet (containers/VMs), a durable message bus, and a persistent metadata store.
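As a rough illustration of the leasing, heartbeat, and acknowledgment requirements, the sketch below uses an in-memory table as a stand-in for the persistent metadata store. `LeaseTable`, its method names, and the 30-second default are assumptions made for this example. It gives at-least-once execution: a task whose lease expires can be handed to another worker, so task handlers would need to be idempotent.

```python
from dataclasses import dataclass


@dataclass
class Lease:
    task_id: str
    worker_id: str
    expires_at: float  # seconds on whatever clock the caller supplies


class LeaseTable:
    """In-memory stand-in for the metadata store (hypothetical API)."""

    def __init__(self, lease_seconds: float = 30.0) -> None:
        self.lease_seconds = lease_seconds
        self._leases: dict[str, Lease] = {}

    def _held_by(self, task_id: str, worker_id: str, now: float) -> bool:
        lease = self._leases.get(task_id)
        return (
            lease is not None
            and lease.worker_id == worker_id
            and lease.expires_at > now
        )

    def acquire(self, task_id: str, worker_id: str, now: float) -> bool:
        """Grant the lease unless another unexpired lease exists."""
        lease = self._leases.get(task_id)
        if lease is not None and lease.expires_at > now:
            return False
        self._leases[task_id] = Lease(task_id, worker_id, now + self.lease_seconds)
        return True

    def heartbeat(self, task_id: str, worker_id: str, now: float) -> bool:
        """Extend the lease; fails if the worker no longer holds it."""
        if not self._held_by(task_id, worker_id, now):
            return False
        self._leases[task_id].expires_at = now + self.lease_seconds
        return True

    def ack(self, task_id: str, worker_id: str, now: float) -> bool:
        """Acknowledge completion; only the current lease holder may ack."""
        if not self._held_by(task_id, worker_id, now):
            return False
        # A real store would also mark the task terminal in the same write.
        del self._leases[task_id]
        return True

    def expired(self, now: float) -> list[str]:
        """Task ids whose workers missed their heartbeats (liveness check)."""
        return [t for t, l in self._leases.items() if l.expires_at <= now]
```

A usage sketch: worker `w1` acquires a task and stops heartbeating; once the lease lapses, `expired` reports the task, `w2` can acquire it, and any late `ack` from `w1` is rejected.

```python
table = LeaseTable(lease_seconds=30.0)
table.acquire("t1", "w1", now=0.0)
table.expired(now=31.0)             # "t1" is reported as expired
table.acquire("t1", "w2", now=31.0)
table.ack("t1", "w1", now=32.0)     # rejected: w1 no longer holds the lease
```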