Design a pipeline/workflow orchestration system (similar to a DAG-based scheduler) that runs workloads on Kubernetes.
Functional requirements
- Users can define pipelines as DAGs of tasks (task dependencies); a minimal data-model sketch follows this list.
- Tasks run as containerized jobs on Kubernetes.
- Support scheduled runs (cron / interval), ad-hoc runs, and manual retries.
- Track task/run states (queued, running, success, failed, cancelled).
- Provide logs per task and a basic UI/API for viewing pipeline status.
- Retry policies, timeouts, and failure notifications.
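For concreteness, one possible shape for the core data model is sketched below as Go types. The field names, the `RetryPolicy` struct, and the cron-string schedule are illustrative assumptions, not a prescribed schema.

```go
package model

import "time"

// RunState covers both pipeline runs and individual task runs.
type RunState string

const (
	StateQueued    RunState = "queued"
	StateRunning   RunState = "running"
	StateSuccess   RunState = "success"
	StateFailed    RunState = "failed"
	StateCancelled RunState = "cancelled"
)

// RetryPolicy and timeout settings apply per task.
type RetryPolicy struct {
	MaxAttempts int           // total attempts, including the first
	Backoff     time.Duration // base delay, e.g. doubled per attempt
}

// Task is one node in the DAG: a container image plus its upstream dependencies.
type Task struct {
	Name      string
	Image     string
	Command   []string
	DependsOn []string // names of upstream tasks
	Timeout   time.Duration
	Retry     RetryPolicy
}

// Pipeline is the user-facing definition; Schedule is a cron expression,
// left empty for ad-hoc-only pipelines.
type Pipeline struct {
	ID       string
	Tenant   string // team / namespace for multi-tenancy
	Schedule string
	Tasks    []Task
}

// PipelineRun and TaskRun are the execution records the scheduler persists.
type PipelineRun struct {
	ID         string
	PipelineID string
	State      RunState
	StartedAt  time.Time
}

type TaskRun struct {
	RunID    string
	TaskName string
	Attempt  int
	State    RunState
	K8sJob   string // name of the Kubernetes Job backing this attempt
}
```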
Non-functional requirements
- Multi-tenant support (teams/namespaces) and RBAC.
- Scale to many concurrent runs (e.g., thousands of tasks per minute).
- High availability of the control plane.
- Handle worker/pod failures and scheduler restarts (a reconciliation sketch follows this list).
- Observability: metrics, tracing, structured logs.
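One common way to survive scheduler restarts is to treat the Kubernetes Job objects themselves as the source of truth for in-flight work and reconcile persisted state against them on startup. The sketch below uses real client-go calls, but the `Store` interface and the `orchestrator/*` labels are assumptions, and error handling is trimmed.

```go
package recovery

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// Store is a stand-in for the orchestrator's persistence layer (assumption).
type Store interface {
	MarkTaskRun(ctx context.Context, runID, taskName, state string) error
}

// Reconcile re-derives task-run state from existing Jobs after a scheduler restart.
func Reconcile(ctx context.Context, cs kubernetes.Interface, ns string, store Store) error {
	jobs, err := cs.BatchV1().Jobs(ns).List(ctx, metav1.ListOptions{
		LabelSelector: "orchestrator/run-id", // only Jobs created by this system (assumed label)
	})
	if err != nil {
		return err
	}
	for _, job := range jobs.Items {
		runID := job.Labels["orchestrator/run-id"]
		taskName := job.Labels["orchestrator/task"]
		switch {
		case job.Status.Succeeded > 0:
			_ = store.MarkTaskRun(ctx, runID, taskName, "success")
		case jobFailed(&job):
			_ = store.MarkTaskRun(ctx, runID, taskName, "failed")
		default:
			_ = store.MarkTaskRun(ctx, runID, taskName, "running")
		}
	}
	return nil
}

// jobFailed reports whether the Job hit its backoff limit or active deadline.
func jobFailed(job *batchv1.Job) bool {
	for _, c := range job.Status.Conditions {
		if c.Type == batchv1.JobFailed && c.Status == corev1.ConditionTrue {
			return true
		}
	}
	return false
}
```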
Constraints / assumptions
- You can assume Kubernetes is available.
- You can choose the storage, queue, and execution model.
Describe the high-level architecture, core components, and data model, and explain how you handle scheduling, execution, retries, and failure recovery.
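To ground the execution and retry discussion, here is a sketch of how a single task attempt could be submitted as a Kubernetes Job via client-go. Setting `BackoffLimit` to 0 and letting the scheduler drive retries (each attempt is a fresh Job the orchestrator can observe and count) is one design choice, not the only one; the label names and the `int32Ptr`/`int64Ptr` helpers are assumptions.

```go
package executor

import (
	"context"
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func int32Ptr(v int32) *int32 { return &v }
func int64Ptr(v int64) *int64 { return &v }

// SubmitTaskAttempt creates one Kubernetes Job for one task attempt.
// On failure the scheduler creates a new Job with attempt+1 until the
// task's retry policy is exhausted.
func SubmitTaskAttempt(ctx context.Context, cs kubernetes.Interface, ns string,
	runID, taskName, image string, command []string, attempt int, timeoutSeconds int64) error {

	job := &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{
			Name: fmt.Sprintf("%s-%s-%d", runID, taskName, attempt),
			Labels: map[string]string{ // labels let the scheduler find and reconcile its Jobs
				"orchestrator/run-id":  runID,
				"orchestrator/task":    taskName,
				"orchestrator/attempt": fmt.Sprintf("%d", attempt),
			},
		},
		Spec: batchv1.JobSpec{
			BackoffLimit:          int32Ptr(0),              // the orchestrator owns retries, not the Job controller
			ActiveDeadlineSeconds: int64Ptr(timeoutSeconds), // enforces the task timeout
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{{
						Name:    "task",
						Image:   image,
						Command: command,
					}},
				},
			},
		},
	}
	_, err := cs.BatchV1().Jobs(ns).Create(ctx, job, metav1.CreateOptions{})
	return err
}
```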