This question evaluates a candidate's ability to design a scalable, fault-tolerant job scheduler. It covers data modeling, public APIs, component architecture (ingress, scheduler, queue, workers), scheduling algorithms, idempotency, retry semantics, observability, and failure handling.
Assume a multi-tenant service that schedules and runs user-defined jobs (HTTP/webhook, internal tasks, etc.). Jobs may be one-off or recurring (e.g., cron). The system must operate at scale with strong observability and fault tolerance.
Specify the following:
Design schemas for jobs, schedules, executions, workers, and supporting entities with fields such as: id, tenant, payload, schedule type/cron, next_run_at, status, retry policy, dedupe key, shard key, and updated_at.
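As one illustration of the job schema, the fields listed above could be sketched as Python dataclasses. The status values, the `RetryPolicy` defaults, the `shard_for` helper, and the shard count of 64 are assumptions made for this example, not part of the question.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class JobStatus(Enum):
    # Hypothetical status set; a real design may need more states.
    SCHEDULED = "scheduled"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    CANCELED = "canceled"


@dataclass
class RetryPolicy:
    max_attempts: int = 3        # assumed default
    base_delay_s: float = 1.0    # assumed default
    max_delay_s: float = 300.0   # assumed default


@dataclass
class Job:
    id: str
    tenant: str
    payload: dict
    schedule_type: str                 # "once" or "cron"
    cron: Optional[str]                # cron expression when schedule_type == "cron"
    next_run_at: datetime              # time-ordered key the scheduler scans on
    status: JobStatus = JobStatus.SCHEDULED
    retry_policy: RetryPolicy = field(default_factory=RetryPolicy)
    dedupe_key: Optional[str] = None
    shard_key: int = 0
    updated_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def shard_for(tenant: str, job_id: str, num_shards: int = 64) -> int:
    """Derive a stable shard from tenant and job id (hypothetical helper)."""
    digest = hashlib.sha256(f"{tenant}:{job_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

Hashing tenant and job id together spreads a large tenant across shards; hashing the tenant alone would keep a tenant's jobs co-located at the cost of hot shards.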
Define APIs for creating, updating, canceling, and querying jobs; listing upcoming jobs; and triggering immediate runs. Include idempotency and filters.
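A minimal in-memory sketch of idempotent creation and a filtered listing of upcoming jobs might look like the following. The `JobStore` class, its method names, and the use of a client-supplied idempotency key scoped per tenant are assumptions for illustration; a real service would back this with a database and a unique constraint.

```python
import uuid


class JobStore:
    """Hypothetical in-memory stand-in for the persistence layer."""

    def __init__(self):
        self._jobs = {}          # job_id -> job dict
        self._idempotency = {}   # (tenant, idempotency_key) -> job_id

    def create_job(self, tenant, idempotency_key, spec):
        """Return (job, created). A repeated key returns the original job."""
        key = (tenant, idempotency_key)
        existing_id = self._idempotency.get(key)
        if existing_id is not None:
            return self._jobs[existing_id], False
        job_id = str(uuid.uuid4())
        # Server-owned fields come last so the caller's spec cannot override them.
        job = {**spec, "id": job_id, "tenant": tenant, "status": "scheduled"}
        self._jobs[job_id] = job
        self._idempotency[key] = job_id
        return job, True

    def cancel_job(self, tenant, job_id):
        """Cancel a job owned by the tenant; return True if it was found."""
        job = self._jobs.get(job_id)
        if job is None or job["tenant"] != tenant:
            return False
        job["status"] = "canceled"
        return True

    def list_upcoming(self, tenant, due_before=None, status="scheduled"):
        """List a tenant's jobs with the given status, ordered by next_run_at."""
        jobs = [
            j for j in self._jobs.values()
            if j["tenant"] == tenant
            and j["status"] == status
            and (due_before is None or j["next_run_at"] <= due_before)
        ]
        return sorted(jobs, key=lambda j: j["next_run_at"])
```

Scoping the key by tenant keeps one tenant's retries from colliding with another tenant's keys.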
Describe ingress, scheduler, queue/broker, workers, persistence, and timers. Explain idempotency, retries/backoff, dead-lettering, observability, and failure handling.
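The retry, backoff, and dead-lettering behavior can be sketched as below. This uses exponential backoff with full jitter, which is one common choice rather than the required answer; the function names, the default base and cap, and the list used as a dead-letter queue are assumptions for the example.

```python
import random


def backoff_delay(attempt, base=1.0, cap=300.0, rng=random):
    """Full-jitter exponential backoff: uniform in [0, min(cap, base * 2**attempt)]."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0, ceiling)


def run_with_retries(task, max_attempts, dead_letter, sleep=lambda seconds: None):
    """Run task up to max_attempts times; record exhausted work in dead_letter.

    `sleep` is injectable so tests do not wait; pass time.sleep in real use.
    Returns the task result, or None if every attempt failed.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            return task()
        except Exception as exc:  # a real worker would separate retryable errors
            last_error = exc
            if attempt + 1 < max_attempts:
                sleep(backoff_delay(attempt))
    dead_letter.append({"error": repr(last_error), "attempts": max_attempts})
    return None
```

Because a task may run more than once under this scheme, handlers need to be idempotent or guarded by a dedupe key.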
Explain indexing/partitioning, range scans on time-ordered keys, min-heap/time-wheel approaches, caching, and contention control.
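The min-heap approach can be illustrated with Python's `heapq`. This is a minimal single-process sketch: the `TimerHeap` class and its method names are hypothetical, and times are plain numbers (for example, epoch seconds). A time wheel would trade the O(log n) insert for O(1) bucket insertion at a fixed tick resolution.

```python
import heapq
import itertools


class TimerHeap:
    """In-memory min-heap of (run_at, job_id), ordered by due time."""

    def __init__(self):
        self._heap = []
        # The counter breaks ties so equal run_at values pop in insertion order.
        self._counter = itertools.count()

    def schedule(self, run_at, job_id):
        heapq.heappush(self._heap, (run_at, next(self._counter), job_id))

    def peek_next_run_at(self):
        """Return the earliest due time, or None if the heap is empty."""
        return self._heap[0][0] if self._heap else None

    def pop_due(self, now):
        """Remove and return all job ids with run_at <= now, earliest first."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            _, _, job_id = heapq.heappop(self._heap)
            due.append(job_id)
        return due
```

In a sharded deployment, each scheduler instance would typically load only the near-future window for its shards into such a heap via a range scan on `next_run_at`, and leave the rest in the database.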
Discuss scaling strategy, at-least-once vs exactly-once execution semantics, multi-region architecture, and clock-skew mitigation. Provide trade-offs and simple diagrams.
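One way to reason about at-least-once execution under clock skew is a lease with a safety margin: a worker may take over an execution only after the previous lease has expired by more than the assumed maximum skew. The sketch below is a single-process illustration; the `LeaseTable` class, the 30-second lease, and the 2-second margin are assumptions, and a real system would store leases in a strongly consistent database.

```python
class LeaseTable:
    """Hypothetical lease registry keyed by execution id."""

    def __init__(self, lease_s=30.0, skew_margin_s=2.0):
        self.lease_s = lease_s
        self.skew_margin_s = skew_margin_s
        self._leases = {}  # execution_id -> (worker_id, expires_at)

    def try_claim(self, execution_id, worker_id, now):
        """Claim or renew a lease; return True if worker_id now holds it."""
        holder = self._leases.get(execution_id)
        if holder is not None:
            owner, expires_at = holder
            # Another worker's lease counts as live until it has expired by
            # more than the skew margin, which tolerates drifting clocks.
            if owner != worker_id and now < expires_at + self.skew_margin_s:
                return False
        self._leases[execution_id] = (worker_id, now + self.lease_s)
        return True
```

A lease alone does not give exactly-once execution: a paused worker can still finish after its lease is taken over, so the side effect itself needs an idempotency or dedupe check.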