Cluster Job Scheduling And Resource Isolation

What's being tested

Interviewers probe your ability to design and reason about a multi-tenant cluster scheduler that balances throughput, job latency, and tenant isolation. They expect familiarity with real scheduler primitives (admission control, fairness, preemption, gang scheduling), practical isolation mechanisms (container limits, `cgroups`), and the tradeoffs between utilization and predictability. Databricks cares because efficient, isolation-safe scheduling directly affects job completion times, cluster utilization, and customer SLAs for analytics and ML workloads.

Core knowledge

Scheduler objectives: trade off throughput (jobs/sec), mean/median job completion time, tail latency (p99), utilization, and fairness; optimize using weighted combinations or SLO-driven admission control.
Scheduling algorithms: offline bin-packing is NP-hard; use heuristics like First-Fit Decreasing, Best-Fit, and variant multi-resource bin-packing; quantify cost: heuristics are O(n log n) or O(n·m).
Multi-resource fairness: Dominant Resource Fairness (DRF) generalizes fairness across CPU/memory/GPU; each tenant’s allocation is measured by its dominant share and used to equalize shares.
Gang scheduling and co-allocation: for parallel jobs (e.g., Spark executors), schedule whole groups atomically to avoid head-of-line blocking; implement via reservation or hold-and-backfill strategies to avoid fragmentation.
Preemption & checkpoints: preemption reduces tail latency but wastes work; pair with checkpointing or application-level retries; quantify break-even: preemption cost < waiting-time improvement.
Backfilling & headroom: backfilling increases utilization by running small jobs in gaps; requires accurate runtime estimates and strict admission control to prevent starving larger jobs.
Admission control & autoscaling: enforce capacity by rejecting or queuing additional jobs; autoscaling adds nodes when sustained headroom < threshold; avoid thrash via cooldown and hysteresis.
Resource isolation primitives: use `cgroups` (v1/v2) for CPU shares/sets, memory limits and OOM handling, and Linux namespaces for network/IPC isolation; CPU pinning reduces jitter but increases fragmentation.
OOM and memory accounting: memory limits may trigger OOM killer; prefer hard limits for tenants requiring isolation, soft limits plus swap for best-effort workloads; monitor RSS vs. cached pages.
Network and disk QoS: isolate noisy neighbors via token buckets, tc for bandwidth shaping, and per-tenant IOPS limits (kernel blkio or storage-level QoS).
Scheduling metrics & formulas: utilization = busy_time / total_time; job slowdown = (wait_time + run_time) / run_time; use exponential moving averages for runtime predictions.
State & failure modes: scheduler must handle node flakiness (eviction, transient disconnects) and adopt conservative allocations (replica placement, avoiding correlated failures, respecting rack-awareness).

Worked example — "Design a multi-tenant scheduler that minimizes average job completion time while ensuring tenant isolation"

First 30s: ask about workload mix (short vs long jobs, batch vs streaming), resource types (CPU/GPU/memory/disk/net), SLA per tenant, and whether jobs support checkpointing. Declare assumptions: heterogeneous nodes, preemption allowed, and accurate runtime estimates within ±30%.

Skeleton of answer:

Define objectives and SLOs: weighted average completion time + tenant isolation constraints (max dominant share).
Admission control layer: per-tenant queue with rate/weight and `DRF`-based capacity caps; reject/queue when tenant exceeds share.
Core scheduler: combine best-fit decreasing for bin-packing with backfilling for small jobs; apply gang scheduling for parallel jobs.
Isolation enforcement: use `cgroups` for CPU/memory limits and network shaping for noisy neighbors; implement soft limits for best-effort workloads.
Preemption and checkpointing: preempt only low-priority or best-effort jobs; require checkpointing for preemptible long jobs to bound wasted work.

Tradeoff to flag: aggressive preemption reduces average wait time but increases wasted compute and may reduce utilization—quantify threshold when preemption yields net benefit. Close: if more time, add runtime prediction model, adaptive backoff between backfilling and strict capacity, and multi-cluster placement optimization considering data locality.

A second angle — "How to support bursty ML training jobs with large GPU/Memory requests"

Same primitives apply but constraints shift: GPUs are scarce and hard to preempt. Prioritize gang scheduling and reservation windows to co-allocate all GPUs, use bin-packing across node-level GPU topology (NUMA and PCIe), and prefer scheduling policies that reserve headroom for expected bursts. Make checkpointing mandatory for long GPU jobs; where checkpointing is expensive, prefer pre-reservation and admission control rather than preemption. For throughput, consider opportunistic packing of small experiments into idle GPUs using container-level isolation and cgroup GPU accounting (or device plugin frameworks in `Kubernetes`).

Common pitfalls

Pitfall: Over-optimizing for utilization. Many candidates maximize utilization and neglect tail latency or SLA guarantees; explicitly show how utilization gains would impact tenant SLOs and why a balanced objective is needed.

Pitfall: Treating resources as fungible. Saying "just pack CPU and memory" ignores multi-resource conflicts—explain DRF or dominant-share metrics to avoid starvation when jobs are bottlenecked by different resources.

Pitfall: Ignoring preemption costs. Proposing aggressive preemption without quantifying wasted work or checkpoint overhead looks naive; estimate checkpoint frequency, time-to-recover, and net benefit.

Connections

Interviewers may pivot to autoscaling and cluster elasticity, data locality & network-aware placement, or scheduler implementation details (leader election, persistence, and high-availability). They might also ask about runtime prediction models for job durations or how to instrument scheduler metrics (e.g., wait distributions, fairness indices).