Design the Control Plane for a Compute Cluster: Host Monitoring + Job Dispatch
Company: NVIDIA
Role: Software Engineer
Category: System Design
Difficulty: medium
Interview Round: Technical Screen
# Design the Control Plane for a Compute Cluster: Host Monitoring + Job Dispatch
You are building the **control plane** for a compute cluster that runs batch jobs (for example HPC / GPU workloads). A single logical **central service** sits above a fleet of **worker hosts**. The central service must continuously know the live state of every host — whether it is up, how busy it is, and how much spare capacity it has — so that it can dispatch jobs to hosts that have room.
Focus on the control plane, **not** the placement algorithm. Concretely, design:
- how the central service learns and tracks the state of each host,
- what data store holds host state and what the **schema** looks like,
- how the system stays correct and fast under **high write concurrency** (every host reporting frequently),
- how it **scales** as the fleet grows from ~1,000 toward ~10,000 hosts, and
- how it stays available when the central service node fails.
Assume the scheduler reads host state and picks a target host; the placement policy itself (bin-packing, priority, fairness) is out of scope.
### Constraints & Assumptions
- Fleet size: start at ~1,000 hosts; design to grow toward ~10,000.
- Each host reports its status periodically (heartbeat), e.g. every 1-5 seconds.
- Per-host state is small (CPU/GPU utilization, free memory, running-job count, health) — on the order of a few hundred bytes.
- A dead host must be detected within a few heartbeat intervals (single-digit seconds).
- Both writes (heartbeats) and reads (placement lookups) are frequent; placement can tolerate 1-2 s of staleness.
- Host state is reconstructable from heartbeats; it does not need to be kept as a durable source of truth unless stated otherwise.
- The scheduling/placement algorithm is out of scope.
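
As a rough sizing check on these constraints, the heartbeat write rate is fleet size divided by heartbeat interval. The sketch below assumes 300 bytes per heartbeat as a stand-in for "a few hundred bytes"; the function name `heartbeat_load` is illustrative, not part of the problem statement.

```python
def heartbeat_load(hosts: int, interval_s: float, payload_bytes: int = 300):
    """Return (writes per second, inbound bytes per second, live-table bytes).

    payload_bytes=300 is an assumed value for "a few hundred bytes".
    """
    writes_per_s = hosts / interval_s
    return writes_per_s, writes_per_s * payload_bytes, hosts * payload_bytes


for hosts in (1_000, 10_000):
    for interval_s in (1, 5):
        wps, bps, table = heartbeat_load(hosts, interval_s)
        print(f"{hosts:>6} hosts @ {interval_s}s: {wps:>6.0f} writes/s, "
              f"{bps / 1e6:.2f} MB/s in, {table / 1e6:.1f} MB live table")
```

Under these assumptions the extreme case (10,000 hosts at a 1 s interval) is about 10,000 writes/s over roughly 3 MB of live state, which suggests the pressure is on write throughput and availability rather than on data volume.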
### Clarifying Questions to Ask
- What heartbeat interval and failure-detection latency are acceptable? (Drives write QPS and timeout thresholds.)
- How fresh must host state be for placement — is 1-2 s of staleness fine, or must reads be strongly consistent?
- Is the fleet in a single data center / region, or geographically distributed?
- What is the durability requirement for host state — purely ephemeral, or must it survive a full control-plane restart?
- How many concurrent placement readers exist (one scheduler, or many)?
- What should happen to in-flight jobs when a host or the central service restarts?
### Part 1 — Host ↔ central service communication
Design the protocol by which the central service learns each host's status. Cover push vs pull, the heartbeat payload, how a new host registers, and how a dead/unresponsive host is detected and marked down.
```hint Direction of the connection
Prefer hosts *pushing* heartbeats to the central service over the central service *polling* 1,000+ hosts. Push scales with the fleet and gives you a free liveness signal — reason about what a *missed* heartbeat should mean.
```
```hint Failure detection
Don't declare a host dead on one missed beat. Use a timeout of N missed intervals, and consider a lightweight lease/TTL so "down" happens automatically when beats stop, instead of requiring the central service to actively probe.
```
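
The two hints above can be sketched as a small timeout-based failure detector. This is a minimal in-memory illustration, not a prescribed design: the class name `LivenessTracker`, the 2 s interval, and N = 3 are assumptions, and treating the first heartbeat as registration is one possible choice.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class LivenessTracker:
    interval_s: float = 2.0   # assumed heartbeat interval
    missed_allowed: int = 3   # assumed N: missed intervals before "down"
    _last_seen: Dict[str, float] = field(default_factory=dict, init=False)

    def heartbeat(self, host_id: str, now: Optional[float] = None) -> None:
        # Stamp on receipt with the server's monotonic clock, not the host's clock.
        # The first heartbeat from an unknown host also registers it (assumption).
        self._last_seen[host_id] = time.monotonic() if now is None else now

    def is_up(self, host_id: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        last = self._last_seen.get(host_id)
        if last is None:
            return False  # never registered
        return (now - last) <= self.interval_s * self.missed_allowed

    def down_hosts(self, now: Optional[float] = None) -> List[str]:
        now = time.monotonic() if now is None else now
        return sorted(h for h in self._last_seen if not self.is_up(h, now))


tracker = LivenessTracker(interval_s=2.0, missed_allowed=3)
tracker.heartbeat("host-a", now=100.0)
tracker.heartbeat("host-b", now=100.0)
tracker.heartbeat("host-b", now=104.0)
# host-a was last seen 7 s ago against a 6 s timeout; host-b only 3 s ago.
print(tracker.down_hosts(now=107.0))  # prints ['host-a']
```

A lease/TTL variant would move the same timeout into the store: each heartbeat rewrites the host's key with an expiry of N intervals, so a host that stops reporting disappears without anyone scanning for it.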
### Part 2 — Data store and schema
Choose the data store(s) for host state and define the schema. Justify the choice given the read/write pattern and freshness requirements.
```hint Match the store to the access pattern
The hot working set is one small row per host, keyed by host id, read and overwritten constantly. That points at an in-memory key-value store for the live table. Separately ask whether *any* of it must be durable.
```
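
One possible shape for that per-host row is sketched below. The field list follows the per-host state named in the constraints (CPU/GPU utilization, free memory, running-job count, health); the exact field names, the `host:<id>` key format, the health values, and the flat string mapping (similar to a hash in a key-value store) are all assumptions for illustration.

```python
from dataclasses import asdict, dataclass


@dataclass
class HostState:
    host_id: str
    health: str               # assumed values: "healthy", "degraded", "draining"
    cpu_util: float           # fraction in [0.0, 1.0]
    gpu_util: float           # fraction in [0.0, 1.0]
    free_mem_bytes: int
    running_jobs: int
    last_heartbeat_ts: float  # assigned by the central service on receipt


def row_key(host_id: str) -> str:
    """Key of the live row; the 'host:' prefix is a hypothetical convention."""
    return f"host:{host_id}"


def to_row(state: HostState) -> dict:
    """Flatten to field -> string, leaving host_id in the key only."""
    return {k: str(v) for k, v in asdict(state).items() if k != "host_id"}


def from_row(host_id: str, row: dict) -> HostState:
    """Rebuild a typed HostState from the flat string mapping."""
    return HostState(
        host_id=host_id,
        health=row["health"],
        cpu_util=float(row["cpu_util"]),
        gpu_util=float(row["gpu_util"]),
        free_mem_bytes=int(row["free_mem_bytes"]),
        running_jobs=int(row["running_jobs"]),
        last_heartbeat_ts=float(row["last_heartbeat_ts"]),
    )


state = HostState("gpu-0042", "healthy", 0.35, 0.9, 64 * 2**30, 3, 1_700_000_000.5)
print(row_key(state.host_id), to_row(state))
```

Keeping every field in one row per host means a heartbeat is a single overwrite and a placement lookup is a single read by key; whether any of it also goes to a durable store is the separate question the hint raises.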
#### Clarifying Questions for this Part
- Does any historical host metric need to be queried later (capacity planning, debugging), or does only the latest state matter?
### Part 3 — High write concurrency
With 1,000+ hosts each writing every 1-5 s, plus concurrent placement reads, explain how you keep updates correct and fast, and how you avoid the central service or the store becoming a bottleneck.
```hint Make each update independent
A heartbeat from host A never conflicts with host B. Exploit that: partition/shard by host id so writes spread out and never serialize on a single lock.
```
```hint Atomic per-key updates
If a heartbeat needs read-modify-write on a host's row (e.g. derived fields), do it atomically server-side — a Redis Lua script or a single atomic command — rather than read-then-write round trips that race.
```
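
Both hints can be illustrated with an in-process stand-in for the store (this is not Redis, only a sketch of the same idea). Rows are partitioned by a hash of the host id, each shard has its own lock, and the read-modify-write happens in one critical section. The per-host sequence number `seq` used to reject stale or duplicate heartbeats is an assumption, as are the class and method names.

```python
import threading
import zlib
from typing import Optional


class ShardedHostTable:
    def __init__(self, num_shards: int = 16):
        # One (rows, lock) pair per shard: hosts in different shards never contend.
        self._shards = [({}, threading.Lock()) for _ in range(num_shards)]

    def _shard(self, host_id: str):
        # crc32 is stable across processes, unlike the built-in hash() for str.
        return self._shards[zlib.crc32(host_id.encode()) % len(self._shards)]

    def apply_heartbeat(self, host_id: str, seq: int, fields: dict) -> bool:
        """Atomically overwrite the row unless the heartbeat is stale."""
        rows, lock = self._shard(host_id)
        with lock:  # check-and-write in a single critical section
            row = rows.get(host_id)
            if row is not None and seq <= row["seq"]:
                return False  # stale or duplicate heartbeat: keep the newer row
            rows[host_id] = {**fields, "seq": seq}
            return True

    def get(self, host_id: str) -> Optional[dict]:
        rows, lock = self._shard(host_id)
        with lock:
            row = rows.get(host_id)
            return dict(row) if row is not None else None


table = ShardedHostTable(num_shards=4)
table.apply_heartbeat("host-a", seq=2, fields={"running_jobs": 5})
table.apply_heartbeat("host-a", seq=1, fields={"running_jobs": 4})  # late arrival, ignored
print(table.get("host-a"))  # prints {'running_jobs': 5, 'seq': 2}
```

In a real store the same check-and-write would run server-side, for example inside a Lua script, so that it stays atomic without a client-side lock or extra round trips.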
### Part 4 — Scaling and high availability
Scale the central service toward 10k hosts and remove the single point of failure.
```hint Stateless front, replicated state
Make the central service tier stateless so you can run many instances behind a load balancer, and push the hard state into a replicated store. The availability question then reduces to "how is the *store* replicated and failed over?"
```
```hint Replication + failover, with a caveat
Primary/replica with automatic failover (e.g. Redis Sentinel / Cluster) removes the SPOF — but name the trade-off: async replication can drop the last few heartbeats on failover, which is usually fine because heartbeats self-heal within one interval.
```
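
The failover caveat can be made concrete with a toy model of asynchronous replication. Everything here is a simplified assumption for illustration (a two-node pair, a replication backlog, the class name `AsyncReplicatedStore`); it does not model how Redis Sentinel or Cluster actually elect a new primary.

```python
from collections import deque


class AsyncReplicatedStore:
    """Toy primary/replica pair: writes hit the primary and replicate later."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self._backlog = deque()  # acknowledged writes not yet on the replica

    def write(self, host_id: str, ts: float) -> None:
        self.primary[host_id] = ts
        self._backlog.append((host_id, ts))

    def replicate(self, max_items=None) -> None:
        """Ship up to max_items backlog entries (all of them if None)."""
        pending = len(self._backlog)
        n = pending if max_items is None else min(max_items, pending)
        for _ in range(n):
            host_id, ts = self._backlog.popleft()
            self.replica[host_id] = ts

    def fail_over(self) -> None:
        """Primary dies: promote the replica; unreplicated writes are lost."""
        self.primary = self.replica
        self.replica = dict(self.primary)
        self._backlog.clear()


store = AsyncReplicatedStore()
hosts = ["h1", "h2", "h3"]
for h in hosts:
    store.write(h, 10.0)
store.replicate()              # replica fully caught up at t=10
for h in hosts:
    store.write(h, 12.0)
store.replicate(max_items=1)   # only h1's t=12 beat is shipped before the crash
store.fail_over()
print(store.primary)           # prints {'h1': 12.0, 'h2': 10.0, 'h3': 10.0}
for h in hosts:
    store.write(h, 14.0)       # next heartbeat round overwrites the stale rows
print(store.primary)           # prints {'h1': 14.0, 'h2': 14.0, 'h3': 14.0}
```

After promotion two hosts look one beat older than they are, which stays well inside a timeout of several missed intervals, and the next heartbeat round restores fresh state without any recovery logic.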
### Follow-up Questions
- A network partition cuts half the fleet off from the central service. Those hosts are alive but look "down." How do you prevent mass false-positive failures and a thundering-herd reschedule when the partition heals?
- The scheduler needs a consistent view of free capacity to avoid over-committing a host (two jobs placed into the same free slot). How do you make "read capacity + reserve it" atomic without serializing all reads?
- You now need 30 days of per-host utilization history for capacity planning. How do you add that without slowing the hot heartbeat path?
- A host's clock is skewed, so its reported `last_heartbeat_ts` looks stale or in the future. How do you make liveness detection robust to clock skew (server-assigned timestamps, monotonic generations)?
Quick Answer: This question assesses the ability to design a distributed control plane that tracks live host state across a large compute fleet under high write concurrency. It is commonly used in system design interviews to evaluate reasoning about data storage schema choices, scalability as fleet size grows, and failure detection and availability when a central coordinator goes down. It represents a practical, applied system design scenario rather than a purely conceptual one.