What difficulty level is this interview question?

This is a medium difficulty System Design question, commonly asked during Technical Screen rounds at NVIDIA.

What role is this question designed for?

This question is commonly asked for Software Engineer candidates at NVIDIA during technical interviews.

Design the Control Plane for a Compute Cluster: Host Monitoring + Job Dispatch

This question assesses the ability to design a distributed control plane that tracks live host state across a large compute fleet under high write concurrency. It is commonly used in system design interviews to evaluate reasoning about data storage schema choices, scalability as fleet size grows, and failure detection and availability when a central coordinator goes down. It represents a practical, applied system design scenario rather than a purely conceptual one.

You are building the control plane for a compute cluster that runs batch jobs (for example HPC / GPU workloads). A single logical central service sits above a fleet of worker hosts. The central service must continuously know the live state of every host — whether it is up, how busy it is, and how much spare capacity it has — so that it can dispatch jobs to hosts that have room.

Focus on the control plane, not the placement algorithm. Concretely, design:

how the central service learns and tracks the state of each host,
what data store holds host state and what the schema looks like,
how the system stays correct and fast under high write concurrency (every host reporting frequently),
how it scales as the fleet grows from ~1,000 toward ~10,000 hosts, and
how it stays available when the central service node fails.

Assume the scheduler reads host state and picks a target host; the placement policy itself (bin-packing, priority, fairness) is out of scope.

Constraints & Assumptions

Fleet size: start at ~1,000 hosts; design to grow toward ~10,000.
Each host reports its status periodically (heartbeat), e.g. every 1-5 seconds.
Per-host state is small (CPU/GPU utilization, free memory, running-job count, health) — on the order of a few hundred bytes.
A dead host must be detected within a few heartbeat intervals (single-digit seconds).
Both writes (heartbeats) and reads (placement lookups) are frequent; placement can tolerate 1-2 s of staleness.
Host state is reconstructable from heartbeats; it does not have to survive as durable record-of-truth unless stated otherwise.
The scheduling/placement algorithm is out of scope.
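
The constraints above translate directly into a write load, which is worth computing before choosing a store. The sketch below does that arithmetic; the 300-byte payload is an assumed stand-in for "a few hundred bytes", and `heartbeat_load` is a hypothetical helper name.

```python
def heartbeat_load(hosts: int, interval_s: float, payload_bytes: int = 300) -> tuple[float, float]:
    """Return (writes per second, payload bytes per second) for a fleet.

    payload_bytes=300 is an assumption standing in for "a few hundred bytes".
    """
    writes_per_s = hosts / interval_s
    return writes_per_s, writes_per_s * payload_bytes


# Print the load at both ends of the stated fleet size and heartbeat interval.
for hosts in (1_000, 10_000):
    for interval_s in (1, 5):
        qps, bps = heartbeat_load(hosts, interval_s)
        print(f"{hosts:>6} hosts @ {interval_s}s: {qps:>6.0f} writes/s, {bps / 1e6:.2f} MB/s")
```

Under these assumptions the worst case (10,000 hosts at a 1 s interval) is on the order of 10,000 small writes per second, so the concern is request rate and contention rather than bandwidth.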

Clarifying Questions to Ask

What heartbeat interval and failure-detection latency are acceptable? (Drives write QPS and timeout thresholds.)
How fresh must host state be for placement — is 1-2 s of staleness fine, or must reads be strongly consistent?
Is the fleet in a single data center / region, or geographically distributed?
What is the durability requirement for host state — purely ephemeral, or must it survive a full control-plane restart?
How many concurrent placement readers exist (one scheduler, or many)?
What should happen to in-flight jobs when a host or the central service restarts?

Part 1 — Host ↔ central service communication

Design the protocol by which the central service learns each host's status. Cover push vs pull, the heartbeat payload, how a new host registers, and how a dead/unresponsive host is detected and marked down.
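
One possible shape for a push-based answer is sketched below: hosts send a small heartbeat, the first heartbeat doubles as registration, and a host is considered down after a configurable number of missed intervals. All names here (`Heartbeat`, `LivenessTracker`, `generation`, `seq`) are hypothetical, and the 2 s interval with 3 missed beats is an assumed setting, not a requirement from the prompt.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Heartbeat:
    host_id: str
    generation: int   # assumed: bumped each time the host agent restarts
    seq: int          # assumed: increases with every heartbeat in a generation
    cpu_util: float
    gpu_util: float
    free_mem_mb: int
    running_jobs: int
    healthy: bool


class LivenessTracker:
    def __init__(self, interval_s: float = 2.0, missed_allowed: int = 3) -> None:
        self.timeout_s = interval_s * missed_allowed
        self.last_seen: Dict[str, float] = {}
        self.last_order: Dict[str, Tuple[int, int]] = {}

    def on_heartbeat(self, hb: Heartbeat, now: float) -> bool:
        """Record a heartbeat; an unknown host_id is registered implicitly.

        Returns False for duplicates or out-of-order deliveries.
        """
        order = (hb.generation, hb.seq)
        if order <= self.last_order.get(hb.host_id, (-1, -1)):
            return False
        self.last_order[hb.host_id] = order
        # Use the server's receive time, not a host-reported timestamp.
        self.last_seen[hb.host_id] = now
        return True

    def down_hosts(self, now: float) -> List[str]:
        """Hosts whose last accepted heartbeat is older than the timeout."""
        return sorted(h for h, t in self.last_seen.items() if now - t > self.timeout_s)
```

A periodic sweep calling `down_hosts` would mark hosts down; passing `now` in explicitly keeps the logic testable and makes it easy to drive from a monotonic clock.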

Part 2 — Data store and schema

Choose the data store(s) for host state and define the schema. Justify the choice given the read/write pattern and freshness requirements.

Clarifying Questions for this Part

Do any historical host metrics need to be queried later (capacity planning, debugging), or does only the latest state matter?
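
As a concrete starting point for the schema discussion, the sketch below models a latest-state table with one row per host. It uses an in-memory SQLite database purely so the example runs on its own; the column names, the `status` values, and the `version` counter are assumptions, and an in-memory key-value store with the same fields would be an equally reasonable answer given that host state is reconstructable from heartbeats.

```python
import sqlite3

# Hypothetical schema: one row per host, overwritten on every heartbeat.
SCHEMA = """
CREATE TABLE host_state (
    host_id           TEXT PRIMARY KEY,
    status            TEXT NOT NULL,     -- 'up' | 'suspect' | 'down' (assumed values)
    cpu_util          REAL NOT NULL,
    gpu_util          REAL NOT NULL,
    free_mem_mb       INTEGER NOT NULL,
    running_jobs      INTEGER NOT NULL,
    last_heartbeat_ts REAL NOT NULL,     -- receive time assigned by the central service
    version           INTEGER NOT NULL   -- bumped on every write
);
CREATE INDEX idx_host_status_mem ON host_state (status, free_mem_mb);
"""


def open_store() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def upsert_host(conn, host_id, cpu_util, gpu_util, free_mem_mb, running_jobs, now):
    """Overwrite the host's row with its latest heartbeat and bump the version."""
    row = conn.execute(
        "SELECT version FROM host_state WHERE host_id = ?", (host_id,)
    ).fetchone()
    version = row[0] + 1 if row else 1
    conn.execute(
        "INSERT OR REPLACE INTO host_state VALUES (?, 'up', ?, ?, ?, ?, ?, ?)",
        (host_id, cpu_util, gpu_util, free_mem_mb, running_jobs, now, version),
    )


def hosts_with_room(conn, min_free_mem_mb):
    """Placement-side read: live hosts with at least the requested free memory."""
    rows = conn.execute(
        "SELECT host_id FROM host_state "
        "WHERE status = 'up' AND free_mem_mb >= ? ORDER BY host_id",
        (min_free_mem_mb,),
    )
    return [r[0] for r in rows]
```

Keeping only the latest row per host keeps the table size proportional to the fleet, not to heartbeat volume; historical metrics, if needed, would go to a separate store.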

Part 3 — High write concurrency

With 1,000+ hosts each writing every 1-5 s, plus frequent placement reads, explain how you keep updates correct and fast, and how you avoid the central service or the store becoming a bottleneck.
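
One technique a candidate might reach for is partitioning host state so that writers for different hosts rarely contend. The sketch below illustrates lock striping over an in-memory map, with a per-host sequence number so a late or duplicate heartbeat cannot overwrite newer state. `ShardedHostStore`, the shard count, and the `seq` field are all assumptions made for illustration.

```python
import threading
import zlib


class ShardedHostStore:
    """Host state split across shards, each guarded by its own lock."""

    def __init__(self, num_shards: int = 16) -> None:
        self._shards = [dict() for _ in range(num_shards)]
        self._locks = [threading.Lock() for _ in range(num_shards)]

    def _index(self, host_id: str) -> int:
        # crc32 is stable across processes, unlike Python's built-in hash().
        return zlib.crc32(host_id.encode()) % len(self._shards)

    def put(self, host_id: str, seq: int, state: dict) -> bool:
        """Store state unless an equal or newer seq is already present."""
        i = self._index(host_id)
        with self._locks[i]:
            current = self._shards[i].get(host_id)
            if current is not None and current[0] >= seq:
                return False
            self._shards[i][host_id] = (seq, dict(state))
            return True

    def get(self, host_id: str):
        i = self._index(host_id)
        with self._locks[i]:
            entry = self._shards[i].get(host_id)
            return None if entry is None else dict(entry[1])

    def snapshot(self) -> dict:
        """Copy shard by shard; the result may mix slightly different instants."""
        out = {}
        for lock, shard in zip(self._locks, self._shards):
            with lock:
                out.update({h: dict(s) for h, (_, s) in shard.items()})
        return out
```

Because placement tolerates 1-2 s of staleness, `snapshot` does not need a global lock: a scheduler can refresh its copy periodically and read from that copy without touching the write path.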

Part 4 — Scaling and high availability

Scale the central service toward 10k hosts and remove the single point of failure.
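
One way to remove the single point of failure is to run several control-plane instances and give each responsibility for a slice of the fleet. The sketch below uses rendezvous (highest-random-weight) hashing, chosen here as an illustrative technique, so that when an instance fails only the hosts it owned are reassigned. The instance names and the `owner` function are hypothetical, and membership of the live instance list is assumed to come from some external mechanism.

```python
import hashlib
from typing import List


def owner(host_id: str, instances: List[str]) -> str:
    """Pick the control-plane instance responsible for a host.

    Each (instance, host) pair gets a deterministic score; the highest wins.
    Removing an instance only moves the hosts that instance was winning.
    """
    def score(instance: str) -> int:
        digest = hashlib.sha256(f"{instance}|{host_id}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return max(instances, key=score)


instances = ["cp-a", "cp-b", "cp-c"]          # hypothetical instance names
hosts = [f"host-{i}" for i in range(1000)]
before = {h: owner(h, instances) for h in hosts}

# Simulate cp-b failing: hosts owned by cp-a or cp-c stay where they were.
survivors = [i for i in instances if i != "cp-b"]
after = {h: owner(h, survivors) for h in hosts}
moved = sum(1 for h in hosts if before[h] != after[h])
print(f"{moved} of {len(hosts)} hosts changed owner")
```

Since host state is rebuilt from heartbeats, a surviving instance that inherits hosts can repopulate their state within a few heartbeat intervals instead of needing a durable handoff.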

Follow-up Questions

A network partition cuts half the fleet off from the central service. Those hosts are alive but look "down." How do you prevent mass false-positive failures and a thundering-herd reschedule when the partition heals?
The scheduler needs a consistent view of free capacity to avoid over-committing a host (two jobs placed into the same free slot). How do you make "read capacity + reserve it" atomic without serializing all reads?
You now need 30 days of per-host utilization history for capacity planning. How do you add that without slowing the hot heartbeat path?
A host's clock is skewed, so its reported last_heartbeat_ts looks stale or in the future. How do you make liveness detection robust to clock skew (server-assigned timestamps, monotonic generations)?
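
For the over-commit follow-up, one direction is optimistic concurrency: schedulers read capacity freely, and the reservation only succeeds if the host's version has not changed since the read. The sketch below shows that compare-and-set pattern in memory; `ReservationTable`, `free_slots`, and `version` are hypothetical names, and a real system would typically rely on the conditional-write primitive of whichever store it uses.

```python
import threading
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Capacity:
    free_slots: int
    version: int


class ReservationTable:
    def __init__(self) -> None:
        self._hosts: Dict[str, Capacity] = {}
        self._lock = threading.Lock()

    def set_capacity(self, host_id: str, free_slots: int) -> None:
        with self._lock:
            old = self._hosts.get(host_id)
            self._hosts[host_id] = Capacity(free_slots, old.version + 1 if old else 1)

    def read(self, host_id: str) -> Tuple[int, int]:
        """Return (free_slots, version); no lock is held after this returns."""
        with self._lock:
            cap = self._hosts[host_id]
        return cap.free_slots, cap.version

    def try_reserve(self, host_id: str, expected_version: int, slots: int) -> bool:
        """Reserve only if nothing changed since the caller's read."""
        with self._lock:
            cap = self._hosts[host_id]
            if cap.version != expected_version or cap.free_slots < slots:
                return False
            self._hosts[host_id] = Capacity(cap.free_slots - slots, cap.version + 1)
            return True
```

If two schedulers read the same version and both try to take the last slot, exactly one `try_reserve` succeeds; the other re-reads and picks again, so reads never block each other.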
