
Design the Control Plane for a Compute Cluster: Host Monitoring + Job Dispatch

Last updated: Jul 1, 2026

Quick Overview

This question assesses the ability to design a distributed control plane that tracks live host state across a large compute fleet under high write concurrency. It is commonly used in system design interviews to evaluate reasoning about data storage schema choices, scalability as fleet size grows, and failure detection and availability when a central coordinator goes down. It represents a practical, applied system design scenario rather than a purely conceptual one.


Company: NVIDIA

Role: Software Engineer

Category: System Design

Difficulty: medium

Interview Round: Technical Screen

# Design the Control Plane for a Compute Cluster: Host Monitoring + Job Dispatch

You are building the **control plane** for a compute cluster that runs batch jobs (for example HPC / GPU workloads). A single logical **central service** sits above a fleet of **worker hosts**. The central service must continuously know the live state of every host (whether it is up, how busy it is, and how much spare capacity it has) so that it can dispatch jobs to hosts that have room.

Focus on the control plane, **not** the placement algorithm. Concretely, design:

- how the central service learns and tracks the state of each host,
- what data store holds host state and what the **schema** looks like,
- how the system stays correct and fast under **high write concurrency** (every host reporting frequently),
- how it **scales** as the fleet grows to 1,000+ hosts, and
- how it stays available when the central service node fails.

Assume the scheduler reads host state and picks a target host; the placement policy itself (bin-packing, priority, fairness) is out of scope.

### Constraints & Assumptions

- Fleet size: start at ~1,000 hosts; design to grow toward ~10,000.
- Each host reports its status periodically (heartbeat), e.g. every 1-5 seconds.
- Per-host state is small (CPU/GPU utilization, free memory, running-job count, health), on the order of a few hundred bytes.
- A dead host must be detected within a few heartbeat intervals (single-digit seconds).
- Both writes (heartbeats) and reads (placement lookups) are frequent; placement can tolerate 1-2 s of staleness.
- Host state is reconstructable from heartbeats; it does not have to survive as durable record-of-truth unless stated otherwise.
- The scheduling/placement algorithm is out of scope.

### Clarifying Questions to Ask

- What heartbeat interval and failure-detection latency are acceptable? (Drives write QPS and timeout thresholds.)
- How fresh must host state be for placement: is 1-2 s of staleness fine, or must reads be strongly consistent?
- Is the fleet in a single data center / region, or geographically distributed?
- What is the durability requirement for host state: purely ephemeral, or must it survive a full control-plane restart?
- How many concurrent placement readers exist (one scheduler, or many)?
- What should happen to in-flight jobs when a host or the central service restarts?

### Part 1: Host ↔ central service communication

Design the protocol by which the central service learns each host's status. Cover push vs pull, the heartbeat payload, how a new host registers, and how a dead/unresponsive host is detected and marked down.

```hint Direction of the connection
Prefer hosts *pushing* heartbeats to the central service over the central service *polling* 1,000+ hosts. Push scales with the fleet and gives you a free liveness signal; reason about what a *missed* heartbeat should mean.
```

```hint Failure detection
Don't declare a host dead on one missed beat. Use a timeout of N missed intervals, and consider a lightweight lease/TTL so "down" happens automatically when beats stop, instead of requiring the central service to actively probe.
```

### Part 2: Data store and schema

Choose the data store(s) for host state and define the schema. Justify the choice given the read/write pattern and freshness requirements.

```hint Match the store to the access pattern
The hot working set is one small row per host, keyed by host id, read and overwritten constantly. That points at an in-memory key-value store for the live table. Separately ask whether *any* of it must be durable.
```

#### Clarifying Questions for this Part

- Does any historical host metric need to be queried later (capacity planning, debugging), or only the latest state matters?
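
One way to picture the live host table described in this part is a single small record per host, keyed by host id. The Python sketch below is illustrative only: the field names (`cpu_util`, `free_mem_mb`, and so on), the `HostTable` class, and the default 2-second interval with a 3-missed-beat timeout are assumptions, not part of the question. It stamps each heartbeat with the server's receive time and treats a host as down once that stamp is older than the timeout.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class HostState:
    """Latest reported state for one host (hypothetical field names)."""
    host_id: str
    cpu_util: float       # fraction in [0.0, 1.0]
    gpu_util: float       # fraction in [0.0, 1.0]
    free_mem_mb: int
    running_jobs: int
    healthy: bool
    last_seen: float = 0.0  # set by the server on receipt, never by the host


class HostTable:
    """In-memory live table: one row per host, overwritten on every heartbeat."""

    def __init__(self, heartbeat_interval_s: float = 2.0, missed_beats: int = 3):
        # A host counts as down after `missed_beats` intervals without a heartbeat.
        self.timeout_s = heartbeat_interval_s * missed_beats
        self._rows: Dict[str, HostState] = {}

    def upsert(self, state: HostState, now: float) -> None:
        # Server-assigned timestamp, so host clock skew cannot affect liveness.
        state.last_seen = now
        self._rows[state.host_id] = state

    def live_hosts(self, now: float) -> List[HostState]:
        # Hosts that report healthy and have been heard from within the timeout.
        return [
            s for s in self._rows.values()
            if s.healthy and now - s.last_seen <= self.timeout_s
        ]
```

In a real deployment the dictionary would likely be replaced by an in-memory key-value store with a per-key TTL, so that expiry happens in the store rather than in application code.
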

### Part 3: High write concurrency

1,000+ hosts each writing every 1-5 s, plus placement reads. Explain how you keep updates correct and fast, and avoid the central service or the store becoming a bottleneck.

```hint Make each update independent
A heartbeat from host A never conflicts with host B. Exploit that: partition/shard by host id so writes spread out and never serialize on a single lock.
```

```hint Atomic per-key updates
If a heartbeat needs read-modify-write on a host's row (e.g. derived fields), do it atomically server-side (a Redis Lua script or a single atomic command) rather than read-then-write round trips that race.
```

### Part 4: Scaling and high availability

Scale the central service toward 10k hosts and remove the single point of failure.

```hint Stateless front, replicated state
Make the central service tier stateless so you can run many instances behind a load balancer, and push the hard state into a replicated store. The availability question then reduces to "how is the *store* replicated and failed over?"
```

```hint Replication + failover, with a caveat
Primary/replica with automatic failover (e.g. Redis Sentinel / Cluster) removes the SPOF, but name the trade-off: async replication can drop the last few heartbeats on failover, which is usually fine because heartbeats self-heal within one interval.
```

### Follow-up Questions

- A network partition cuts half the fleet off from the central service. Those hosts are alive but look "down." How do you prevent mass false-positive failures and a thundering-herd reschedule when the partition heals?
- The scheduler needs a consistent view of free capacity to avoid over-committing a host (two jobs placed into the same free slot). How do you make "read capacity + reserve it" atomic without serializing all reads?
- You now need 30 days of per-host utilization history for capacity planning. How do you add that without slowing the hot heartbeat path?
- A host's clock is skewed, so its reported `last_heartbeat_ts` looks stale or in the future. How do you make liveness detection robust to clock skew (server-assigned timestamps, monotonic generations)?
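
For the over-commit follow-up, one possible direction is to make the check and the decrement a single atomic step per host, so that two schedulers racing for the same slot cannot both win, while reservations on different hosts do not block each other. The sketch below is a single-process illustration using one lock per host; `CapacityBook`, `set_free_slots`, and `try_reserve` are hypothetical names. In a distributed store the same idea would typically be expressed as a server-side atomic operation or a compare-and-set on the host's row.

```python
import threading
from typing import Dict


class CapacityBook:
    """Tracks free job slots per host with per-host (not global) locking."""

    def __init__(self) -> None:
        self._free: Dict[str, int] = {}
        self._locks: Dict[str, threading.Lock] = {}
        # Guards only the lock registry; held briefly, never during a reservation.
        self._registry_lock = threading.Lock()

    def _lock_for(self, host_id: str) -> threading.Lock:
        with self._registry_lock:
            return self._locks.setdefault(host_id, threading.Lock())

    def set_free_slots(self, host_id: str, free: int) -> None:
        with self._lock_for(host_id):
            self._free[host_id] = free

    def try_reserve(self, host_id: str, slots: int = 1) -> bool:
        # Check and decrement under the same per-host lock, so the pair is atomic.
        with self._lock_for(host_id):
            free = self._free.get(host_id, 0)
            if free < slots:
                return False
            self._free[host_id] = free - slots
            return True
```

Note that a heartbeat which blindly overwrites the free-slot count could undo a reservation made moments earlier; reconciling reported capacity with outstanding reservations is left out of this sketch.
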


Jun 20, 2026, 12:00 AM
