Design a production system that can safely batch reboot N machines in a fleet.
Context: You operate a large fleet of machines used for production workloads. Operators need a reliable way to reboot many machines, for example after kernel upgrades, hardware remediation, firmware updates, or node recovery. The system must avoid taking down too much capacity at once and must provide visibility into progress and failures.
Address the following:
- How users submit a batch reboot request.
- How the system selects and validates the target machines.
- How to schedule reboots in safe batches or waves.
- How to prevent service-impacting outages.
- How to track machine state before, during, and after reboot.
- How to handle failures, retries, timeouts, and partial completion.
- How the system should integrate with infrastructure such as Kubernetes or a machine inventory service.
- What observability, auditability, and safety controls are required.
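
To make the "safe batches or waves" requirement concrete, here is a minimal sketch of one possible wave planner. It is an illustration under stated assumptions, not a prescribed design: the function name `plan_waves`, the `max_per_group` limit, and the idea of tagging each machine with a single group (for example a service or rack) are all hypothetical and do not come from the prompt. A real answer would likely also consult live health and capacity data before starting each wave.

```python
from collections import defaultdict
from typing import Dict, List


def plan_waves(machines: Dict[str, str], max_per_group: int) -> List[List[str]]:
    """Split machines into waves so that no wave contains more than
    max_per_group machines from the same group.

    machines maps a machine name to its group name (hypothetical model).
    """
    if max_per_group < 1:
        raise ValueError("max_per_group must be at least 1")

    # Bucket machines by group; sorting keeps the plan deterministic.
    by_group: Dict[str, List[str]] = defaultdict(list)
    for name in sorted(machines):
        by_group[machines[name]].append(name)

    waves: List[List[str]] = []
    for group in sorted(by_group):
        members = by_group[group]
        # The i-th chunk of each group is placed in the i-th wave.
        for start in range(0, len(members), max_per_group):
            wave_index = start // max_per_group
            while len(waves) <= wave_index:
                waves.append([])
            waves[wave_index].extend(members[start:start + max_per_group])
    return waves


# Hypothetical fleet: three "web" machines and two "db" machines.
fleet = {
    "m1": "web", "m2": "web", "m3": "web",
    "m4": "db", "m5": "db",
}
waves = plan_waves(fleet, max_per_group=1)
for i, wave in enumerate(waves):
    print(f"wave {i}: {wave}")
```

With `max_per_group=1`, each wave reboots at most one machine per group, so the fleet above needs three waves. Raising the limit trades safety margin for speed, which is the tension the scheduling and outage-prevention questions ask you to reason about.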