Design a monitoring system that collects metrics from 1,000 servers every 10 minutes.
The system should periodically contact each server, collect metrics such as CPU usage, memory usage, disk usage, and application health, and store the results for querying, dashboards, and alerting.
Focus especially on the worker design used to execute the metric-collection jobs. Address:
- How jobs are scheduled every 10 minutes.
- How workers are assigned to collect from the 1,000 servers.
- How to implement the worker execution using multithreading or a worker pool.
- How to handle timeouts, retries, partial failures, and slow servers.
- How to avoid duplicate or overlapping collection runs.
- How metrics are stored and made available for dashboards and alerts.
- How the design would scale if the number of servers increased significantly.