Design a metrics collection and alerting system (like a simplified monitoring platform).
Functional requirements:
-
Collect time-series metrics from many services/hosts (e.g., counters, gauges, timers).
-
Support
near real-time querying/dashboards
(low-latency queries over recent data).
-
Support
offline/analytical queries
over long time ranges (heavier aggregations, historical analysis).
-
Support
alerting
: user-defined rules (e.g., threshold, rate of change) with notifications.
Non-functional requirements:
-
High write throughput, horizontal scalability, high availability.
-
Handle spikes, backpressure, and partial failures.
-
Reasonable multi-tenancy and access control.
Provide an end-to-end architecture, key data models, storage choices, and how alert evaluation works. Discuss tradeoffs (e.g., Lambda/Kappa style), retention, and handling high-cardinality metrics.