You are asked to design a monitoring system used company-wide.
Goals
- Collect and query telemetry for many services/hosts
- Support alerting and dashboards for engineers and SREs
- Handle high ingestion volume and multi-team usage
Telemetry types to cover
- Metrics (time series)
- Logs
- Distributed traces (optional but strong)
Discuss
- Key requirements (SLOs, latency, retention, tenancy)
- High-level architecture and data flow
- Storage choices and scaling
- Alerting pipeline and reliability
- Operational concerns (cardinality, cost controls, backpressure)
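
As a starting point for the operational-concerns discussion, here is a minimal sketch of how an ingestion tier might combine a per-tenant cardinality cap with simple backpressure. It is one possible approach under stated assumptions, not a prescribed answer: the class name `IngestBuffer`, its parameters, and the status strings are all hypothetical, and a real system would track series in a shared store and buffer in a durable queue rather than in process memory.

```python
from collections import defaultdict, deque


class IngestBuffer:
    """Hypothetical in-memory ingest buffer (illustrative only).

    Enforces a per-tenant limit on distinct time series (cardinality control)
    and rejects new samples once the queue is full (backpressure).
    """

    def __init__(self, max_series_per_tenant, max_queue_len):
        self.max_series_per_tenant = max_series_per_tenant
        self.max_queue_len = max_queue_len
        self.series_seen = defaultdict(set)  # tenant -> set of series keys
        self.queue = deque()                 # samples waiting to be written

    def offer(self, tenant, metric, labels, value):
        """Try to enqueue one sample and return a status string."""
        # Backpressure: signal the caller to retry later or drop the sample.
        if len(self.queue) >= self.max_queue_len:
            return "rejected_backpressure"

        # A series is identified by its metric name plus its sorted label pairs.
        key = (metric, tuple(sorted(labels.items())))
        seen = self.series_seen[tenant]
        if key not in seen:
            # Cardinality control: refuse series beyond the tenant's budget.
            if len(seen) >= self.max_series_per_tenant:
                return "rejected_cardinality"
            seen.add(key)

        self.queue.append((tenant, key, value))
        return "accepted"

    def drain(self, batch_size):
        """Remove and return up to batch_size samples for the storage writer."""
        batch = []
        while self.queue and len(batch) < batch_size:
            batch.append(self.queue.popleft())
        return batch
```

In this sketch a full queue is checked before the cardinality budget, so an overloaded pipeline sheds load without mutating tenant state. Whether to reject, sample, or aggregate over-budget series is a design choice worth discussing explicitly.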