Design a Metrics Monitoring System for Large-Scale Services
Context
You are designing a metrics monitoring system for large-scale, cloud-native microservices running across multiple regions and clusters. Services are ephemeral (containers/autoscaling), and the platform is multi-tenant (infra teams, ML/feature teams, product services). Assume on the order of tens of thousands of hosts and hundreds of thousands of service instances, with strict SLOs for data freshness and alerting.
Requirements
- Compare push vs. pull metrics collection models:
  - When to choose each.
  - Impacts on reliability, backpressure, service discovery, network usage, and failure isolation.
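As one way to ground the comparison, here is a minimal sketch of the pull model's failure-isolation property: the collector owns the schedule, so a slow or dead target cannot backpressure the pipeline — the scrape simply times out and that target alone is recorded as down. The endpoint path and timeout are illustrative assumptions, not a prescribed API.

```python
import urllib.request

def scrape_once(targets, timeout_s=2.0):
    """Pull-model sketch: fetch each target's /metrics endpoint.

    Failures are isolated per target: an unreachable instance yields
    None for that target and does not block the rest of the scrape.
    """
    results = {}
    for target in targets:
        try:
            with urllib.request.urlopen(f"http://{target}/metrics",
                                        timeout=timeout_s) as resp:
                results[target] = resp.read().decode()
        except Exception:
            results[target] = None  # mark target down; no inline retries
    return results
```

A push model inverts this: clients control timing and retries, which suits short-lived jobs and restrictive networks, but shifts backpressure and overload protection onto the ingestion tier.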
- Describe the end-to-end architecture:
  - Client libraries/agents (e.g., SDK or node agent/sidecar).
  - Ingestion layer (APIs, gateways), queueing, and streaming aggregation.
  - Time-series storage, query layer, alerting, dashboards, and SLOs.
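The contract between ingestion and the query layer can be sketched with an assumed sample schema: samples are keyed by a stable series identity (metric name plus sorted label pairs) and appended in time order per series. This in-memory toy stands in for the real time-series store; all names here are hypothetical.

```python
import bisect
import collections
from typing import NamedTuple

class Sample(NamedTuple):
    name: str
    labels: tuple   # sorted (key, value) pairs for a stable series identity
    ts_ms: int
    value: float

class InMemoryTSDB:
    """Toy write/query path: per-series sorted lists of (ts, value)."""

    def __init__(self):
        self._series = collections.defaultdict(list)

    def write(self, s: Sample):
        sid = (s.name, s.labels)
        # insort tolerates mildly out-of-order arrivals from the queue
        bisect.insort(self._series[sid], (s.ts_ms, s.value))

    def query_range(self, name, labels, start_ms, end_ms):
        sid = (name, tuple(sorted(labels.items())))
        pts = self._series.get(sid, [])
        lo = bisect.bisect_left(pts, (start_ms, float("-inf")))
        hi = bisect.bisect_right(pts, (end_ms, float("inf")))
        return pts[lo:hi]
```

In a real system the same schema flows through the gateway and queue, and the query layer adds label-index lookups, distributed fan-out, and caching on top of this per-series range scan.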
- Propose aggregation and rollup strategies at each layer:
  - Client-side, agent-side, stream processors, storage-side.
  - Handling high-cardinality labels, downsampling, late/out-of-order data, retention policies, and backfill.
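A common storage-side rollup, sketched here under assumptions (bucket width and the retained statistics are illustrative): collapse raw points into fixed-width time buckets keeping min/max/sum/count, so coarser retention tiers can still answer avg, min, and max queries without the raw data.

```python
def downsample(points, bucket_ms):
    """Roll raw (ts_ms, value) points into fixed-width buckets.

    Keeps min/max/sum/count per bucket; input need not be sorted,
    which also absorbs late/out-of-order arrivals within the window.
    """
    buckets = {}
    for ts, v in points:
        b = ts - ts % bucket_ms  # align to bucket start
        st = buckets.setdefault(b, {"min": v, "max": v, "sum": 0.0, "count": 0})
        st["min"] = min(st["min"], v)
        st["max"] = max(st["max"], v)
        st["sum"] += v
        st["count"] += 1
    return dict(sorted(buckets.items()))
```

The same shape works in a stream processor (windowed aggregation before storage) or agent-side (pre-aggregating per-host before ship), trading query fidelity for write and storage cost.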
- Provide a capacity plan, sharding and replication strategy, and multi-tenant isolation.
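A back-of-envelope capacity calculation is expected here; the inputs below (instance count, series per instance, scrape interval, compressed bytes per sample) are purely illustrative assumptions consistent with the stated scale, not requirements.

```python
# Illustrative assumptions -- replace with your own measured numbers.
instances = 100_000            # service instances (order of magnitude from Context)
series_per_instance = 200      # assumed active series per instance
scrape_interval_s = 15         # assumed collection interval
bytes_per_sample = 2           # assumed average after timestamp/value compression

active_series = instances * series_per_instance        # total active series
samples_per_s = active_series / scrape_interval_s      # ingest rate
raw_bytes_per_day = samples_per_s * bytes_per_sample * 86_400

print(f"{active_series:,} series, {samples_per_s:,.0f} samples/s, "
      f"{raw_bytes_per_day / 1e9:.0f} GB/day compressed")
```

Numbers like these then drive the sharding decision (e.g., hash series identity across ingest shards, replicate each shard for availability) and per-tenant quotas on series count and sample rate.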
- Explain how you would test and monitor the monitoring system itself.
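One concrete self-monitoring technique worth describing is an end-to-end synthetic probe: write a heartbeat metric through the full pipeline and alert if it cannot be read back within a freshness SLO. The metric name, SLO threshold, and callback interfaces below are assumptions for illustration.

```python
import time

def check_freshness(write_fn, read_fn, slo_s=60):
    """Synthetic end-to-end probe for the monitoring pipeline itself.

    write_fn(name, value) pushes through the normal ingest path;
    read_fn(name) returns the most recent stored value (a timestamp).
    Returns False when read-back lag exceeds the freshness SLO,
    which should page the on-call via an independent channel.
    """
    write_fn("meta.heartbeat", time.time())
    latest = read_fn("meta.heartbeat")
    lag = time.time() - (latest or 0)
    return lag <= slo_s
```

Because this probe exercises ingest, storage, and query together, it catches whole-pipeline failures that per-component metrics miss; the alert on it must route outside the system under test.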
Make minimal, explicit assumptions as needed and call out trade-offs and guardrails.