System Design: Metrics Monitoring Platform
Context
Design a cloud‑native, multi‑tenant metrics monitoring system for internal services. The system must support counters, gauges, and histograms with labels/tags, ingest via pull and push, provide a query language, alerting, dashboards, and strong operational characteristics (HA, scale, quotas, isolation).
You may assume an illustrative scale (adjust as needed):
-
Aggregate ingest: up to 10M samples/sec across tenants.
-
Retention: 7 days hot, 12 months cold.
-
Query SLO: p99 < 2s for 6h range queries.
-
Availability target: 99.9%.
Requirements
-
Ingestion
-
Collect numeric metrics (counter/gauge/histogram) with labels/tags.
-
Support pull (scraping endpoints) and push ingestion.
-
Handle high throughput and provide backpressure.
-
Storage
-
Efficient time‑series storage with compression.
-
Retention tiers: hot vs. cold storage; support downsampling.
-
Query
-
Provide a query language for aggregations, label filtering, and downsampling.
-
Support federated queries across hot/cold tiers.
-
Alerting
-
Threshold and SLO‑based alerts; silencing, deduplication, routing.
-
Operations
-
High availability, horizontal scalability, and multi‑tenant isolation.
-
Control cardinality growth; enforce quotas and rate limits.
-
Expose dashboards and APIs.
-
Architecture Deep Dives
-
Discuss data model, sharding, indexing.
-
Detail write/read paths, failure handling, and consistency choices.
Deliverables
-
End‑to‑end architecture proposal with components and data flow.
-
Rationale and trade‑offs for key design choices.
-
Guardrails for cardinality, quotas, and backpressure.
-
Failure scenarios and recovery strategies.
-
API surface and operability plan (dashboards, SLOs).