Design a metrics monitoring system

This is a System Design interview question from Current for Software Engineer roles.

Question

System Design: Metrics Monitoring Platform

Context

Design a cloud‑native, multi‑tenant metrics monitoring system for internal services. The system must support counters, gauges, and histograms with labels/tags, ingest via pull and push, provide a query language, alerting, dashboards, and strong operational characteristics (HA, scale, quotas, isolation).

You may assume an illustrative scale (adjust as needed):

- Aggregate ingest: up to 10M samples/sec across tenants.
- Retention: 7 days hot, 12 months cold.
- Query SLO: p99 < 2s for 6h range queries.
- Availability target: 99.9%.
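
A quick back-of-envelope calculation can make these numbers concrete before any design work starts. The sketch below is illustrative only: the ingest rate, hot retention, and availability target come from the list above, while `BYTES_PER_SAMPLE` is an assumed compressed size per sample (not given in the question) and the function names are hypothetical.

```python
# Back-of-envelope sizing for the illustrative scale above.
SAMPLES_PER_SEC = 10_000_000   # stated aggregate ingest
HOT_RETENTION_DAYS = 7         # stated hot retention
BYTES_PER_SAMPLE = 2.0         # ASSUMPTION: average compressed bytes per sample
SECONDS_PER_DAY = 86_400


def hot_tier_samples(rate: int = SAMPLES_PER_SEC,
                     days: int = HOT_RETENTION_DAYS) -> int:
    """Total samples held in the hot tier at steady state."""
    return rate * SECONDS_PER_DAY * days


def hot_tier_bytes(bytes_per_sample: float = BYTES_PER_SAMPLE) -> float:
    """Raw hot-tier footprint for one copy of the data (no replication)."""
    return hot_tier_samples() * bytes_per_sample


def downtime_budget_minutes(availability: float = 0.999,
                            days: int = 30) -> float:
    """Allowed downtime over a window for a given availability target."""
    return (1 - availability) * days * 24 * 60


if __name__ == "__main__":
    print(f"hot samples:  {hot_tier_samples():.3e}")
    print(f"hot bytes:    {hot_tier_bytes() / 1e12:.1f} TB per replica")
    print(f"downtime/30d: {downtime_budget_minutes():.1f} min")
```

Under the assumed 2 bytes per sample, the hot tier holds about 6 trillion samples, or roughly 12 TB per copy before replication and index overhead, and a 99.9% target leaves about 43 minutes of downtime per 30 days. Changing the assumed sample size scales the storage figure linearly.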

Requirements

Ingestion
- Collect numeric metrics (counter/gauge/histogram) with labels/tags.
- Support pull (scraping endpoints) and push ingestion.
- Handle high throughput and provide backpressure.

Storage
- Efficient time‑series storage with compression.
- Retention tiers: hot vs. cold storage; support downsampling.

Query
- Provide a query language for aggregations, label filtering, and downsampling.
- Support federated queries across hot/cold tiers.

Alerting
- Threshold and SLO‑based alerts; silencing, deduplication, routing.

Operations
- High availability, horizontal scalability, and multi‑tenant isolation.
- Control cardinality growth; enforce quotas and rate limits.
- Expose dashboards and APIs.

Architecture Deep Dives
- Discuss data model, sharding, indexing.
- Detail write/read paths, failure handling, and consistency choices.
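
To ground the data model and sharding items above, here is one possible starting point, offered as a sketch rather than a prescribed answer. It assumes a series is identified by tenant, metric name, and its label set, and that writes are routed by a stable hash of that identity. `SeriesKey`, `make_key`, `series_id`, and `shard_for` are hypothetical names introduced for illustration.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class SeriesKey:
    """Identity of one time series (ASSUMED model: tenant + name + labels)."""
    tenant: str
    name: str
    labels: tuple  # sorted (label, value) pairs, so label order is irrelevant


def make_key(tenant: str, name: str, labels: dict[str, str]) -> SeriesKey:
    """Canonicalize labels by sorting them before building the key."""
    return SeriesKey(tenant, name, tuple(sorted(labels.items())))


def series_id(key: SeriesKey) -> int:
    """Stable 64-bit ID from a canonical encoding of the key.

    Fields are joined with NUL bytes; this sketch assumes NUL never
    appears in tenant names, metric names, or labels.
    """
    parts = [key.tenant, key.name]
    for label, value in key.labels:
        parts.extend((label, value))
    digest = hashlib.sha256("\x00".join(parts).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def shard_for(key: SeriesKey, num_shards: int) -> int:
    """Route a series to a shard by simple modulo hashing."""
    return series_id(key) % num_shards


if __name__ == "__main__":
    key = make_key("team-a", "http_requests_total",
                   {"method": "GET", "code": "200"})
    print(series_id(key), shard_for(key, 16))
```

Including the tenant in the key keeps identical metric names from different tenants apart. Modulo hashing is the simplest routing choice but moves most series when `num_shards` changes; consistent hashing is a common alternative worth discussing as a trade-off.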

Deliverables

- End‑to‑end architecture proposal with components and data flow.
- Rationale and trade‑offs for key design choices.
- Guardrails for cardinality, quotas, and backpressure.
- Failure scenarios and recovery strategies.
- API surface and operability plan (dashboards, SLOs).
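
As one way to picture the guardrails deliverable, the sketch below combines a per-tenant token bucket (rate quota) with a cap on distinct active series (cardinality limit). It is a minimal single-process illustration under assumed semantics, not the expected solution; `TenantGuard`, its fields, and the returned status strings are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TenantGuard:
    """Per-tenant admission control (hypothetical sketch)."""
    rate: float        # tokens (samples) refilled per second
    burst: float       # bucket capacity
    max_series: int    # cap on distinct series for this tenant
    tokens: float = field(init=False, default=0.0)
    last: float = 0.0  # timestamp of the previous admit() call
    seen: set = field(default_factory=set)

    def __post_init__(self) -> None:
        self.tokens = self.burst  # start with a full bucket

    def admit(self, series_id: int, now: float) -> str:
        """Decide whether one sample is accepted at time `now` (seconds)."""
        # Refill the bucket for the time elapsed, capped at the burst size.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        # Reject samples that would create a new series past the cap.
        if series_id not in self.seen and len(self.seen) >= self.max_series:
            return "reject_cardinality"
        # Reject when the tenant has exhausted its rate quota.
        if self.tokens < 1.0:
            return "reject_rate"
        self.tokens -= 1.0
        self.seen.add(series_id)
        return "ok"


if __name__ == "__main__":
    guard = TenantGuard(rate=1.0, burst=2.0, max_series=2)
    for sid, t in [(1, 0.0), (2, 0.0), (1, 0.0), (3, 5.0), (1, 5.0)]:
        print(sid, t, guard.admit(sid, t))
```

A push endpoint could surface `reject_rate` to clients as backpressure (for example an HTTP 429 response), while `reject_cardinality` points at a labeling problem the tenant needs to fix. A production version would also need to expire idle series and share state across ingest nodes, both of which this sketch omits.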
