Design a Metrics Monitoring System for Large-Scale Services
Context
You are designing a metrics monitoring system for large-scale, cloud-native microservices running across multiple regions and clusters. Services are ephemeral (containers/autoscaling), and the platform is multi-tenant (infra teams, ML/feature teams, product services). Assume on the order of tens of thousands of hosts and hundreds of thousands of service instances, with strict SLOs for data freshness and alerting.
Requirements
- Compare push vs. pull metrics collection models:
  - When to choose each.
  - Impacts on reliability, backpressure, service discovery, network usage, and failure isolation.
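As one way to ground the comparison, here is a minimal sketch of the pull model's failure-isolation property: the collector owns the schedule, so a slow or dead target cannot backpressure the pipeline — the scrape simply times out and that target alone is recorded as down. The endpoint path and timeout are illustrative assumptions, not a prescribed API.

```python
import urllib.request

def scrape_once(targets, timeout_s=2.0):
    """Pull-model sketch: fetch each target's /metrics endpoint.

    Failures are isolated per target: an unreachable instance yields
    None for that target and does not block the rest of the scrape.
    """
    results = {}
    for target in targets:
        try:
            with urllib.request.urlopen(f"http://{target}/metrics",
                                        timeout=timeout_s) as resp:
                results[target] = resp.read().decode()
        except Exception:
            results[target] = None  # mark target down; no inline retries
    return results
```

A push model inverts this: clients control timing and retries, which suits short-lived jobs and restrictive networks, but shifts backpressure and overload protection onto the ingestion tier.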
- Describe the end-to-end architecture:
  - Client libraries/agents (e.g., SDK or node agent/sidecar).
  - Ingestion layer (APIs, gateways), queueing, and streaming aggregation.
  - Time-series storage, query layer, alerting, dashboards, and SLOs.
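The contract between ingestion and the query layer can be sketched with an assumed sample schema: samples are keyed by a stable series identity (metric name plus sorted label pairs) and appended in time order per series. This in-memory toy stands in for the real time-series store; all names here are hypothetical.

```python
import bisect
import collections
from typing import NamedTuple

class Sample(NamedTuple):
    name: str
    labels: tuple   # sorted (key, value) pairs for a stable series identity
    ts_ms: int
    value: float

class InMemoryTSDB:
    """Toy write/query path: per-series sorted lists of (ts, value)."""

    def __init__(self):
        self._series = collections.defaultdict(list)

    def write(self, s: Sample):
        sid = (s.name, s.labels)
        # insort tolerates mildly out-of-order arrivals from the queue
        bisect.insort(self._series[sid], (s.ts_ms, s.value))

    def query_range(self, name, labels, start_ms, end_ms):
        sid = (name, tuple(sorted(labels.items())))
        pts = self._series.get(sid, [])
        lo = bisect.bisect_left(pts, (start_ms, float("-inf")))
        hi = bisect.bisect_right(pts, (end_ms, float("inf")))
        return pts[lo:hi]
```

In a real system the same schema flows through the gateway and queue, and the query layer adds label-index lookups, distributed fan-out, and caching on top of this per-series range scan.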
- Propose aggregation and rollup strategies at each layer:
  - Client-side, agent-side, stream processors, storage-side.
  - Handling high-cardinality labels, downsampling, late/out-of-order data, retention policies, and backfill.
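A common storage-side rollup, sketched here under assumptions (bucket width and the retained statistics are illustrative): collapse raw points into fixed-width time buckets keeping min/max/sum/count, so coarser retention tiers can still answer avg, min, and max queries without the raw data.

```python
def downsample(points, bucket_ms):
    """Roll raw (ts_ms, value) points into fixed-width buckets.

    Keeps min/max/sum/count per bucket; input need not be sorted,
    which also absorbs late/out-of-order arrivals within the window.
    """
    buckets = {}
    for ts, v in points:
        b = ts - ts % bucket_ms  # align to bucket start
        st = buckets.setdefault(b, {"min": v, "max": v, "sum": 0.0, "count": 0})
        st["min"] = min(st["min"], v)
        st["max"] = max(st["max"], v)
        st["sum"] += v
        st["count"] += 1
    return dict(sorted(buckets.items()))
```

The same shape works in a stream processor (windowed aggregation before storage) or agent-side (pre-aggregating per-host before ship), trading query fidelity for write and storage cost.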
- Provide a capacity plan, sharding and replication strategy, and multi-tenant isolation.
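A back-of-envelope capacity calculation is expected here; the inputs below (instance count, series per instance, scrape interval, compressed bytes per sample) are purely illustrative assumptions consistent with the stated scale, not requirements.

```python
# Illustrative assumptions -- replace with your own measured numbers.
instances = 100_000            # service instances (order of magnitude from Context)
series_per_instance = 200      # assumed active series per instance
scrape_interval_s = 15         # assumed collection interval
bytes_per_sample = 2           # assumed average after timestamp/value compression

active_series = instances * series_per_instance        # total active series
samples_per_s = active_series / scrape_interval_s      # ingest rate
raw_bytes_per_day = samples_per_s * bytes_per_sample * 86_400

print(f"{active_series:,} series, {samples_per_s:,.0f} samples/s, "
      f"{raw_bytes_per_day / 1e9:.0f} GB/day compressed")
```

Numbers like these then drive the sharding decision (e.g., hash series identity across ingest shards, replicate each shard for availability) and per-tenant quotas on series count and sample rate.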
- Explain how you would test and monitor the monitoring system itself.
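One concrete self-monitoring technique worth describing is an end-to-end synthetic probe: write a heartbeat metric through the full pipeline and alert if it cannot be read back within a freshness SLO. The metric name, SLO threshold, and callback interfaces below are assumptions for illustration.

```python
import time

def check_freshness(write_fn, read_fn, slo_s=60):
    """Synthetic end-to-end probe for the monitoring pipeline itself.

    write_fn(name, value) pushes through the normal ingest path;
    read_fn(name) returns the most recent stored value (a timestamp).
    Returns False when read-back lag exceeds the freshness SLO,
    which should page the on-call via an independent channel.
    """
    write_fn("meta.heartbeat", time.time())
    latest = read_fn("meta.heartbeat")
    lag = time.time() - (latest or 0)
    return lag <= slo_s
```

Because this probe exercises ingest, storage, and query together, it catches whole-pipeline failures that per-component metrics miss; the alert on it must route outside the system under test.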
Make minimal, explicit assumptions as needed and call out trade-offs and guardrails.