How do I approach System Design interview questions?

System Design questions require understanding of core concepts and practice. PracHub provides solutions with explanations to help you master system design interviews.

What difficulty level is this interview question?

This is a medium difficulty System Design question, commonly asked during Onsite rounds at Microsoft.

What role is this question designed for?

This question is commonly asked for Software Engineer candidates at Microsoft during technical interviews.

Design a Metrics Ingestion Pipeline | Microsoft Interview Question

Q: Design a Metrics Ingestion Pipeline

This question evaluates a candidate's ability to design a large-scale distributed system for ingesting time-series metrics at high throughput. It tests reasoning about write/read path separation, storage partitioning, label cardinality control, and graceful degradation under failure, commonly used to assess senior-to-principal-level system design skill.

Design a metrics ingestion pipeline for a large engineering organization. Thousands of services running across tens of thousands of hosts emit operational metrics (counters, gauges, and histograms with labels/tags). The pipeline must reliably collect these metrics, aggregate them, store them as time series, and serve them to dashboards and an alerting system. Pay particular attention to how the pipeline behaves under failure.

Constraints & Assumptions

Roughly 50,000 hosts and thousands of services; peak ingest around 5 million data points per second.
Around 10 million active time series, each identified by a metric name plus a set of labels.
Metric types: monotonic counters, point-in-time gauges, and histograms/summaries.
Dashboard queries over recent data should return in well under a second; alerting needs fresh data within seconds.
Retention of about 13 months with progressively coarser resolution (downsampling) for older data.
High availability is required; bounded, well-understood data loss during incidents is acceptable, but the pipeline must never take down the services that emit metrics.

Clarifying Questions to Ask

Push (agents send to the pipeline) or pull (the pipeline scrapes targets)? Are we constrained to an existing agent on each host?
What end-to-end latency is acceptable: near-real-time for alerting, or are minutes fine for most metrics?
When the pipeline is overloaded, what is the preferred posture: drop data, buffer it, or apply backpressure to producers?
What are the expected label-cardinality limits, and who owns/controls the label sets that services emit?
What raw versus rolled-up resolutions and retention windows are required?
Single region or multi-region? Multi-tenant (must one team's metrics be isolated from another's)?

Part 1 — Requirements and high-level architecture

Lay out the functional and non-functional requirements and sketch the end-to-end architecture from the emitting service to a rendered dashboard.

What This Part Should Cover Premium

Part 2 — Ingestion and storage at scale

Detail the write path and storage layout. How do you keep up with 5M points/sec, and how is the data physically stored for fast time-range queries?

What This Part Should Cover Premium

Part 3 — Reliability and failure modes

Walk through what happens when components fail or get overloaded, and how the pipeline degrades gracefully.

What This Part Should Cover Premium

What a Strong Answer Covers Premium

Follow-up Questions

Where could ML/AI genuinely help operate this pipeline (for example anomaly detection on metric streams, auto-tuned alert thresholds, or predictive autoscaling of the ingestion tier), and what are the failure and trust risks of leaning on it?
A bad deploy starts emitting a label whose value is a per-request unique ID, exploding cardinality 100x. How do you detect and contain it quickly without dropping the healthy metrics?
How do you make counter semantics correct across agent restarts, network retries, and duplicate deliveries (counter resets, at-least-once delivery)?
How do you serve a dashboard query that spans the full 13-month retention window quickly?

Constraints & Assumptions

Roughly 50,000 hosts and thousands of services; peak ingest around 5 million data points per second.
Around 10 million active time series, each identified by a metric name plus a set of labels.
Metric types: monotonic counters, point-in-time gauges, and histograms/summaries.
Dashboard queries over recent data should return in well under a second; alerting needs fresh data within seconds.
Retention of about 13 months with progressively coarser resolution (downsampling) for older data.
High availability is required; bounded, well-understood data loss during incidents is acceptable, but the pipeline must never take down the services that emit metrics.

Clarifying Questions to Ask

Push (agents send to the pipeline) or pull (the pipeline scrapes targets)? Are we constrained to an existing agent on each host?
What end-to-end latency is acceptable: near-real-time for alerting, or are minutes fine for most metrics?
When the pipeline is overloaded, what is the preferred posture: drop data, buffer it, or apply backpressure to producers?
What are the expected label-cardinality limits, and who owns/controls the label sets that services emit?
What raw versus rolled-up resolutions and retention windows are required?
Single region or multi-region? Multi-tenant (must one team's metrics be isolated from another's)?

Part 1 — Requirements and high-level architecture

Lay out the functional and non-functional requirements and sketch the end-to-end architecture from the emitting service to a rendered dashboard.

What This Part Should Cover Premium

Part 2 — Ingestion and storage at scale

Detail the write path and storage layout. How do you keep up with 5M points/sec, and how is the data physically stored for fast time-range queries?

What This Part Should Cover Premium

Part 3 — Reliability and failure modes

Walk through what happens when components fail or get overloaded, and how the pipeline degrades gracefully.

What This Part Should Cover Premium

What a Strong Answer Covers Premium

Follow-up Questions

Where could ML/AI genuinely help operate this pipeline (for example anomaly detection on metric streams, auto-tuned alert thresholds, or predictive autoscaling of the ingestion tier), and what are the failure and trust risks of leaning on it?
A bad deploy starts emitting a label whose value is a per-request unique ID, exploding cardinality 100x. How do you detect and contain it quickly without dropping the healthy metrics?
How do you make counter semantics correct across agent restarts, network retries, and duplicate deliveries (counter resets, at-least-once delivery)?
How do you serve a dashboard query that spans the full 13-month retention window quickly?

Design a Metrics Ingestion Pipeline

Quick Overview

Design a Metrics Ingestion Pipeline

Constraints & Assumptions

Clarifying Questions to Ask

Part 1 — Requirements and high-level architecture

What This Part Should Cover Premium

Part 2 — Ingestion and storage at scale

What This Part Should Cover Premium

Part 3 — Reliability and failure modes

What This Part Should Cover Premium

What a Strong Answer Covers Premium

Follow-up Questions

Submit Your Answer to Earn 20XP

Design a Metrics Ingestion Pipeline

Quick Overview

Design a Metrics Ingestion Pipeline

Constraints & Assumptions

Clarifying Questions to Ask

Part 1 — Requirements and high-level architecture

What This Part Should Cover Premium

Part 2 — Ingestion and storage at scale

What This Part Should Cover Premium

Part 3 — Reliability and failure modes

What This Part Should Cover Premium

What a Strong Answer Covers Premium

Follow-up Questions

Submit Your Answer to Earn 20XP