Design a Metrics Ingestion Pipeline
Company: Microsoft
Role: Software Engineer
Category: System Design
Difficulty: medium
Interview Round: Onsite
Design a metrics ingestion pipeline for a large engineering organization. Thousands of services running across tens of thousands of hosts emit operational metrics (counters, gauges, and histograms with labels/tags). The pipeline must reliably collect these metrics, aggregate them, store them as time series, and serve them to dashboards and an alerting system. Pay particular attention to how the pipeline behaves under failure.
### Constraints & Assumptions
- Roughly 50,000 hosts and thousands of services; peak ingest around 5 million data points per second.
- Around 10 million active time series, each identified by a metric name plus a set of labels.
- Metric types: monotonic counters, point-in-time gauges, and histograms/summaries.
- Dashboard queries over recent data should return in well under a second; alerting needs fresh data within seconds.
- Retention of about 13 months with progressively coarser resolution (downsampling) for older data.
- High availability is required; bounded, well-understood data loss during incidents is acceptable, but the pipeline must never take down the services that emit metrics.
### Clarifying Questions to Ask
- Push (agents send to the pipeline) or pull (the pipeline scrapes targets)? Are we constrained to an existing agent on each host?
- What end-to-end latency is acceptable: near-real-time for alerting, or are minutes fine for most metrics?
- When the pipeline is overloaded, what is the preferred posture: drop data, buffer it, or apply backpressure to producers?
- What are the expected label-cardinality limits, and who owns/controls the label sets that services emit?
- What raw versus rolled-up resolutions and retention windows are required?
- Single region or multi-region? Multi-tenant (must one team's metrics be isolated from another's)?
### Part 1 — Requirements and high-level architecture
Lay out the functional and non-functional requirements and sketch the end-to-end architecture from the emitting service to a rendered dashboard.
```hint Separate the paths
Treat this as a streaming system and split the high-throughput write path (ingest, aggregate, store) from the latency-sensitive read path (query, alert). A durable buffer in the middle decouples producers from consumers.
```
#### What This Part Should Cover
```premium-lock What This Part Should Cover
```
### Part 2 — Ingestion and storage at scale
Detail the write path and storage layout. How do you keep up with 5M points/sec, and how is the data physically stored for fast time-range queries?
```hint Partition by series
Route each time series consistently to the same partition/shard (hash of the series key) so aggregation and storage for a series are local and ordered. Pre-aggregate at the edge (the agent) to cut volume before it ever hits the gateway.
```
```hint The silent killer
Storage and query cost scale with the number of distinct time series, not just data points. Label cardinality is what blows up; treat it as a first-class constraint with limits and monitoring.
```
#### What This Part Should Cover
```premium-lock What This Part Should Cover
```
### Part 3 — Reliability and failure modes
Walk through what happens when components fail or get overloaded, and how the pipeline degrades gracefully.
```hint Pick a posture per stage
At each stage decide: drop, buffer, or backpressure. Agents buffer locally when the gateway is unreachable; the durable queue absorbs downstream outages; the query layer sheds load rather than collapsing. Make writes idempotent so replay after a failure is safe.
```
#### What This Part Should Cover
```premium-lock What This Part Should Cover
```
### What a Strong Answer Covers
```premium-lock What a Strong Answer Covers
```
### Follow-up Questions
- Where could ML/AI genuinely help operate this pipeline (for example anomaly detection on metric streams, auto-tuned alert thresholds, or predictive autoscaling of the ingestion tier), and what are the failure and trust risks of leaning on it?
- A bad deploy starts emitting a label whose value is a per-request unique ID, exploding cardinality 100x. How do you detect and contain it quickly without dropping the healthy metrics?
- How do you make counter semantics correct across agent restarts, network retries, and duplicate deliveries (counter resets, at-least-once delivery)?
- How do you serve a dashboard query that spans the full 13-month retention window quickly?
Quick Answer: This question evaluates a candidate's ability to design a large-scale distributed system for ingesting time-series metrics at high throughput. It tests reasoning about write/read path separation, storage partitioning, label cardinality control, and graceful degradation under failure, commonly used to assess senior-to-principal-level system design skill.