PracHub
QuestionsCoachesLearningGuidesInterview Prep
|Home/System Design/Microsoft

Design a Metrics Ingestion Pipeline

Last updated: Jul 1, 2026

Quick Overview

This question evaluates a candidate's ability to design a large-scale distributed system for ingesting time-series metrics at high throughput. It tests reasoning about write/read path separation, storage partitioning, label cardinality control, and graceful degradation under failure, commonly used to assess senior-to-principal-level system design skill.

  • medium
  • Microsoft
  • System Design
  • Software Engineer

Design a Metrics Ingestion Pipeline

Company: Microsoft

Role: Software Engineer

Category: System Design

Difficulty: medium

Interview Round: Onsite

Design a metrics ingestion pipeline for a large engineering organization. Thousands of services running across tens of thousands of hosts emit operational metrics (counters, gauges, and histograms with labels/tags). The pipeline must reliably collect these metrics, aggregate them, store them as time series, and serve them to dashboards and an alerting system. Pay particular attention to how the pipeline behaves under failure. ### Constraints & Assumptions - Roughly 50,000 hosts and thousands of services; peak ingest around 5 million data points per second. - Around 10 million active time series, each identified by a metric name plus a set of labels. - Metric types: monotonic counters, point-in-time gauges, and histograms/summaries. - Dashboard queries over recent data should return in well under a second; alerting needs fresh data within seconds. - Retention of about 13 months with progressively coarser resolution (downsampling) for older data. - High availability is required; bounded, well-understood data loss during incidents is acceptable, but the pipeline must never take down the services that emit metrics. ### Clarifying Questions to Ask - Push (agents send to the pipeline) or pull (the pipeline scrapes targets)? Are we constrained to an existing agent on each host? - What end-to-end latency is acceptable: near-real-time for alerting, or are minutes fine for most metrics? - When the pipeline is overloaded, what is the preferred posture: drop data, buffer it, or apply backpressure to producers? - What are the expected label-cardinality limits, and who owns/controls the label sets that services emit? - What raw versus rolled-up resolutions and retention windows are required? - Single region or multi-region? Multi-tenant (must one team's metrics be isolated from another's)? ### Part 1 — Requirements and high-level architecture Lay out the functional and non-functional requirements and sketch the end-to-end architecture from the emitting service to a rendered dashboard. ```hint Separate the paths Treat this as a streaming system and split the high-throughput write path (ingest, aggregate, store) from the latency-sensitive read path (query, alert). A durable buffer in the middle decouples producers from consumers. ``` #### What This Part Should Cover ```premium-lock What This Part Should Cover ``` ### Part 2 — Ingestion and storage at scale Detail the write path and storage layout. How do you keep up with 5M points/sec, and how is the data physically stored for fast time-range queries? ```hint Partition by series Route each time series consistently to the same partition/shard (hash of the series key) so aggregation and storage for a series are local and ordered. Pre-aggregate at the edge (the agent) to cut volume before it ever hits the gateway. ``` ```hint The silent killer Storage and query cost scale with the number of distinct time series, not just data points. Label cardinality is what blows up; treat it as a first-class constraint with limits and monitoring. ``` #### What This Part Should Cover ```premium-lock What This Part Should Cover ``` ### Part 3 — Reliability and failure modes Walk through what happens when components fail or get overloaded, and how the pipeline degrades gracefully. ```hint Pick a posture per stage At each stage decide: drop, buffer, or backpressure. Agents buffer locally when the gateway is unreachable; the durable queue absorbs downstream outages; the query layer sheds load rather than collapsing. Make writes idempotent so replay after a failure is safe. ``` #### What This Part Should Cover ```premium-lock What This Part Should Cover ``` ### What a Strong Answer Covers ```premium-lock What a Strong Answer Covers ``` ### Follow-up Questions - Where could ML/AI genuinely help operate this pipeline (for example anomaly detection on metric streams, auto-tuned alert thresholds, or predictive autoscaling of the ingestion tier), and what are the failure and trust risks of leaning on it? - A bad deploy starts emitting a label whose value is a per-request unique ID, exploding cardinality 100x. How do you detect and contain it quickly without dropping the healthy metrics? - How do you make counter semantics correct across agent restarts, network retries, and duplicate deliveries (counter resets, at-least-once delivery)? - How do you serve a dashboard query that spans the full 13-month retention window quickly?

Quick Answer: This question evaluates a candidate's ability to design a large-scale distributed system for ingesting time-series metrics at high throughput. It tests reasoning about write/read path separation, storage partitioning, label cardinality control, and graceful degradation under failure, commonly used to assess senior-to-principal-level system design skill.

Related Interview Questions

  • Design a URL Shortener (High-Level and Low-Level Design) - Microsoft (medium)
  • Externally Sort a 500 GB CSV by One Column with 16 GB of RAM - Microsoft (medium)
  • Design a Distributed Key-Value Store - Microsoft (medium)
  • Design a To-Do List Service (CRUD, Auth, Rate Limiting, Caching & API Versioning) - Microsoft (medium)
  • Design A Scalable Web Crawler - Microsoft (medium)
|Home/System Design/Microsoft

Design a Metrics Ingestion Pipeline

Microsoft logo
Microsoft
Jun 11, 2026, 12:00 AM
mediumSoftware EngineerOnsiteSystem Design
0
0

Design a metrics ingestion pipeline for a large engineering organization. Thousands of services running across tens of thousands of hosts emit operational metrics (counters, gauges, and histograms with labels/tags). The pipeline must reliably collect these metrics, aggregate them, store them as time series, and serve them to dashboards and an alerting system. Pay particular attention to how the pipeline behaves under failure.

Constraints & Assumptions

  • Roughly 50,000 hosts and thousands of services; peak ingest around 5 million data points per second.
  • Around 10 million active time series, each identified by a metric name plus a set of labels.
  • Metric types: monotonic counters, point-in-time gauges, and histograms/summaries.
  • Dashboard queries over recent data should return in well under a second; alerting needs fresh data within seconds.
  • Retention of about 13 months with progressively coarser resolution (downsampling) for older data.
  • High availability is required; bounded, well-understood data loss during incidents is acceptable, but the pipeline must never take down the services that emit metrics.

Clarifying Questions to Ask

  • Push (agents send to the pipeline) or pull (the pipeline scrapes targets)? Are we constrained to an existing agent on each host?
  • What end-to-end latency is acceptable: near-real-time for alerting, or are minutes fine for most metrics?
  • When the pipeline is overloaded, what is the preferred posture: drop data, buffer it, or apply backpressure to producers?
  • What are the expected label-cardinality limits, and who owns/controls the label sets that services emit?
  • What raw versus rolled-up resolutions and retention windows are required?
  • Single region or multi-region? Multi-tenant (must one team's metrics be isolated from another's)?

Part 1 — Requirements and high-level architecture

Lay out the functional and non-functional requirements and sketch the end-to-end architecture from the emitting service to a rendered dashboard.

What This Part Should Cover Premium

Part 2 — Ingestion and storage at scale

Detail the write path and storage layout. How do you keep up with 5M points/sec, and how is the data physically stored for fast time-range queries?

What This Part Should Cover Premium

Part 3 — Reliability and failure modes

Walk through what happens when components fail or get overloaded, and how the pipeline degrades gracefully.

What This Part Should Cover Premium

What a Strong Answer Covers Premium

Follow-up Questions

  • Where could ML/AI genuinely help operate this pipeline (for example anomaly detection on metric streams, auto-tuned alert thresholds, or predictive autoscaling of the ingestion tier), and what are the failure and trust risks of leaning on it?
  • A bad deploy starts emitting a label whose value is a per-request unique ID, exploding cardinality 100x. How do you detect and contain it quickly without dropping the healthy metrics?
  • How do you make counter semantics correct across agent restarts, network retries, and duplicate deliveries (counter resets, at-least-once delivery)?
  • How do you serve a dashboard query that spans the full 13-month retention window quickly?

Submit Your Answer to Earn 20XP

Sign in to leave a comment

Loading comments...

Browse More Questions

More System Design•More Microsoft•More Software Engineer•Microsoft Software Engineer•Microsoft System Design•Software Engineer System Design

Your design canvas — auto-saved

PracHub

Master your tech interviews with 8,000+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • AI Coding Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.