PracHub

Design a scalable metrics monitoring system

Last updated: Mar 29, 2026

Quick Overview

This question evaluates how you handle requirements, scale assumptions, API and data design, architecture, trade-offs, failure modes, and rollout in a realistic interview setting. A strong answer states its assumptions, handles edge cases, explains trade-offs, and shows how the result would be validated.

Company: LinkedIn

Role: Machine Learning Engineer

Category: System Design

Difficulty: hard

Interview Round: Technical Screen

Design a metrics monitoring system for large-scale services. Compare push vs pull collection models—when to choose each, and their impacts on reliability, backpressure, service discovery, network usage, and failure isolation. Describe the end-to-end architecture: client libraries/agents, ingestion, queueing, streaming aggregation, storage in a time-series database, alerting, dashboards, and SLOs. Propose aggregation and rollup strategies (client-side, agent-side, stream, storage-side), handling of high-cardinality labels, downsampling, late/out-of-order data, retention policies, and backfill. Provide a capacity plan, sharding and replication strategy, and multi-tenant isolation. Explain how you would test and monitor the system itself.

Jul 16, 2025, 12:00 AM

Design a Metrics Monitoring System for Large-Scale Services

Context

You are designing a metrics monitoring system for large-scale, cloud-native microservices running across multiple regions and clusters. Services are ephemeral (containers/autoscaling), and the platform is multi-tenant (infra teams, ML/feature teams, product services). Assume on the order of tens of thousands of hosts and hundreds of thousands of service instances, with strict SLOs for data freshness and alerting.
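The scale stated above can be turned into a rough ingest and storage estimate. The sketch below is back-of-envelope arithmetic only: the instance count is picked from the "hundreds of thousands" range, and the series per instance, scrape interval, compressed bytes per sample, replication factor, and retention are all illustrative assumptions, not figures from the prompt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    # Every default below is an assumption chosen for illustration.
    instances: int = 200_000          # "hundreds of thousands" of instances
    series_per_instance: int = 500    # assumed active series per instance
    interval_s: int = 10              # assumed collection interval
    bytes_per_sample: float = 2.0     # assumed size after compression
    replication_factor: int = 3       # assumed number of replicas
    raw_retention_days: int = 15      # assumed raw-resolution retention


def estimate(w: Workload) -> dict:
    """Return rough ingest and storage numbers for a workload."""
    active_series = w.instances * w.series_per_instance
    samples_per_s = active_series / w.interval_s
    bytes_per_day = samples_per_s * 86_400 * w.bytes_per_sample
    stored_bytes = bytes_per_day * w.raw_retention_days * w.replication_factor
    return {
        "active_series": active_series,
        "samples_per_s": samples_per_s,
        "raw_tb_per_day": bytes_per_day / 1e12,
        "raw_tb_stored": stored_bytes / 1e12,
    }


if __name__ == "__main__":
    for name, value in estimate(Workload()).items():
        print(f"{name}: {value:,.2f}")
```

Changing one input at a time (for example, series per instance) shows which assumption dominates the plan, which is usually more useful in an interview than the absolute numbers.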

Requirements

  1. Compare push vs. pull metrics collection models:
    • When to choose each.
    • Impacts on reliability, backpressure, service discovery, network usage, and failure isolation.
  2. Describe the end-to-end architecture:
    • Client libraries/agents (e.g., SDK or node agent/sidecar).
    • Ingestion layer (APIs, gateways), queueing, and streaming aggregation.
    • Time-series storage, query layer, alerting, dashboards, and SLOs.
  3. Propose aggregation and rollup strategies at each layer:
    • Client-side, agent-side, stream processors, storage-side.
    • Handling high-cardinality labels, downsampling, late/out-of-order data, retention policies, and backfill.
  4. Provide a capacity plan, sharding and replication strategy, and multi-tenant isolation.
  5. Explain how you would test and monitor the monitoring system itself.
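One possible way to approach the sharding and replication part of requirement 4 is rendezvous (highest-random-weight) hashing over a canonical series key. This is a minimal sketch under stated assumptions, not the expected answer: the node names, the key layout (tenant, metric, sorted labels), and the replication factor of 3 are all hypothetical.

```python
import hashlib


def series_key(tenant: str, metric: str, labels: dict) -> str:
    """Build a canonical key; labels are sorted so ordering never matters."""
    parts = [tenant, metric] + [f"{k}={v}" for k, v in sorted(labels.items())]
    return "\x00".join(parts)


def _score(node: str, key: str) -> int:
    """Deterministic pseudo-random weight for a (node, key) pair."""
    digest = hashlib.sha256(f"{node}\x00{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def replicas_for(key: str, nodes: list, replication_factor: int = 3) -> list:
    """Pick the highest-scoring nodes for this key (rendezvous hashing)."""
    ranked = sorted(nodes, key=lambda n: _score(n, key), reverse=True)
    return ranked[:replication_factor]


if __name__ == "__main__":
    nodes = [f"tsdb-{i}" for i in range(8)]  # hypothetical storage nodes
    key = series_key("ml-platform", "http_requests_total",
                     {"region": "us-west", "status": "200"})
    print(replicas_for(key, nodes))
```

A property worth calling out: removing a node that does not hold a given series leaves that series' placement unchanged, so a node failure moves only the series it owned. Including the tenant in the key spreads each tenant across nodes; isolating tenants onto dedicated node pools would be a different trade-off.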

Make minimal, explicit assumptions as needed and call out trade-offs and guardrails.
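As one example of stating an assumption and a guardrail together, the sketch below shows a stream-side rollup that tolerates out-of-order samples up to a fixed lateness bound and counts what it drops. The class and parameter names are hypothetical, and the 60-second window and 120-second allowed lateness are assumptions for illustration; a real stream processor would also need persistence and per-partition watermarks.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Rollup:
    count: int = 0
    total: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    def add(self, value: float) -> None:
        self.count += 1
        self.total += value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)


class WindowedAggregator:
    """Tumbling-window rollups with a watermark and bounded lateness."""

    def __init__(self, window_s: int = 60, allowed_lateness_s: int = 120):
        self.window_s = window_s
        self.allowed_lateness_s = allowed_lateness_s
        self.open = defaultdict(Rollup)  # (series, window_start) -> Rollup
        self.watermark = 0               # highest event timestamp seen
        self.dropped_late = 0            # guardrail: expose this as a metric

    def _is_closed(self, window_start: int) -> bool:
        deadline = window_start + self.window_s + self.allowed_lateness_s
        return deadline <= self.watermark

    def ingest(self, series: str, ts: int, value: float) -> bool:
        """Add a sample; return False if it arrived after its window closed."""
        self.watermark = max(self.watermark, ts)
        window_start = ts - ts % self.window_s
        if self._is_closed(window_start):
            self.dropped_late += 1
            return False
        self.open[(series, window_start)].add(value)
        return True

    def flush(self) -> dict:
        """Emit and forget every window whose lateness deadline has passed."""
        ready = [k for k in self.open if self._is_closed(k[1])]
        return {k: self.open.pop(k) for k in ready}
```

The trade-off is explicit in the two parameters: a larger lateness bound loses fewer samples but delays finalized rollups and holds more state in memory. Samples that arrive after the bound would have to go through a separate backfill path.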

Constraints & Assumptions

  • Preserve the scope, facts, inputs, and requested outputs from the prompt above.
  • If the prompt leaves a detail unspecified, state a reasonable assumption before relying on it.
  • Keep the answer interview-ready: concise enough to present, but concrete enough to implement or evaluate.

Clarifying Questions to Ask

  • Clarify users, core use cases, read/write patterns, scale, latency, availability, and data retention.
  • State explicit assumptions before making sizing or architecture decisions.
  • Prioritize the functional path first, then address reliability, security, observability, and rollout.

What a Strong Answer Covers

  • A scoped requirements summary with concrete non-goals and success metrics.
  • API, data model, architecture, consistency, capacity, and operations.
  • Reasoned trade-offs between simple and scalable designs, including bottlenecks and failure modes.
  • A validation, monitoring, migration, and launch plan appropriate for the risk level.

Follow-up Questions

  • What breaks first at 10x traffic or data volume?
  • How would you degrade gracefully during dependency failures?
  • What metrics and alerts would prove the design is healthy after launch?
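For the last follow-up, one common way to express "healthy after launch" is an SLO burn rate on the monitoring pipeline itself, for example on the fraction of samples that miss a freshness target. The sketch below is an assumption-laden illustration: the 99.9% target, the 14.4 paging threshold (a commonly used value for a one-hour window against a 30-day budget), and the function names are not part of the prompt.

```python
def burn_rate(bad: int, total: int, slo_target: float) -> float:
    """Observed error rate divided by the error budget (1 - SLO target)."""
    if total == 0:
        return 0.0
    error_budget = 1.0 - slo_target
    return (bad / total) / error_budget


def should_page(long_window: tuple, short_window: tuple,
                slo_target: float = 0.999, threshold: float = 14.4) -> bool:
    """Page only if both windows burn fast; each window is (bad, total).

    Requiring the short window as well lets the alert reset soon after
    the problem stops instead of firing for the whole long window.
    """
    return (burn_rate(*long_window, slo_target) >= threshold
            and burn_rate(*short_window, slo_target) >= threshold)


if __name__ == "__main__":
    # Hypothetical counts: (stale samples, total samples) per window.
    print(should_page(long_window=(20, 1000), short_window=(3, 100)))
```

A burn rate of 1.0 means the error budget would be exactly used up over the SLO period; values well above 1.0 on both windows indicate a fast, ongoing burn that is worth paging on.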
