
Design a scalable metrics monitoring system

Last updated: Mar 29, 2026

Quick Overview

This question evaluates the ability to design a scalable, multi-tenant metrics monitoring system, testing competencies in system architecture, ingestion and aggregation pipelines, time-series storage, alerting, capacity planning, and operational observability. It falls under the System Design domain and targets practical, architecture-level design rather than low-level coding. Interviewers commonly ask it to probe reasoning about trade-offs in cloud-native environments, including collection models, sharding and replication, high-cardinality handling, retention and backfill, and monitoring the monitoring system itself, while assessing judgment on reliability, latency, and tenant isolation.

  • hard
  • LinkedIn
  • System Design
  • Machine Learning Engineer


Company: LinkedIn

Role: Machine Learning Engineer

Category: System Design

Difficulty: hard

Interview Round: Technical Screen

Design a metrics monitoring system for large-scale services. Compare push vs pull collection models—when to choose each, and their impacts on reliability, backpressure, service discovery, network usage, and failure isolation. Describe the end-to-end architecture: client libraries/agents, ingestion, queueing, streaming aggregation, storage in a time-series database, alerting, dashboards, and SLOs. Propose aggregation and rollup strategies (client-side, agent-side, stream, storage-side), handling of high-cardinality labels, downsampling, late/out-of-order data, retention policies, and backfill. Provide a capacity plan, sharding and replication strategy, and multi-tenant isolation. Explain how you would test and monitor the system itself.
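The push model in the prompt is often paired with client-side batching and a bounded buffer, so that an unreachable collector degrades into dropped samples rather than unbounded memory growth in the application. A minimal sketch of such a client (class and method names are illustrative, not from any specific library):

```python
import threading
import time
from collections import deque

class PushMetricsClient:
    """Push-model client: buffers samples locally and flushes in batches.

    A bounded deque provides backpressure by dropping the oldest samples
    when the collector is slow or unreachable, so a failing backend cannot
    exhaust application memory (failure isolation).
    """

    def __init__(self, flush_interval_s=10.0, max_buffered=10_000):
        self.buffer = deque(maxlen=max_buffered)  # oldest dropped on overflow
        self.flush_interval_s = flush_interval_s
        self.lock = threading.Lock()

    def record(self, name, value, labels=None):
        """Append one sample; O(1) and safe to call from hot paths."""
        with self.lock:
            self.buffer.append((time.time(), name, value, labels or {}))

    def flush(self, send):
        """Drain the buffer and hand one batch to `send` (e.g. an HTTP POST).

        Returns the number of samples handed off.
        """
        with self.lock:
            batch = list(self.buffer)
            self.buffer = deque(maxlen=self.buffer.maxlen)
        if batch:
            send(batch)
        return len(batch)
```

A pull model inverts this: the collector scrapes each instance on its own schedule, which moves backpressure control and failure detection to the server side but requires service discovery to find ephemeral instances.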


Related Interview Questions

  • Review a Web Application Architecture - LinkedIn (easy)
  • Scale a Distributed Randomized Multiset - LinkedIn (medium)
  • Design a Top-K Ranking Service - LinkedIn (easy)
  • Design a Global Calendar Service - LinkedIn (medium)
  • Design a malicious-URL checking service using an isMalicious API - LinkedIn (medium)
Date: Jul 16, 2025

Design a Metrics Monitoring System for Large-Scale Services

Context

You are designing a metrics monitoring system for large-scale, cloud-native microservices running across multiple regions and clusters. Services are ephemeral (containers/autoscaling), and the platform is multi-tenant (infra teams, ML/feature teams, product services). Assume on the order of tens of thousands of hosts and hundreds of thousands of service instances, with strict SLOs for data freshness and alerting.
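The stated scale supports a quick back-of-envelope capacity estimate. The instance count comes from the context above; per-instance series count, scrape interval, and compressed sample size are assumptions chosen for illustration:

```python
# Back-of-envelope capacity estimate. Only the instance count comes from the
# prompt ("hundreds of thousands of service instances"); the rest are assumed.
instances = 300_000          # from the stated scale
series_per_instance = 200    # assumed active time series per instance
scrape_interval_s = 15       # assumed collection interval
bytes_per_sample = 2         # assumed average after delta/XOR-style compression

active_series = instances * series_per_instance
samples_per_sec = active_series / scrape_interval_s
ingest_bytes_per_day = samples_per_sec * bytes_per_sample * 86_400

print(f"active series:  {active_series:,}")                      # 60,000,000
print(f"samples/sec:    {samples_per_sec:,.0f}")                 # 4,000,000
print(f"raw ingest/day: {ingest_bytes_per_day / 1e12:.2f} TB")   # ~0.69 TB
```

Numbers like these drive the sharding plan: at roughly 60M active series, a per-node budget of a few million series implies on the order of tens of storage shards per region, before replication.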

Requirements

  1. Compare push vs. pull metrics collection models:
    • When to choose each.
    • Impacts on reliability, backpressure, service discovery, network usage, and failure isolation.
  2. Describe the end-to-end architecture:
    • Client libraries/agents (e.g., SDK or node agent/sidecar).
    • Ingestion layer (APIs, gateways), queueing, and streaming aggregation.
    • Time-series storage, query layer, alerting, dashboards, and SLOs.
  3. Propose aggregation and rollup strategies at each layer:
    • Client-side, agent-side, stream processors, storage-side.
    • Handling high-cardinality labels, downsampling, late/out-of-order data, retention policies, and backfill.
  4. Provide a capacity plan, sharding and replication strategy, and multi-tenant isolation.
  5. Explain how you would test and monitor the monitoring system itself.
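One common guardrail for the high-cardinality labels in requirement 3 is a per-metric series budget: once a metric exceeds its budget of distinct label sets, new label combinations collapse into a sentinel value so their samples still aggregate instead of creating unbounded series. A minimal sketch (class name, budget, and sentinel are illustrative):

```python
class CardinalityLimiter:
    """Caps distinct label sets per metric; overflow collapses to a sentinel.

    Keeps the ingestion path and index size bounded when a label such as
    user_id or request_id starts exploding in cardinality.
    """

    def __init__(self, max_series_per_metric=1000):
        self.max = max_series_per_metric
        self.seen = {}  # metric name -> set of known label tuples

    def resolve(self, metric, labels):
        """Return the labels to store, possibly collapsed to the sentinel."""
        key = tuple(sorted(labels.items()))
        known = self.seen.setdefault(metric, set())
        if key in known:
            return labels
        if len(known) < self.max:
            known.add(key)
            return labels
        # Over budget: replace values so overflow samples aggregate together.
        return {k: "__overflow__" for k in labels}
```

In practice the budget is enforced per tenant as well as per metric, and overflow events are themselves counted and alerted on.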

Make minimal, explicit assumptions as needed and call out trade-offs and guardrails.
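The downsampling and late-data points above can be sketched as a streaming rollup: samples land in fixed windows keyed by start time, late arrivals are accepted within a grace period behind the watermark, and anything older is counted as dropped so the loss is observable. Window size, grace period, and names here are illustrative assumptions:

```python
import math

class RollupAggregator:
    """Downsamples raw samples into fixed windows (sum/count, avg on read).

    A watermark tracks the newest timestamp seen; late samples are accepted
    while their window is within `grace_s` of it, and older ones increment a
    dropped-late counter rather than silently disappearing.
    """

    def __init__(self, window_s=60, grace_s=120):
        self.window_s = window_s
        self.grace_s = grace_s
        self.windows = {}       # window start -> [sum, count]
        self.watermark = 0.0
        self.dropped_late = 0

    def add(self, ts, value):
        """Fold one sample into its window; returns False if too late."""
        self.watermark = max(self.watermark, ts)
        start = math.floor(ts / self.window_s) * self.window_s
        if start + self.window_s + self.grace_s < self.watermark:
            self.dropped_late += 1  # window already sealed and rolled up
            return False
        acc = self.windows.setdefault(start, [0.0, 0])
        acc[0] += value
        acc[1] += 1
        return True

    def average(self, start):
        """Read back the rolled-up mean for one window, or None if empty."""
        total, count = self.windows.get(start, (0.0, 0))
        return total / count if count else None
```

Sealed windows would then be flushed to the time-series store as rollup points, with coarser tiers (e.g. 1m, 1h) built the same way for long retention.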

