Design a scalable metrics monitoring system
Company: LinkedIn
Role: Machine Learning Engineer
Category: System Design
Difficulty: hard
Interview Round: Technical Screen
Design a metrics monitoring system for large-scale services. Compare push and pull collection models: when to choose each, and how each affects reliability, backpressure, service discovery, network usage, and failure isolation. Describe the end-to-end architecture: client libraries/agents, ingestion, queueing, streaming aggregation, storage in a time-series database, alerting, dashboards, and SLOs. Propose aggregation and rollup strategies (client-side, agent-side, stream, storage-side), and explain how you would handle high-cardinality labels, downsampling, late/out-of-order data, retention policies, and backfill. Provide a capacity plan, a sharding and replication strategy, and a multi-tenant isolation scheme. Explain how you would test and monitor the system itself.
Quick Answer: This question tests whether you can move from requirements and scale assumptions, through API and data design, to a concrete architecture, and then reason about its trade-offs, failure modes, and rollout. A strong answer states its assumptions up front, handles edge cases, explains each trade-off, and shows how the design would be validated.
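One way to make the "scale assumptions" and "capacity plan" parts concrete is a back-of-envelope calculation. The sketch below is illustrative only: every input (host count, series per host, scrape interval, bytes per compressed sample, retention, replication factor, per-shard ingest limit) is a hypothetical assumption chosen for round numbers, not a figure from the question, and the names `CapacityInputs` and `plan` are made up for this example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class CapacityInputs:
    """Hypothetical sizing inputs; replace with your own stated assumptions."""
    hosts: int                        # number of monitored hosts
    series_per_host: int              # active time series emitted per host
    scrape_interval_s: float          # seconds between samples of one series
    bytes_per_sample: float           # assumed on-disk size after compression
    retention_days: int               # how long raw samples are kept
    replication_factor: int           # copies of each sample stored
    shard_samples_per_s: float        # assumed ingest limit of one shard


def plan(inp: CapacityInputs) -> dict:
    """Derive ingest rate, storage, and shard count from the inputs."""
    active_series = inp.hosts * inp.series_per_host
    samples_per_s = active_series / inp.scrape_interval_s
    bytes_per_day = samples_per_s * 86_400 * inp.bytes_per_sample
    total_bytes = bytes_per_day * inp.retention_days * inp.replication_factor
    ingest_shards = math.ceil(samples_per_s / inp.shard_samples_per_s)
    return {
        "active_series": active_series,
        "samples_per_s": samples_per_s,
        "bytes_per_day": bytes_per_day,
        "total_bytes": total_bytes,
        "ingest_shards": ingest_shards,
    }


if __name__ == "__main__":
    result = plan(CapacityInputs(
        hosts=10_000,
        series_per_host=1_000,
        scrape_interval_s=10.0,
        bytes_per_sample=2.0,
        retention_days=30,
        replication_factor=3,
        shard_samples_per_s=200_000.0,
    ))
    for key, value in result.items():
        print(f"{key}: {value:,.0f}")
```

Under these assumed inputs the plan comes to 10 million active series, 1 million samples per second, about 172.8 GB of raw samples per day, and about 15.6 TB for 30 days at three replicas, spread over at least 5 ingest shards. In an interview, the useful part is less the numbers than showing which inputs dominate: halving the scrape interval or doubling label cardinality doubles every downstream figure.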