Design an exception monitoring system with Top-K
Company: LinkedIn
Role: Software Engineer
Category: System Design
Difficulty: medium
Interview Round: Onsite
## System Design: Exception Monitoring with Top-K
Design an **exception monitoring system** for a microservices environment.
### Core requirements
- Services emit exception events (message, stack trace, service name, environment, version, timestamp, severity, request context).
- The system should enable on-call engineers to:
  - View the **Top K exceptions** over a time window (e.g., last 5/15/60 minutes), grouped and deduplicated so that occurrences of the "same exception" count as one group.
  - Filter by service, environment (prod/staging), deployment version, and region.
  - Drill down into a group to see recent samples and aggregated stats.
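
One possible way to think about the "same exception" grouping and the Top K view is sketched below. It is a minimal in-memory illustration, not a prescribed answer: the `fingerprint` function, the `TopKWindow` class, the 60-second bucket size, and the choice to hash the exception type plus the top five stack frames (with line numbers stripped) are all assumptions made for this example.

```python
import hashlib
import re
from collections import Counter, defaultdict

BUCKET_SECONDS = 60  # assumed bucket granularity for windowed counts


def fingerprint(exc_type, stack_frames, top_n=5):
    """Group key: exception type plus top frames, ignoring line numbers."""
    # Strip a trailing ":<line>" so small code shifts do not split a group.
    normalized = [re.sub(r":\d+$", "", frame) for frame in stack_frames[:top_n]]
    key = "|".join([exc_type, *normalized])
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:16]


class TopKWindow:
    """Per-(service, env) counts in fixed time buckets (hypothetical helper)."""

    def __init__(self):
        # (service, env) -> bucket_start -> Counter of fingerprints
        self.buckets = defaultdict(lambda: defaultdict(Counter))

    def record(self, service, env, fp, ts):
        bucket_start = int(ts) // BUCKET_SECONDS * BUCKET_SECONDS
        self.buckets[(service, env)][bucket_start][fp] += 1

    def top_k(self, service, env, now, window_seconds, k):
        """Sum every bucket that overlaps [now - window_seconds, now]."""
        window_start = now - window_seconds
        totals = Counter()
        for bucket_start, counts in self.buckets.get((service, env), {}).items():
            if bucket_start + BUCKET_SECONDS > window_start and bucket_start <= now:
                totals.update(counts)
        return totals.most_common(k)
```

Because counts are kept per bucket, the window is only accurate to bucket granularity; a real design would also expire old buckets and might swap exact counters for an approximate heavy-hitters structure under very high cardinality.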
### Non-functional requirements
- High write throughput, low-latency queries for Top K.
- Handle duplicates, retries, bursts (incident storms).
- Retain raw data for debugging (e.g., 7–30 days) and aggregated metrics longer.
- Protect sensitive data in payloads.
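
To make the duplicate-handling and data-protection requirements concrete, here is a small sketch of two ingestion-side steps. It assumes each event carries a client-generated `event_id` (not stated in the prompt) so that retries can be recognized, and it uses two deliberately simple regular expressions for redaction; the names `scrub` and `Deduper` are hypothetical.

```python
import re
import time

# Illustrative patterns only; real redaction rules would be broader.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
LONG_DIGITS_RE = re.compile(r"\b\d{9,}\b")  # e.g. card or account numbers


def scrub(message):
    """Replace likely sensitive tokens before the event is stored."""
    message = EMAIL_RE.sub("<email>", message)
    return LONG_DIGITS_RE.sub("<number>", message)


class Deduper:
    """Drops events whose event_id was already seen within ttl_seconds."""

    def __init__(self, ttl_seconds=300.0):
        self.ttl = ttl_seconds
        self.seen = {}  # event_id -> first-seen timestamp

    def is_new(self, event_id, now=None):
        now = time.time() if now is None else now
        # Linear eviction keeps the sketch short; it is not efficient.
        expired = [eid for eid, ts in self.seen.items() if now - ts > self.ttl]
        for eid in expired:
            del self.seen[eid]
        if event_id in self.seen:
            return False
        self.seen[event_id] = now
        return True
```

In a distributed ingestion tier the seen-set would likely live in a shared store with key expiry rather than in process memory, and scrubbing is usually safest when applied as early as possible, ideally in the client library before the event leaves the service.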
### Clarifications to address
- How exceptions are **collected** from services.
- How events are **grouped** (fingerprinting) and how you store/query efficiently.
- What the **database schema** / key columns look like for both raw events and aggregates.
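
As a starting point for the schema discussion, the sketch below models one raw-event table and one per-minute aggregate table using SQLite through Python's standard `sqlite3` module. The table names, column names, one-minute bucket size, and the `ingest` and `top_k` helpers are assumptions for illustration; a production design would more likely pair a log or stream with a columnar or time-series store.

```python
import sqlite3

SCHEMA = """
CREATE TABLE raw_events (
    event_id    TEXT PRIMARY KEY,   -- assumed client-generated, enables dedup
    fingerprint TEXT NOT NULL,      -- group key for the "same exception"
    service     TEXT NOT NULL,
    environment TEXT NOT NULL,
    version     TEXT,
    region      TEXT,
    severity    TEXT,
    ts          INTEGER NOT NULL,   -- epoch seconds
    message     TEXT,
    stack_trace TEXT
);
CREATE INDEX idx_raw_group_time ON raw_events (fingerprint, ts);

CREATE TABLE minute_counts (
    bucket_start INTEGER NOT NULL,  -- epoch seconds, aligned to the minute
    service      TEXT NOT NULL,
    environment  TEXT NOT NULL,
    fingerprint  TEXT NOT NULL,
    event_count  INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (bucket_start, service, environment, fingerprint)
);
"""


def ingest(conn, event):
    """Store a raw event and bump its aggregate; returns False on duplicates."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO raw_events (event_id, fingerprint, service,"
        " environment, version, region, severity, ts, message, stack_trace)"
        " VALUES (:event_id, :fingerprint, :service, :environment, :version,"
        " :region, :severity, :ts, :message, :stack_trace)",
        event,
    )
    if cur.rowcount == 0:  # primary key already present: retry or duplicate
        return False
    key = (event["ts"] // 60 * 60, event["service"],
           event["environment"], event["fingerprint"])
    conn.execute(
        "INSERT OR IGNORE INTO minute_counts (bucket_start, service,"
        " environment, fingerprint, event_count) VALUES (?, ?, ?, ?, 0)", key)
    conn.execute(
        "UPDATE minute_counts SET event_count = event_count + 1"
        " WHERE bucket_start = ? AND service = ? AND environment = ?"
        " AND fingerprint = ?", key)
    return True


def top_k(conn, service, environment, since_ts, k):
    """Top K fingerprints for one service/environment since since_ts."""
    return conn.execute(
        "SELECT fingerprint, SUM(event_count) AS total FROM minute_counts"
        " WHERE service = ? AND environment = ? AND bucket_start >= ?"
        " GROUP BY fingerprint ORDER BY total DESC, fingerprint LIMIT ?",
        (service, environment, since_ts, k),
    ).fetchall()
```

Supporting the version and region filters from the requirements would mean adding those columns to the aggregate key, at the cost of more rows per bucket. The `(fingerprint, ts)` index on the raw table is what serves the drill-down view of recent samples for a group.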
Deliverables: high-level architecture, data flow, storage choices, and APIs used by UI/on-call tooling.
Quick Answer: This question evaluates the ability to design a scalable, low-latency exception monitoring system focusing on streaming ingestion, event grouping/fingerprinting, data modeling for raw and aggregated stores, retention policies, and payload privacy.