System Design: Exception Monitoring with Top-K
Design an exception monitoring system for a microservices environment.
Core requirements
- Services emit exception events (message, stack trace, service name, environment, version, timestamp, severity, request context).
- The system should enable on-call engineers to:
  - View Top K exceptions over a time window (e.g., last 5/15/60 minutes), grouped/deduplicated by “same exception.”
  - Filter by service, environment (prod/staging), deployment version, region.
  - Drill down into a group to see recent samples and aggregated stats.
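One possible way to reason about the Top K view is to keep per-minute counts keyed by a group fingerprint and merge the buckets that overlap the requested window. The sketch below is illustrative only: the class name `WindowedTopK`, the 60-second bucket size, and the in-memory storage are assumptions, not part of the requirements.

```python
import heapq
from collections import Counter, defaultdict

BUCKET_SECONDS = 60  # assumed aggregation granularity


class WindowedTopK:
    """Hypothetical in-memory aggregator: per-minute counts per fingerprint."""

    def __init__(self):
        # bucket start (epoch seconds) -> Counter mapping fingerprint -> count
        self.buckets = defaultdict(Counter)

    def record(self, fingerprint, ts):
        # Align the event timestamp to the start of its bucket.
        bucket = int(ts) // BUCKET_SECONDS * BUCKET_SECONDS
        self.buckets[bucket][fingerprint] += 1

    def top_k(self, k, now, window_seconds):
        # Merge every bucket that overlaps [now - window_seconds, now].
        # The oldest bucket may include up to one bucket's worth of extra
        # events, so the result is approximate at the window edge.
        cutoff = int(now) - window_seconds
        merged = Counter()
        for bucket, counts in self.buckets.items():
            if bucket + BUCKET_SECONDS > cutoff and bucket <= now:
                merged.update(counts)
        return heapq.nlargest(k, merged.items(), key=lambda kv: kv[1])
```

Pre-aggregating into small buckets lets the 5, 15, and 60 minute windows share the same stored counts; a production design would likely move these counters into a stream processor or an OLAP store rather than process memory.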
Non-functional requirements
- High write throughput, low-latency queries for Top K.
- Handle duplicates, retries, bursts (incident storms).
- Retain raw data for debugging (e.g., 7–30 days) and aggregated metrics longer.
- Protect sensitive data in payloads.
Clarifications to address
- How exceptions are collected from services.
- How events are grouped (fingerprinting) and how you store/query efficiently.
- What the database schema / key columns look like for both raw events and aggregates.
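As a starting point for the grouping question, a fingerprint could be derived by stripping volatile tokens (ids, addresses, numbers) from the message and hashing it together with the service name and the innermost stack frames. This is a minimal sketch under stated assumptions: the function names, the `(module, function)` frame shape, the choice of five frames, and the 16-character truncation are all hypothetical.

```python
import hashlib
import re

# Volatile tokens that would otherwise split one bug into many groups.
_UUID = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)
_HEX = re.compile(r"0x[0-9a-fA-F]+")
_NUM = re.compile(r"\d+")


def normalize_message(message):
    # Replace UUIDs first so their digits are not rewritten piecemeal.
    message = _UUID.sub("<uuid>", message)
    message = _HEX.sub("<hex>", message)
    return _NUM.sub("<n>", message)


def fingerprint(service, message, frames, top_n=5):
    # frames: list of (module, function) tuples, innermost first (assumed shape).
    # Line numbers are left out because they shift between deployed versions.
    parts = [service, normalize_message(message)]
    parts += [f"{module}:{function}" for module, function in frames[:top_n]]
    digest = hashlib.sha256("\n".join(parts).encode("utf-8")).hexdigest()
    return digest[:16]
```

With this scheme, "User 123 not found" and "User 456 not found" raised from the same frames in the same service fall into one group, while the same message from a different service stays separate.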
Deliverables: high-level architecture, data flow, storage choices, and APIs used by UI/on-call tooling.
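To make the storage discussion concrete, the following sketch models one possible pair of tables, raw events and per-minute aggregates, using SQLite purely as a stand-in. The table names, column names, the client-generated `event_id` used for deduplication, and the one-minute bucket are assumptions for illustration; a real deployment would probably use a log store plus a columnar or time-series database.

```python
import sqlite3

SCHEMA = """
CREATE TABLE raw_events (
    event_id    TEXT PRIMARY KEY,   -- assumed client-generated id; makes retries idempotent
    fingerprint TEXT NOT NULL,
    service     TEXT NOT NULL,
    environment TEXT NOT NULL,
    version     TEXT NOT NULL,
    region      TEXT NOT NULL,
    severity    TEXT NOT NULL,
    ts          INTEGER NOT NULL,   -- epoch seconds
    message     TEXT,
    stack_trace TEXT
);
CREATE INDEX raw_by_group ON raw_events (fingerprint, ts);
CREATE TABLE exception_counts (
    bucket_start INTEGER NOT NULL,  -- start of the one-minute bucket
    fingerprint  TEXT NOT NULL,
    service      TEXT NOT NULL,
    environment  TEXT NOT NULL,
    version      TEXT NOT NULL,
    region       TEXT NOT NULL,
    event_count  INTEGER NOT NULL,
    PRIMARY KEY (bucket_start, service, environment, version, region, fingerprint)
);
"""

_RAW_COLUMNS = ("event_id", "fingerprint", "service", "environment", "version",
                "region", "severity", "ts", "message", "stack_trace")


def ingest(conn, ev):
    """Store one event; returns False when event_id was already seen."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO raw_events VALUES (?,?,?,?,?,?,?,?,?,?)",
        tuple(ev[c] for c in _RAW_COLUMNS),
    )
    if cur.rowcount == 0:
        return False  # duplicate delivery: do not count it twice
    key = (ev["ts"] // 60 * 60, ev["fingerprint"], ev["service"],
           ev["environment"], ev["version"], ev["region"])
    # Create the aggregate row if missing, then increment it.
    conn.execute("INSERT OR IGNORE INTO exception_counts VALUES (?,?,?,?,?,?,0)", key)
    conn.execute(
        "UPDATE exception_counts SET event_count = event_count + 1 "
        "WHERE bucket_start=? AND fingerprint=? AND service=? "
        "AND environment=? AND version=? AND region=?",
        key,
    )
    return True


def top_k(conn, k, since, environment):
    """Top K fingerprints by count for buckets starting at or after `since`."""
    return conn.execute(
        "SELECT fingerprint, SUM(event_count) AS n FROM exception_counts "
        "WHERE bucket_start >= ? AND environment = ? "
        "GROUP BY fingerprint ORDER BY n DESC LIMIT ?",
        (since, environment, k),
    ).fetchall()
```

Keeping the filter dimensions (service, environment, version, region) in the aggregate key means the Top K query never has to scan raw events; the raw table is only touched when drilling down into a single group.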