Design a multi-tenant alert notification system for operational incidents.
Monitoring sources send events when checks fire or recover. Users can configure alert rules, routing policies, schedules, and escalation chains. The platform must notify the right responders through channels such as email, SMS, push notification, chat integrations, and phone calls.
Assume requirements such as:
- support for millions of alert events per day
- p95 time to first notification under 30 seconds
- at-least-once notification delivery
- deduplication and suppression of repeated alerts
- escalation if an alert is not acknowledged
- user preferences, quiet hours, and on-call rotations
- retries and failover when external notification providers are down
- audit logs and delivery analytics
Describe the APIs, data model, high-level architecture, critical workflows, failure handling, and scaling strategy.
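As one illustration of the "deduplication and suppression of repeated alerts" requirement, here is a minimal in-memory sketch. Everything in it is an assumption made for illustration and not part of the problem statement: the `AlertEvent` fields, the `fingerprint` helper, the `Deduplicator` class, and the 300-second default suppression window. A real design would likely keep this state in a shared store partitioned by tenant, not in a process-local dictionary.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AlertEvent:
    """Hypothetical event shape; field names are assumptions."""
    tenant_id: str
    source: str
    check: str
    status: str       # "firing" or "recovered"
    timestamp: float  # seconds since epoch


def fingerprint(event: AlertEvent) -> str:
    """Stable dedup key: the same tenant, source, and check map to one alert."""
    raw = "|".join((event.tenant_id, event.source, event.check))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


@dataclass
class Deduplicator:
    """Suppresses repeated firing events inside a time window."""
    window_seconds: float = 300.0  # assumed default, not from the prompt
    _last_notified: dict = field(default_factory=dict)

    def should_notify(self, event: AlertEvent) -> bool:
        key = fingerprint(event)
        if event.status == "recovered":
            # Notify on recovery only if the alert was open, then clear the
            # state so the next firing event starts a fresh alert.
            return self._last_notified.pop(key, None) is not None
        last = self._last_notified.get(key)
        if last is not None and event.timestamp - last < self.window_seconds:
            return False  # repeat within the window: suppress
        self._last_notified[key] = event.timestamp
        return True
```

Including the tenant in the fingerprint keeps one tenant's alerts from suppressing another's. Because delivery is at-least-once, downstream notification senders would still need to tolerate an occasional duplicate.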