Design a Multi‑Channel Notification System (Push, SMS, Email) with SLAs, Preferences, and Spike Resilience
Context
You are designing a notifications platform for a large, real‑time consumer marketplace. The system must deliver messages over push, SMS, and email while honoring per‑recipient preferences and per‑message delivery SLAs. Traffic consists of critical transactional events (e.g., order status) and non‑critical marketing/retention messages, with occasional large spikes.
Requirements
- Functional
  - Ingest notification events via an API and fan out to the right channels based on recipient preferences and message policy.
  - Support per‑recipient delivery preferences (channel opt‑in/out, quiet hours, frequency caps, locale) and SLAs (e.g., deliver within 30 seconds).
  - Render templates (per channel) with personalization and localization.
  - Integrate with multiple providers per channel for redundancy and cost control.
- Non‑functional
  - Idempotency and de‑duplication across retries and at‑least‑once pipelines.
  - Rate limiting at multiple levels (recipient, template/campaign, provider quotas).
  - Retries with exponential backoff, jitter, and dead‑letter queues (DLQs).
  - Observability: end‑to‑end tracing, metrics (latency, success, queue depth), provider health, and alerts.
  - High availability and multi‑region failover.
  - Handle sudden 100× traffic spikes without cascading failures.
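To make the idempotency and retry requirements concrete, here is one possible sketch of how a worker might combine a de‑duplication key, exponential backoff with full jitter, and a DLQ. Everything named here is an illustrative assumption rather than part of the prompt: the key fields (`event_id`, `recipient_id`, `channel`), the constants, the in‑memory `Deduper` (a stand‑in for a shared store with a TTL), and the use of `ConnectionError` as the retryable failure.

```python
import hashlib
import random

MAX_ATTEMPTS = 5     # assumed retry budget before dead-lettering
BASE_DELAY_S = 0.5   # assumed backoff ceiling for the first retry
MAX_DELAY_S = 30.0   # assumed cap on any single delay


def idempotency_key(event_id: str, recipient_id: str, channel: str) -> str:
    """Same logical send always maps to the same key, so retries can be detected."""
    raw = f"{event_id}:{recipient_id}:{channel}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def backoff_delay(attempt: int, rng: random.Random) -> float:
    """Full jitter: uniform in [0, min(cap, base * 2**attempt)]."""
    ceiling = min(MAX_DELAY_S, BASE_DELAY_S * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


class Deduper:
    """In-memory stand-in for a shared key store (hypothetical; no TTL here)."""

    def __init__(self):
        self._seen = set()

    def first_time(self, key: str) -> bool:
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


def deliver(key, send, deduper, dlq, rng, sleep=lambda seconds: None):
    """Return 'sent', 'duplicate', or 'dead_lettered'.

    The key is claimed before sending, so a redelivered event is dropped
    even if the first attempt is still in flight or was dead-lettered.
    """
    if not deduper.first_time(key):
        return "duplicate"
    for attempt in range(MAX_ATTEMPTS):
        try:
            send()
            return "sent"
        except ConnectionError:
            # Retryable failure: wait a jittered delay, then try again.
            sleep(backoff_delay(attempt, rng))
    dlq.append(key)  # retries exhausted: park for inspection or replay
    return "dead_lettered"
```

A real design would also need to decide which provider errors are retryable, how long keys live, and whether a dead‑lettered key should block later replays; the sketch leaves those open.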
Deliverables
- Describe the high‑level architecture: ingestion, topic/fanout, queuing, worker fleets, provider integrations.
- Explain idempotency and de‑duplication, rate limiting, retries/backoff with DLQs, template rendering, and observability.
- Deep dive: handling a sudden 100× traffic spike while meeting provider quotas, preventing queue buildup, and avoiding cascading failures. Discuss:
  - Autoscaling triggers
  - Load shedding strategies
  - Priority queues and fairness
  - Multi‑region failover
  - Cost controls
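As a starting point for the deep dive, the following sketch shows one way load shedding, priority queues, and provider quotas could interact: a bounded buffer that drops marketing traffic before critical traffic, drained through a token bucket sized to a provider's rate limit. The two‑tier priority scheme, the class names, and the capacities are all assumptions made for illustration. The eviction step is O(n) for brevity.

```python
import heapq
import itertools

CRITICAL, MARKETING = 0, 1  # assumed tiers; lower number = higher priority


class TokenBucket:
    """Caps sends per second so a spike cannot exceed a provider quota."""

    def __init__(self, rate_per_s: float, burst: float):
        self.rate, self.burst = rate_per_s, burst
        self.tokens, self.last = burst, 0.0

    def try_take(self, now: float) -> bool:
        # Refill in proportion to elapsed time, never above the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class PriorityBuffer:
    """Bounded queue that sheds marketing messages before critical ones."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._heap = []
        self._seq = itertools.count()  # tie-breaker keeps FIFO order per tier

    def __len__(self):
        return len(self._heap)

    def offer(self, priority: int, msg) -> bool:
        if len(self._heap) >= self.capacity:
            if priority != CRITICAL:
                return False  # shed incoming non-critical traffic
            worst = max(self._heap)  # lowest-priority, most recent entry
            if worst[0] == CRITICAL:
                return False  # full of critical work: push back on the caller
            self._heap.remove(worst)
            heapq.heapify(self._heap)
        heapq.heappush(self._heap, (priority, next(self._seq), msg))
        return True

    def poll(self):
        return heapq.heappop(self._heap)[2] if self._heap else None


def drain(buffer: PriorityBuffer, bucket: TokenBucket, now: float) -> list:
    """Send as many messages as the quota allows, highest priority first."""
    sent = []
    while len(buffer) > 0 and bucket.try_take(now):
        sent.append(buffer.poll())
    return sent
```

Strict priority like this can starve marketing traffic during a long spike, which is one reason the fairness question above deserves its own discussion (for example, weighted draining or per‑tenant buckets).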