System Design: Globally Distributed Notification Service
Context
You are designing a multi-tenant notification platform that delivers real-time and scheduled messages (email, SMS, push) to tens of millions of users worldwide. The system must comply with regional data residency and privacy regulations, support high availability across multiple regions, and provide strong operational controls (idempotency, deduplication, rate limiting, retries, monitoring).
Assume:
- Users and data are partitioned by region (e.g., US, EU, APAC) with strict residency for PII.
- The platform offers APIs to trigger individual and bulk notifications and manage templates and user preferences.
- Peak traffic can spike rapidly (e.g., incident alerts, promotions).
Requirements
Design a system that addresses the following:
- APIs and Data Models
  - Define REST APIs for:
    - Sending real-time and scheduled notifications (single and bulk)
    - Managing templates and variables
    - Managing user preferences and subscriptions
    - Retrieving message status and delivery receipts
  - Specify core data models (Message, DeliveryAttempt, Template, Campaign/Job, UserPreference, IdempotencyRecord, RateLimitBucket).
- Deduplication, Idempotency, and Rate Limiting
  - Describe how to prevent duplicate sends across retries and concurrent requests.
  - Provide an idempotency strategy at both the API and worker levels.
  - Define rate limiting scopes (per-user, per-tenant, per-channel, per-provider) and algorithms.
- Storage and Queueing Layers
  - Choose storage for:
    - Control-plane metadata (tenants, templates, campaigns)
    - Regional data-plane (messages, attempts, user preferences)
    - Caching (idempotency keys, rate limiter state)
    - Object/blob storage (large templates, assets)
  - Choose queueing/streaming for fan-out, ordering, retries, scheduled delivery, and DLQs.
- Worker Orchestration, Retry/Backoff, Ordering
  - Describe worker topology and autoscaling.
  - Define retry/backoff policies and DLQ handling by failure type.
  - Specify ordering guarantees (e.g., per-user per-channel) and how partitions/keys enforce it.
- Multi-Region Architecture, Failover, and Disaster Recovery
  - Active-active by region with data residency.
  - Control-plane and data-plane split; inter-region replication where lawful.
  - Provider redundancy and failover strategy.
  - RTO/RPO targets and DR workflows.
- Capacity Planning (Rough Estimates)
  - QPS, throughput, partitions, worker counts, storage footprint, cache sizing.
  - Include formulas and a worked example for tens of millions of users.
- Monitoring and Alerting
  - SLOs/SLIs, key metrics, logs/traces, synthetic checks.
  - Alerting policies and on-call runbooks.
State key assumptions, call out trade-offs, and justify major design choices.
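As an illustration of the level of detail the Capacity Planning item asks for, the following is a minimal Python sketch of a back-of-the-envelope estimate. Every input (user count, messages per user per day, peak-to-average ratio, retry amplification, record size, retention, per-partition and per-worker throughput, headroom) is a hypothetical placeholder, not a figure from this brief, and the names CapacityInputs and estimate are likewise assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class CapacityInputs:
    # All defaults are hypothetical planning inputs chosen for illustration.
    users: int = 50_000_000                 # "tens of millions of users"
    msgs_per_user_per_day: float = 3.0      # assumed average across channels
    peak_to_avg: float = 10.0               # assumed burst factor (alerts, promotions)
    attempts_per_msg: float = 1.2           # assumed retry amplification
    bytes_per_message: int = 2_000          # assumed message + attempt record size
    retention_days: int = 30                # assumed retention window
    per_partition_msgs_per_sec: int = 2_000 # assumed queue partition throughput
    per_worker_msgs_per_sec: int = 200      # assumed worker throughput
    headroom: float = 0.3                   # assumed 30% spare capacity


def estimate(c: CapacityInputs) -> dict:
    """Derive rough sizing numbers from the inputs using simple ratios."""
    msgs_per_day = c.users * c.msgs_per_user_per_day
    avg_qps = msgs_per_day / 86_400                   # seconds per day
    peak_qps = avg_qps * c.peak_to_avg
    peak_attempts = peak_qps * c.attempts_per_msg     # includes retries
    provisioned = peak_attempts * (1 + c.headroom)    # capacity to provision
    return {
        "avg_qps": avg_qps,
        "peak_qps": peak_qps,
        "partitions": math.ceil(provisioned / c.per_partition_msgs_per_sec),
        "workers": math.ceil(provisioned / c.per_worker_msgs_per_sec),
        # Total across all regions; split by regional share for residency.
        "storage_tb": msgs_per_day * c.retention_days * c.bytes_per_message / 1e12,
    }


if __name__ == "__main__":
    for name, value in estimate(CapacityInputs()).items():
        print(f"{name}: {value:,.1f}")
```

With these placeholder inputs the sketch gives about 1,736 messages/s on average, about 17,361 messages/s at peak, 14 partitions, 136 workers, and 9 TB of retained message data. These numbers only show the shape of the calculation; measured values should replace the placeholders before any of them are relied on.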