Design a global notification service

This is a System Design interview question from TikTok for Software Engineer roles.

Question

System Design: Globally Distributed Notification Service

Context

You are designing a multi-tenant notification platform that delivers real-time and scheduled messages (email, SMS, push) to tens of millions of users worldwide. The system must comply with regional data residency and privacy regulations, support high availability across multiple regions, and provide strong operational controls (idempotency, deduplication, rate limiting, retries, monitoring).

Assume:

Users and data are partitioned by region (e.g., US, EU, APAC) with strict residency for PII.
The platform offers APIs to trigger individual and bulk notifications and manage templates and user preferences.
Peak traffic can spike rapidly (e.g., incident alerts, promotions).

Requirements

Design a system that addresses the following:

APIs and Data Models

Define REST APIs for:
- Sending real-time and scheduled notifications (single and bulk)
- Managing templates and variables
- Managing user preferences and subscriptions
- Retrieving message status and delivery receipts
Specify core data models (Message, DeliveryAttempt, Template, Campaign/Job, UserPreference, IdempotencyRecord, RateLimitBucket).
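
As a starting point for the data-model part, here is a minimal sketch of two of the listed entities (Message and DeliveryAttempt) as Python dataclasses. Every field name, the `Channel` and `Status` enums, and the `accept` helper are illustrative assumptions, not part of the question; a real answer would also cover Template, Campaign/Job, UserPreference, IdempotencyRecord, and RateLimitBucket.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional
import uuid


class Channel(str, Enum):
    EMAIL = "email"
    SMS = "sms"
    PUSH = "push"


class Status(str, Enum):
    # Hypothetical lifecycle states; the question does not prescribe them.
    ACCEPTED = "accepted"
    SCHEDULED = "scheduled"
    SENT = "sent"
    DELIVERED = "delivered"
    FAILED = "failed"


@dataclass
class Message:
    tenant_id: str
    user_id: str
    region: str                # user's home region; PII is stored only there
    channel: Channel
    template_id: str
    variables: dict
    idempotency_key: str       # supplied by the caller to make retries safe
    send_at: Optional[datetime] = None   # timezone-aware; None means "send now"
    status: Status = Status.ACCEPTED
    message_id: str = field(default_factory=lambda: str(uuid.uuid4()))


@dataclass
class DeliveryAttempt:
    message_id: str
    attempt_no: int
    provider: str
    outcome: str               # e.g. "ok", "transient_error", "permanent_error"
    attempted_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def accept(msg: Message) -> Message:
    """Mark a message as scheduled if its send time is still in the future."""
    now = datetime.now(timezone.utc)
    if msg.send_at is not None and msg.send_at > now:
        msg.status = Status.SCHEDULED
    return msg
```

Keeping `region` on the message itself is one way to make residency checks explicit at every hop; whether that belongs on the record or is derived from the user is a design choice to discuss.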

Deduplication, Idempotency, and Rate Limiting

Describe how to prevent duplicate sends across retries and concurrent requests.
Provide an idempotency strategy at both the API and worker levels.
Define rate limiting scopes (per-user, per-tenant, per-channel, per-provider) and algorithms.
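
To make the three points above concrete, the following is a minimal in-memory sketch, assuming a put-if-absent idempotency record keyed by (tenant, idempotency key) and a token bucket per rate-limit scope. The class names `IdempotencyStore`, `TokenBucket`, and `RateLimiter` are hypothetical, and token bucket is only one of several algorithms a candidate could choose. A production version would keep this state in a shared regional store with atomic conditional writes and a TTL.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


class IdempotencyStore:
    """In-memory stand-in for a store that supports atomic put-if-absent."""

    def __init__(self) -> None:
        self._seen: Dict[Tuple[str, str], str] = {}

    def claim(self, tenant_id: str, key: str, message_id: str) -> Tuple[bool, str]:
        """Return (is_new, message_id); a repeated key yields the first message_id."""
        k = (tenant_id, key)
        if k in self._seen:
            return False, self._seen[k]
        self._seen[k] = message_id
        return True, message_id


@dataclass
class TokenBucket:
    rate: float       # tokens refilled per second
    capacity: float   # maximum burst size
    tokens: float
    updated: float

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill based on elapsed time, capped at capacity, then try to spend.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class RateLimiter:
    """One bucket per scope tuple, e.g. ("tenant", "t1") or ("user", "u1", "sms")."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._rate = rate
        self._capacity = capacity
        self._clock = clock
        self._buckets: Dict[tuple, TokenBucket] = {}

    def allow(self, scope: tuple) -> bool:
        now = self._clock()
        bucket = self._buckets.get(scope)
        if bucket is None:
            # New scopes start with a full bucket.
            bucket = TokenBucket(self._rate, self._capacity, self._capacity, now)
            self._buckets[scope] = bucket
        return bucket.allow(now)
```

A request would typically be checked against several scopes in turn (per-tenant, then per-user per-channel, then per-provider) and rejected or delayed if any bucket is empty. The same `claim` call can be repeated at the worker just before the provider call, so a redelivered queue message does not trigger a second send.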

Storage and Queueing Layers

Choose storage for:
- Control-plane metadata (tenants, templates, campaigns)
- Regional data-plane (messages, attempts, user preferences)
- Caching (idempotency keys, rate limiter state)
- Object/blob storage (large templates, assets)
Choose queueing/streaming for fan-out, ordering, retries, scheduled delivery, and DLQs.

Worker Orchestration, Retry/Backoff, Ordering

Describe worker topology and autoscaling.
Define retry/backoff policies and DLQ handling by failure type.
Specify ordering guarantees (e.g., per-user per-channel) and how partitions/keys enforce it.
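
The retry and ordering points above can be illustrated with a small sketch. It assumes a hash of (user, channel) as the partition key, exponential backoff with full jitter, and a two-way failure classification ("transient" vs "permanent"). The function names, default delays, and attempt limit are assumptions chosen for illustration only.

```python
import hashlib
import random
from typing import Callable


def partition_for(user_id: str, channel: str, num_partitions: int) -> int:
    """Map every message for one (user, channel) pair to the same partition."""
    digest = hashlib.sha256(f"{user_id}:{channel}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 300.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Exponential backoff with full jitter.

    The delay is drawn uniformly between 0 and min(cap, base * 2**attempt) seconds.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


def next_step(failure: str, attempt: int, max_attempts: int = 5) -> str:
    """Decide what happens after a failed attempt (attempt is 1-based).

    Permanent failures (e.g. invalid address) go straight to the DLQ;
    transient ones are retried until max_attempts is reached.
    """
    if failure == "permanent":
        return "dlq"
    if attempt >= max_attempts:
        return "dlq"
    return "retry"
```

One trade-off worth calling out: if a failed message is moved to a separate retry queue while later messages on the same partition continue, strict per-user per-channel order is no longer preserved. Blocking the partition keeps order but lets one bad message delay everything behind it.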

Multi-Region Architecture, Failover, and Disaster Recovery

Active-active by region with data residency.
Control-plane and data-plane split; inter-region replication where lawful.
Provider redundancy and failover strategy.
RTO/RPO targets and DR workflows.
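
A residency-aware failover rule can be sketched as follows, assuming each home region has an ordered list of sites inside the same jurisdiction. The site names and the `pick_site` helper are hypothetical. The key property is that the function returns nothing rather than routing PII to a site outside the user's region.

```python
from typing import Dict, List, Optional, Set

# Hypothetical topology: ordered failover candidates per residency region.
SITES_BY_REGION: Dict[str, List[str]] = {
    "US": ["us-east", "us-west"],
    "EU": ["eu-west", "eu-central"],
    "APAC": ["ap-southeast", "ap-northeast"],
}


def pick_site(home_region: str, healthy: Set[str],
              sites_by_region: Dict[str, List[str]] = SITES_BY_REGION) -> Optional[str]:
    """Return the first healthy site inside the user's home region.

    Returns None when no in-region site is healthy, so the caller can queue
    or reject the request instead of crossing a residency boundary.
    """
    for site in sites_by_region.get(home_region, []):
        if site in healthy:
            return site
    return None
```

Whether to queue, reject, or degrade when `pick_site` returns None is exactly the kind of trade-off that feeds into the RTO/RPO discussion.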

Capacity Planning (Rough Estimates)

Estimate QPS, throughput, partition counts, worker counts, storage footprint, and cache sizing.
Include formulas and a worked example for tens of millions of users.
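
The worked example could take the following shape. All inputs are assumed round numbers, not figures from the question: 50 million users, 5 notifications per user per day, a 10x peak-to-average ratio, 1,000 messages/s per partition, 200 messages/s per worker, 1,000 bytes per stored message, 30-day retention, and 3x replication. With those inputs the formulas give roughly 2,900 average QPS, about 29,000 peak QPS, 29 partitions, 145 workers, and 22.5 TB of message storage.

```python
import math
from dataclasses import dataclass


@dataclass
class Assumptions:
    # Every default below is an illustrative assumption.
    users: int = 50_000_000
    notifications_per_user_per_day: float = 5.0
    peak_to_average: float = 10.0
    partition_msgs_per_sec: float = 1_000.0
    worker_msgs_per_sec: float = 200.0
    bytes_per_message: int = 1_000
    retention_days: int = 30
    replication_factor: int = 3


def estimate(a: Assumptions) -> dict:
    """Back-of-the-envelope sizing from the assumptions above."""
    daily = a.users * a.notifications_per_user_per_day
    avg_qps = daily / 86_400                      # seconds per day
    peak_qps = avg_qps * a.peak_to_average
    storage_bytes = (daily * a.bytes_per_message
                     * a.retention_days * a.replication_factor)
    return {
        "daily_messages": daily,
        "avg_qps": avg_qps,
        "peak_qps": peak_qps,
        # Size partitions and workers for peak, not average, load.
        "partitions": math.ceil(peak_qps / a.partition_msgs_per_sec),
        "workers": math.ceil(peak_qps / a.worker_msgs_per_sec),
        "storage_tb": storage_bytes / 1e12,
    }


if __name__ == "__main__":
    for name, value in estimate(Assumptions()).items():
        print(f"{name}: {value:,.1f}")
```

In an interview, stating the inputs first and then varying one of them (for example the peak ratio during an incident alert) shows how sensitive the partition and worker counts are to the assumptions.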

Monitoring and Alerting

SLOs/SLIs, key metrics, logs/traces, synthetic checks.
Alerting policies and on-call runbooks.
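
As one example of turning an SLO into an alerting policy, the sketch below computes a delivery-success SLI and an error-budget burn rate (observed error rate divided by the error rate the SLO allows). The function names are hypothetical, and the 14.4 paging threshold is an assumption borrowed from common multi-window burn-rate practice for a 30-day SLO; a real policy would tune it.

```python
def success_ratio(delivered: int, total: int) -> float:
    """Delivery-success SLI over a window; an empty window counts as healthy."""
    if total == 0:
        return 1.0
    return delivered / total


def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """How many times faster than allowed the error budget is being spent.

    A value of 1.0 means the budget would last exactly the SLO period.
    """
    if total == 0:
        return 0.0
    allowed_error_rate = 1.0 - slo_target
    return (failed / total) / allowed_error_rate


def should_page(failed: int, total: int, slo_target: float,
                threshold: float = 14.4) -> bool:
    """Page on-call when the burn rate in the window exceeds the threshold."""
    return burn_rate(failed, total, slo_target) >= threshold
```

For a 99.9% delivery SLO, 10 failures out of 1,000 sends in the window is a burn rate of about 10, which would not page at the assumed threshold, while 20 failures would.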

State key assumptions, call out trade-offs, and justify major design choices.
