Design a Scalable Notification System
Company: J.P. Morgan
Role: Software Engineer
Category: System Design
Difficulty: medium
Interview Round: Onsite
## Design a Scalable Notification System
Design a notification service that other internal teams call to deliver messages to users across multiple channels: **push notifications** (mobile/web), **email**, and **SMS**. A team should be able to send a notification with a single API call (e.g. "order shipped," "password reset code," "weekly digest"), and the system should reliably fan it out to the right channel(s).
The interview will probe **scalability**, **reliability**, **rate limiting**, and the role of **message queues** (Kafka / RabbitMQ) in the architecture.
### Constraints & Assumptions
- Treat this as a large-scale internal platform: assume on the order of **tens of millions of notifications per day** with bursty peaks (e.g. a marketing blast or an incident alert to all users).
- Three channels at launch — **push, email, SMS** — but the design should make adding a channel (e.g. WhatsApp, in-app inbox) cheap.
- External delivery providers (APNs/FCM for push, an email provider like SES/SendGrid, an SMS provider like Twilio) are third parties with their **own rate limits and occasional outages**.
- Some notifications are **transactional and high-priority** (OTP, password reset); others are **bulk and lower-priority** (digests, promotions).
- The exact numbers are assumptions to state and adjust with the interviewer, not hard facts.
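To make the scale assumption concrete, a rough back-of-envelope calculation can be sketched as follows. The daily volume, peak factor, and drain throughput are illustrative assumptions, not figures from the prompt; substitute whatever numbers you agree on with the interviewer.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_rate(per_day: int) -> float:
    """Average notifications per second for a given daily volume."""
    return per_day / SECONDS_PER_DAY


def peak_rate(per_day: int, peak_factor: float) -> float:
    """Peak rate, modelled as a simple multiple of the average rate."""
    return average_rate(per_day) * peak_factor


def blast_drain_seconds(backlog: int, throughput_per_s: float) -> float:
    """Time to drain a queued backlog at a fixed worker throughput."""
    return backlog / throughput_per_s


if __name__ == "__main__":
    daily_volume = 50_000_000  # assumed: "tens of millions per day"
    burst_factor = 10.0        # assumed: peak is 10x the average
    print(round(average_rate(daily_volume)))             # about 579 per second
    print(round(peak_rate(daily_volume, burst_factor)))  # about 5787 per second
    # Assumed: a 10M-message blast drained at 5,000 messages per second.
    print(blast_drain_seconds(10_000_000, 5_000))        # 2000.0 seconds
```

The point of the arithmetic is that the average rate is modest, so the design is driven by bursts and by how long a backlog may sit in the queue.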
### Clarifying Questions to Ask
- What **delivery guarantee** do we need per channel — at-least-once (with client dedup) or best-effort? Is exactly-once a requirement anywhere?
- Do we need **user preferences and opt-outs** (per-channel, per-category) and quiet-hours, or is routing decided entirely by the caller?
- What **latency targets** distinguish transactional (OTP within seconds) from bulk (digests within minutes/hours)?
- Do we need **templating, localization, and personalization**, or do callers send fully-rendered content?
- What are the **read-side requirements** — delivery status tracking, an in-app inbox, analytics — versus pure fire-and-forget send?
- Are there **compliance** constraints (SMS sending windows, unsubscribe/CAN-SPAM requirements, a cap on how many times an OTP is retried)?
### Part 1: Core Architecture and Send Path
Sketch the high-level architecture and the **write/send path**: the public API, how a single request is validated and fanned out to one or more channels, the data model, and the channel-specific senders/adapters. Show how the design keeps channels pluggable.
```hint Decompose by responsibility
Split into an ingestion/API tier, a routing/orchestration tier (resolve recipient + preferences + template), and per-channel worker pools behind provider adapters. Decouple these tiers with a queue so the API can return immediately with an "accepted" response (e.g. HTTP 202) without waiting on slow third-party providers.
```
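One way to keep channels pluggable is a small adapter interface plus a registry keyed by channel name. The sketch below is a minimal illustration under stated assumptions: `Notification`, `ChannelAdapter`, `FakeAdapter`, and `Router` are hypothetical names, the adapter records messages instead of calling a real provider, and dispatch is synchronous for brevity (a real design would publish to a per-channel queue instead).

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol, Tuple


@dataclass(frozen=True)
class Notification:
    user_id: str
    category: str
    body: str
    channels: Tuple[str, ...]  # e.g. ("push", "email")


class ChannelAdapter(Protocol):
    """Anything that can deliver a rendered body to one destination."""

    def send(self, destination: str, body: str) -> str: ...


class FakeAdapter:
    """Hypothetical stand-in for a provider client; records sends."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.sent: List[Tuple[str, str]] = []

    def send(self, destination: str, body: str) -> str:
        self.sent.append((destination, body))
        return f"{self.name}-{len(self.sent)}"  # fake provider message id


class Router:
    """Fans one notification out to the adapters of its channels."""

    def __init__(self) -> None:
        self._adapters: Dict[str, ChannelAdapter] = {}

    def register(self, channel: str, adapter: ChannelAdapter) -> None:
        self._adapters[channel] = adapter

    def dispatch(self, n: Notification, destinations: Dict[str, str]) -> Dict[str, str]:
        results: Dict[str, str] = {}
        for channel in n.channels:
            adapter = self._adapters.get(channel)
            destination = destinations.get(channel)
            if adapter is None or destination is None:
                results[channel] = "skipped"  # unknown channel or no address
                continue
            results[channel] = adapter.send(destination, n.body)
        return results
```

With this shape, adding a channel such as WhatsApp means writing one adapter and calling `register`; the router and the API contract stay unchanged.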
#### Clarifying Questions for this Part
- Does the caller pass a `userId` (we resolve address/device tokens) or fully-resolved destinations?
- Is idempotency required on the send API (a client-supplied idempotency key to dedupe retries)?
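If the answer to the idempotency question is yes, the send API can remember each client-supplied key for a bounded window and return the original notification id on a retry. The sketch below keeps state in a process-local dict purely for illustration; `IdempotencyStore` and its parameters are hypothetical, and a shared store would be needed across API instances (for example Redis `SET key value NX EX ttl`, which is an assumption about the deployment, not something the prompt specifies).

```python
import time
from typing import Callable, Dict, Tuple


class IdempotencyStore:
    """Remembers (caller, key) pairs for a TTL so retries are deduplicated."""

    def __init__(self, ttl_seconds: float = 86_400.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        # (caller, key) -> (expiry time, notification id)
        self._seen: Dict[Tuple[str, str], Tuple[float, str]] = {}

    def get_or_create(self, caller: str, key: str, new_id: str) -> Tuple[str, bool]:
        """Return (notification_id, created). created is False on a duplicate."""
        now = self._clock()
        entry = self._seen.get((caller, key))
        if entry is not None and entry[0] > now:
            return entry[1], False  # retry of a request we already accepted
        self._seen[(caller, key)] = (now + self._ttl, new_id)
        return new_id, True
```

Scoping the key by caller avoids collisions between teams that happen to pick the same key, and the TTL bounds how much state the store has to hold.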
### Part 2: Scalability, Reliability, and Message Queues
Explain why **message queues (Kafka / RabbitMQ)** are essential here, and how you use them to make the system scalable and reliable. Address: decoupling, buffering bursts, per-channel worker scaling, retries with backoff, dead-letter handling, idempotency/deduplication, and what happens when a downstream provider is slow or down.
```hint Why a queue at all
The API must not block on third-party providers that are slow, rate-limited, or down. A durable queue lets the API accept-and-return, absorbs traffic spikes (the queue depth grows instead of the API falling over), and lets each channel scale its consumer pool independently.
```
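The accept-and-return pattern can be sketched with an in-memory queue standing in for the broker. This is only an illustration: Python's `queue.Queue` is not durable, so it shows the decoupling and backpressure behaviour but not the persistence that Kafka or RabbitMQ would provide. `Ingest`, `run_worker`, and `start_worker` are hypothetical names.

```python
import queue
import threading
from typing import Callable


class Ingest:
    """API-side handler: enqueue and return without calling any provider."""

    def __init__(self, capacity: int) -> None:
        self.q: queue.Queue = queue.Queue(maxsize=capacity)

    def accept(self, request: dict) -> dict:
        try:
            self.q.put_nowait(request)  # never blocks the caller
        except queue.Full:
            # Backlog limit reached: shed load instead of falling over.
            return {"status": "rejected", "reason": "backlog_full"}
        return {"status": "accepted", "id": request["id"]}


def run_worker(q: queue.Queue, deliver: Callable[[dict], None]) -> None:
    """Channel worker loop: consume until a None sentinel arrives."""
    while True:
        item = q.get()
        if item is None:
            q.task_done()
            return
        try:
            deliver(item)  # the slow third-party call happens here
        finally:
            q.task_done()


def start_worker(q: queue.Queue, deliver: Callable[[dict], None]) -> threading.Thread:
    t = threading.Thread(target=run_worker, args=(q, deliver), daemon=True)
    t.start()
    return t
```

The caller's latency depends only on the enqueue, while the number of workers per channel can be scaled independently of the API tier.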
```hint Reliability mechanics
Think at-least-once delivery + consumer idempotency (dedup key) to tolerate redelivery, exponential backoff with jitter on provider errors, a dead-letter queue for poison messages, and separate high-priority vs bulk queues/topics so an OTP never sits behind a million-message digest blast.
```
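The retry and dead-letter mechanics above can be sketched as follows. The sketch uses "full jitter" backoff (a uniform random delay up to an exponentially growing cap) and an in-process sleep for simplicity; in a real deployment the retry would more likely be a delayed re-publish to the queue. `TransientError`, `PermanentError`, `backoff_delay`, and `deliver_with_retry` are hypothetical names, and the default base, cap, and attempt count are assumptions.

```python
import random
import time
from typing import Callable, List


class TransientError(Exception):
    """Provider timeout, throttling response, or 5xx: worth retrying."""


class PermanentError(Exception):
    """Invalid address, malformed payload: retrying cannot help."""


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 60.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Full jitter: uniform delay in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))


def deliver_with_retry(message: dict, send: Callable[[dict], None],
                       dead_letters: List[dict], max_attempts: int = 5,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Try to send; park the message in the dead-letter list on failure."""
    for attempt in range(max_attempts):
        try:
            send(message)
            return True
        except PermanentError:
            break  # poison message: go straight to the dead-letter list
        except TransientError:
            if attempt + 1 < max_attempts:
                sleep(backoff_delay(attempt))
    dead_letters.append(message)
    return False
```

Because a retry can follow a send that actually succeeded at the provider (for example after a timeout), this loop gives at-least-once behaviour and still relies on a dedup key downstream.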
### Part 3: Rate Limiting
Describe how you implement **rate limiting**, and at which layers. Cover protecting downstream providers (which impose their own caps), protecting users from notification spam, and protecting the platform from an abusive or buggy caller.
```hint Where limits live
Distinguish three places: ingress limits per calling service/tenant, per-user/per-channel limits (anti-spam, quiet hours), and egress throttling toward each provider to stay under its quota. A distributed counter (e.g. token bucket in Redis) is the usual mechanism.
```
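A token bucket, the mechanism named in the hint, can be sketched in a few lines. This version is process-local and single-threaded for clarity; `TokenBucket` and its parameters are hypothetical, and a distributed deployment would keep the token count and timestamp in a shared store such as Redis and update them atomically.

```python
import time
from typing import Callable


class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `rate_per_s` tokens/second."""

    def __init__(self, rate_per_s: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_s
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity  # start full so an initial burst is allowed
        self._last = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Take `cost` tokens if available; never blocks."""
        now = self._clock()
        # Refill for the elapsed time, but never beyond capacity.
        self._tokens = min(self.capacity,
                           self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

The same class can serve all three layers by changing the key it is stored under: one bucket per calling service at ingress, one per user and channel for anti-spam, and one per provider at egress, sized below that provider's quota.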
### Follow-up Questions
- A provider (e.g. the SMS gateway) goes down for 30 minutes. Walk through exactly what happens to in-flight and new SMS notifications, and how the system recovers without losing or duplicating messages.
- How do you guarantee a one-time-passcode is delivered quickly even during a million-message marketing blast on the same platform?
- How would you add **delivery-status tracking** and an in-app inbox (the read path) on top of this send-oriented design?
- Compare Kafka vs RabbitMQ for this system — which would you pick for the bulk-fan-out path and which (if any) for transactional, and why?
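For the OTP-during-a-blast follow-up, one possible starting point is separate lanes with a strict-priority consumer, sketched below with in-memory deques standing in for separate queues or topics. `PriorityLanes` is a hypothetical name, and strict priority is only one option: it can starve bulk traffic under sustained high-priority load, so a weighted scheme or dedicated worker pools per lane may be preferable.

```python
from collections import deque
from typing import Deque, Optional


class PriorityLanes:
    """Two lanes; the consumer always drains high-priority first."""

    def __init__(self) -> None:
        self.high: Deque[dict] = deque()
        self.bulk: Deque[dict] = deque()

    def publish(self, message: dict, priority: str) -> None:
        lane = self.high if priority == "high" else self.bulk
        lane.append(message)

    def next_message(self) -> Optional[dict]:
        if self.high:
            return self.high.popleft()
        if self.bulk:
            return self.bulk.popleft()
        return None  # both lanes are empty
```

An OTP published after a large digest backlog is still the next message a worker picks up, because it never shares a lane with the bulk traffic.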
Quick Answer: This question tests whether a candidate can design a large-scale, asynchronous notification platform that fans messages out across push, email, and SMS. It covers scalability, reliability, message-queue architecture (Kafka, RabbitMQ), and multi-layer rate limiting, and it is used to assess practical distributed-systems reasoning at the architectural level.