
Design a Scalable Notification System

Last updated: Jul 1, 2026

Quick Overview

This question evaluates a candidate's ability to design a large-scale, asynchronous notification platform that fans messages out across push, email, and SMS channels. It tests system design skills around scalability, reliability, message queue architecture (Kafka, RabbitMQ), and multi-layer rate limiting, and is commonly used to assess practical distributed-systems reasoning at a conceptual and architectural level.


Company: J.P. Morgan

Role: Software Engineer

Category: System Design

Difficulty: medium

Interview Round: Onsite

## Design a Scalable Notification System

Design a notification service that other internal teams call to deliver messages to users across multiple channels: **push notifications** (mobile/web), **email**, and **SMS**. A team should be able to send a notification with a single API call (e.g. "order shipped," "password reset code," "weekly digest"), and the system should reliably fan it out to the right channel(s).

The interview will probe **scalability**, **reliability**, **rate limiting**, and the role of **message queues** (Kafka / RabbitMQ) in the architecture.

### Constraints & Assumptions

- Treat this as a large-scale internal platform: assume on the order of **tens of millions of notifications per day** with bursty peaks (e.g. a marketing blast or an incident alert to all users).
- Three channels at launch (**push, email, SMS**), but the design should make adding a channel (e.g. WhatsApp, in-app inbox) cheap.
- External delivery providers (APNs/FCM for push, an email provider like SES/SendGrid, an SMS provider like Twilio) are third parties with their **own rate limits and occasional outages**.
- Some notifications are **transactional and high-priority** (OTP, password reset); others are **bulk and lower-priority** (digests, promotions).
- The exact numbers are assumptions to state and adjust with the interviewer, not hard facts.

### Clarifying Questions to Ask

- What **delivery guarantee** do we need per channel: at-least-once (with client dedup) or best-effort? Is exactly-once a requirement anywhere?
- Do we need **user preferences and opt-outs** (per-channel, per-category) and quiet hours, or is routing decided entirely by the caller?
- What **latency targets** distinguish transactional (OTP within seconds) from bulk (digests within minutes/hours)?
- Do we need **templating, localization, and personalization**, or do callers send fully-rendered content?
- What are the **read-side requirements** (delivery status tracking, an in-app inbox, analytics) versus pure fire-and-forget send?
- Are there **compliance** constraints (SMS sending windows, unsubscribe/CAN-SPAM, OTP not retried indefinitely)?

### Part 1: Core Architecture and Send Path

Sketch the high-level architecture and the **write/send path**: the public API, how a single request is validated and fanned out to one or more channels, the data model, and the channel-specific senders/adapters. Show how the design keeps channels pluggable.

```hint Decompose by responsibility
Split into an ingestion/API tier, a routing/orchestration tier (resolve recipient + preferences + template), and per-channel worker pools behind provider adapters. Decouple these tiers with a queue so the API can return immediately (accepted) without waiting on slow third-party providers.
```

#### Clarifying Questions for this Part

- Does the caller pass a `userId` (we resolve address/device tokens) or fully-resolved destinations?
- Is idempotency required on the send API (a client-supplied idempotency key to dedupe retries)?

### Part 2: Scalability, Reliability, and Message Queues

Explain why **message queues (Kafka / RabbitMQ)** are essential here, and how you use them to make the system scalable and reliable. Address: decoupling, buffering bursts, per-channel worker scaling, retries with backoff, dead-letter handling, idempotency/deduplication, and what happens when a downstream provider is slow or down.

```hint Why a queue at all
The API must not block on third-party providers that are slow, rate-limited, or down. A durable queue lets the API accept-and-return, absorbs traffic spikes (the queue depth grows instead of the API falling over), and lets each channel scale its consumer pool independently.
```

```hint Reliability mechanics
Think at-least-once delivery + consumer idempotency (dedup key) to tolerate redelivery, exponential backoff with jitter on provider errors, a dead-letter queue for poison messages, and separate high-priority vs bulk queues/topics so an OTP never sits behind a million-message digest blast.
```

### Part 3: Rate Limiting

Describe how you implement **rate limiting**, and at which layers. Cover protecting downstream providers (which impose their own caps), protecting users from notification spam, and protecting the platform from an abusive or buggy caller.

```hint Where limits live
Distinguish three places: ingress limits per calling service/tenant, per-user/per-channel limits (anti-spam, quiet hours), and egress throttling toward each provider to stay under its quota. A distributed counter (e.g. token bucket in Redis) is the usual mechanism.
```

### Follow-up Questions

- A provider (e.g. the SMS gateway) goes down for 30 minutes. Walk through exactly what happens to in-flight and new SMS notifications, and how the system recovers without losing or duplicating messages.
- How do you guarantee a one-time passcode is delivered quickly even during a million-message marketing blast on the same platform?
- How would you add **delivery-status tracking** and an in-app inbox (the read path) on top of this send-oriented design?
- Compare Kafka vs RabbitMQ for this system: which would you pick for the bulk-fan-out path and which (if any) for transactional, and why?
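For Part 1, one possible shape of the send path is sketched below in Python. It is a minimal illustration, not a reference design: `Notification`, `ChannelSender`, `RecordingSender`, and `NotificationService` are hypothetical names, the in-memory `deque` stands in for a durable queue, and the in-memory set stands in for a persistent idempotency store.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, List, Protocol, Set, Tuple


@dataclass(frozen=True)
class Notification:
    idempotency_key: str        # client-supplied; dedupes retried API calls
    user_id: str
    template: str               # e.g. "order_shipped"
    channels: Tuple[str, ...]   # e.g. ("push", "email")


class ChannelSender(Protocol):
    """Adapter interface every channel implements (hypothetical)."""
    def send(self, user_id: str, body: str) -> None: ...


class RecordingSender:
    """Stand-in adapter; a real one would call a push, email, or SMS provider."""
    def __init__(self) -> None:
        self.sent: List[Tuple[str, str]] = []

    def send(self, user_id: str, body: str) -> None:
        self.sent.append((user_id, body))


class NotificationService:
    def __init__(self) -> None:
        self._senders: Dict[str, ChannelSender] = {}
        self._seen_keys: Set[str] = set()   # assumption: a durable store in practice
        self.queue: Deque[Tuple[str, Notification]] = deque()

    def register_channel(self, name: str, sender: ChannelSender) -> None:
        # Adding a channel is one registration; the API and queue do not change.
        self._senders[name] = sender

    def submit(self, n: Notification) -> str:
        """API tier: validate, dedupe, enqueue one task per channel, return at once."""
        unknown = [c for c in n.channels if c not in self._senders]
        if unknown:
            raise ValueError(f"unknown channels: {unknown}")
        if n.idempotency_key in self._seen_keys:
            return "duplicate"
        self._seen_keys.add(n.idempotency_key)
        for channel in n.channels:
            self.queue.append((channel, n))
        return "accepted"

    def drain(self) -> int:
        """Worker tier stand-in: pop tasks and hand them to the channel adapter."""
        delivered = 0
        while self.queue:
            channel, n = self.queue.popleft()
            self._senders[channel].send(n.user_id, f"[{n.template}]")
            delivered += 1
        return delivered
```

In this sketch `submit` never touches a provider, so the caller gets an answer as soon as the tasks are enqueued, and a new channel such as WhatsApp would only need another `register_channel` call with its own adapter.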
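For Part 2, the reliability mechanics named in the hint (at-least-once delivery with a dedup key, exponential backoff with jitter, and a dead-letter queue) can be sketched roughly as follows. Everything here is illustrative: `TransientProviderError`, `Message`, and `Consumer` are hypothetical names, the "full jitter" variant of backoff is one common choice among several, and the in-memory collections stand in for broker-managed retry and dead-letter queues.

```python
import random
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, List, Set, Tuple


class TransientProviderError(Exception):
    """Hypothetical error an adapter raises for retryable provider failures."""


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 60.0) -> float:
    """Full-jitter backoff: uniform in [0, min(cap, base * 2**attempt)] seconds."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


@dataclass
class Message:
    dedup_key: str
    payload: str
    attempt: int = 0


class Consumer:
    def __init__(self, send: Callable[[str], None], max_attempts: int = 3) -> None:
        self.send = send
        self.max_attempts = max_attempts
        self.delivered_keys: Set[str] = set()                # assumption: shared store in practice
        self.retry_queue: Deque[Tuple[float, Message]] = deque()  # (delay_seconds, message)
        self.dead_letters: List[Message] = []

    def handle(self, msg: Message) -> str:
        # Redelivery is expected under at-least-once, so skip keys already sent.
        if msg.dedup_key in self.delivered_keys:
            return "skipped-duplicate"
        try:
            self.send(msg.payload)
        except TransientProviderError:
            msg.attempt += 1
            if msg.attempt >= self.max_attempts:
                self.dead_letters.append(msg)    # give up; park for inspection
                return "dead-lettered"
            self.retry_queue.append((backoff_delay(msg.attempt), msg))
            return "retry-scheduled"
        # Recorded after the send: a crash between the two lines can still
        # cause one duplicate, which is the at-least-once trade-off.
        self.delivered_keys.add(msg.dedup_key)
        return "delivered"
```

The cap on attempts also gives a natural place to encode the compliance concern raised earlier, for example a small `max_attempts` for OTP messages so they are not retried indefinitely.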
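The Part 2 hint also suggests separate high-priority and bulk queues so an OTP never waits behind a digest blast. The sketch below shows one way a single worker could poll two lanes; `PriorityLanes` and `bulk_every` are hypothetical names, and many designs would instead give each lane its own topic and dedicated worker pool.

```python
from collections import deque
from typing import Deque, Optional


class PriorityLanes:
    """Two in-memory lanes standing in for separate transactional and bulk queues."""

    def __init__(self, bulk_every: int = 5) -> None:
        self.transactional: Deque[str] = deque()
        self.bulk: Deque[str] = deque()
        # Serve one bulk message after this many transactional ones,
        # so bulk traffic is delayed but never starved.
        self.bulk_every = bulk_every
        self._since_bulk = 0

    def publish(self, message: str, transactional: bool) -> None:
        lane = self.transactional if transactional else self.bulk
        lane.append(message)

    def next_message(self) -> Optional[str]:
        bulk_turn = not self.transactional or self._since_bulk >= self.bulk_every
        if self.bulk and bulk_turn:
            self._since_bulk = 0
            return self.bulk.popleft()
        if self.transactional:
            self._since_bulk += 1
            return self.transactional.popleft()
        return None
```

With this polling rule, a transactional message published behind any number of queued bulk messages is picked up after at most one bulk message.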
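For Part 3, the three layers from the hint (ingress per caller, per-user/per-channel, and egress per provider) can each be modeled as a token bucket. The sketch below is a single-process illustration with hypothetical names (`TokenBucket`, `LayeredLimiter`) and made-up limits; a production version would keep the counters in a shared store such as Redis so that all workers see the same state.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TokenBucket:
    def __init__(self, rate_per_sec: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity          # start full so short bursts are allowed
        self.updated = clock()

    def refill(self) -> None:
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now


class LayeredLimiter:
    """Checks ingress, per-user/channel, and egress buckets for one send."""

    def __init__(self, limits: Dict[str, Tuple[float, float]],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limits = limits            # layer -> (rate_per_sec, capacity)
        self.clock = clock
        self.buckets: Dict[Tuple[str, str], TokenBucket] = {}

    def _bucket(self, layer: str, key: str) -> TokenBucket:
        if (layer, key) not in self.buckets:
            rate, capacity = self.limits[layer]
            self.buckets[(layer, key)] = TokenBucket(rate, capacity, self.clock)
        return self.buckets[(layer, key)]

    def allow(self, caller: str, user: str, channel: str) -> Tuple[bool, Optional[str]]:
        """Return (allowed, rejecting_layer). Tokens are spent only if every layer passes."""
        checks = [
            ("ingress", self._bucket("ingress", caller)),
            ("user", self._bucket("user", f"{user}:{channel}")),
            ("egress", self._bucket("egress", channel)),
        ]
        for layer, bucket in checks:
            bucket.refill()
            if bucket.tokens < 1.0:
                return False, layer
        for _, bucket in checks:
            bucket.tokens -= 1.0
        return True, None
```

Returning the rejecting layer lets the caller react differently per case, for instance rejecting the API call for an ingress violation, dropping or deferring for a per-user limit, and leaving the message queued when only the provider-facing egress bucket is empty.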


Jun 9, 2026, 12:00 AM

