Multi-Channel Notification Systems
Asked of: Software Engineer

What's being tested
This probes whether you can design a reliable distributed notification platform that fans out messages across push, SMS, email, and in-app channels without spamming users or losing critical alerts. DoorDash cares because notifications sit on high-value workflows: order status, Dasher assignment, delivery issues, promotions, support escalations, and operational alerts. The interviewer is looking for API design, queue-based architecture, delivery guarantees, idempotency, rate limiting, user preferences, provider failures, observability, and graceful degradation. A strong answer separates the product event from channel delivery and makes tradeoffs explicit instead of promising “exactly once” delivery everywhere.
Core knowledge
- Functional requirements should distinguish notification types: transactional order updates, time-sensitive alerts, marketing messages, and internal operational alerts. Each class has different latency, retry, opt-out, and compliance behavior; for example, “order delivered” may target sub-second push, while marketing email can tolerate minutes.
- Non-functional requirements should be quantified early: peak events/sec, recipients per event, target p99 latency, retention period, and acceptable duplicate rate. A simple sizing sketch: if DoorDash emits 20k notification events/sec and each fans out to 1.5 channels on average, downstream workers process 20k × 1.5 = 30k channel jobs/sec before retries.
- API design usually starts with POST /notifications accepting event_type, recipient_id or audience, template_id, idempotency_key, priority, metadata, and optional scheduled_at. Keep the API asynchronous: return 202 Accepted with notification_id, then expose GET /notifications/{id} for status.
- Data model should separate notification intent from delivery attempt. A notifications table stores the logical message; notification_deliveries stores one row per channel attempt (PUSH, SMS, EMAIL) with status, provider response, retry count, and timestamps. This prevents one failed SMS from corrupting the whole notification state.
- Message queues are the central scaling primitive. Use Kafka, Amazon SQS, RabbitMQ, or similar to decouple producers from channel workers. A common flow is API service → validation/preferences → durable event topic → fanout service → per-channel queues → provider adapters.
- Delivery guarantees are typically at-least-once, not exactly-once. Workers may retry after timeout, provider ambiguity, or crash, so consumers must be idempotent. Use a unique idempotency_key, a dedupe table, or Redis SETNX with TTL (in current Redis, SET with the NX and EX options does both atomically) to suppress duplicate logical sends.
- Ordering matters only for some categories. Order state notifications like “picked up” before “delivered” may need per-order ordering by partitioning on order_id in Kafka. Global ordering is expensive and usually unnecessary; prefer local ordering where user experience depends on it.
- Retry strategy should combine exponential backoff, jitter, max attempts, and a dead-letter queue. Example: retry on a growing delay schedule, stop after 5 attempts for push/email, and route exhausted jobs to the DLQ for inspection. Avoid retry storms when providers degrade.
- Rate limiting must exist at multiple layers: per-user anti-spam limits, per-tenant limits, provider quota limits, and global system protection. Token bucket is a standard choice: tokens refill at a fixed rate (tokens/sec) up to a burst capacity, and each send spends one. For SMS providers, enforce strict provider-specific throughput.
- Preferences and compliance are first-class backend concerns. Store channel opt-ins, quiet hours, locale, device tokens, unsubscribed categories, and legal constraints. Transactional messages may bypass some marketing preferences, but the system should encode this explicitly rather than relying on caller judgment.
- Provider abstraction prevents vendor lock-in and isolates failures. Channel adapters wrap APNs, Firebase Cloud Messaging, Twilio, SendGrid, or internal email services behind a common interface: send(message) -> provider_message_id/status. Still preserve provider-specific error codes for debugging and retry classification.
- Observability needs metrics at every stage: accepted requests, queue lag, fanout rate, provider success rate, retry count, duplicate suppressions, p50/p95/p99 latency, and DLQ volume. Add structured logs with notification_id, delivery_id, recipient_id, channel, and provider_message_id for traceability.
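The token bucket named under rate limiting can be sketched in a few lines. This is a minimal single-process illustration, not a production limiter: the class name TokenBucket, its parameters, and the injectable clock are assumptions made for the example, and a shared limiter would need its state in a store that all workers can reach.

```python
import time
from typing import Callable


class TokenBucket:
    """Single-process token bucket; names and parameters are illustrative."""

    def __init__(self, refill_rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.refill_rate = refill_rate  # tokens added per second
        self.capacity = capacity        # maximum burst size
        self.tokens = capacity          # start full so an initial burst is allowed
        self.clock = clock              # injectable so behaviour can be tested
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Consume `cost` tokens if available; otherwise reject the send."""
        now = self.clock()
        elapsed = now - self.last
        self.last = now
        # Refill for the time that has passed, never beyond the burst capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

One bucket per user, per tenant, and per provider gives the layered limits described above; a rejected send can be delayed or dropped depending on its category.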
Worked example
For “Design a multi-channel notification system,” a strong candidate starts by asking: “Are we supporting transactional, marketing, and alert notifications? What channels are required? What scale and latency targets should I design for? Do we need user preferences and scheduled sends?” Then declare assumptions: multi-tenant service, push/email/SMS/in-app, at-least-once delivery, 50k channel sends/sec peak, and transactional notifications prioritized over marketing.
Organize the answer around four pillars: ingestion API, durable fanout pipeline, channel delivery workers, and control-plane services like templates, preferences, rate limits, and observability. The core architecture could be: the Notification API validates the request, writes a notification record, and publishes to Kafka; a fanout worker resolves recipients, preferences, templates, and channels, then emits delivery jobs to per-channel queues. Channel workers call providers such as FCM, APNs, Twilio, and SendGrid, persist delivery attempts, and retry transient failures with backoff.
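The fanout step in this pipeline can be sketched as a pure function. This is a minimal in-memory illustration: Notification, DeliveryJob, and fan_out are hypothetical names, preferences arrive as a plain dict, and the returned dict stands in for per-channel queues that a real system would publish to.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class Notification:
    notification_id: str
    recipient_id: str
    event_type: str
    requested_channels: Tuple[str, ...]


@dataclass(frozen=True)
class DeliveryJob:
    delivery_id: str
    notification_id: str
    recipient_id: str
    channel: str


def fan_out(n: Notification,
            opted_in: Dict[str, Set[str]]) -> Dict[str, List[DeliveryJob]]:
    """Return one delivery job per allowed channel, grouped by channel queue."""
    allowed = opted_in.get(n.recipient_id, set())
    queues: Dict[str, List[DeliveryJob]] = {}
    for channel in n.requested_channels:
        if channel not in allowed:
            continue  # respect opt-outs before any provider is called
        job = DeliveryJob(
            # Deterministic ID, so a replayed fanout yields the same job identity.
            delivery_id=f"{n.notification_id}:{channel}",
            notification_id=n.notification_id,
            recipient_id=n.recipient_id,
            channel=channel,
        )
        queues.setdefault(channel, []).append(job)
    return queues
```

Keeping the logical notification and the per-channel jobs as separate records mirrors the notifications / notification_deliveries split: one channel can fail or retry without touching the others.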
One important tradeoff to flag is latency versus preference/template correctness. You can precompute user channel preferences for speed, but then opt-out changes may be stale; for critical compliance-sensitive channels, read the latest preference or use a cache with short TTL and invalidation. Close by saying: “If I had more time, I’d go deeper on multi-region failover, scheduled notifications, DLQ replay tooling, and how to test provider outages without sending real messages.”
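The short-TTL cache with invalidation can be sketched as a small read-through wrapper. This is an illustration under assumptions: PreferenceCache and its load callback are hypothetical, the cache is per-process, and a multi-node deployment would need invalidation to reach every node.

```python
import time
from typing import Callable, Dict, Set, Tuple


class PreferenceCache:
    """Read-through cache with a short TTL and explicit invalidation."""

    def __init__(self, load: Callable[[str], Set[str]], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._load = load            # hypothetical loader hitting the source of truth
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Set[str]]] = {}

    def get(self, user_id: str) -> Set[str]:
        """Return the user's opted-in channels, reloading once the TTL expires."""
        entry = self._entries.get(user_id)
        now = self._clock()
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]          # fresh enough: skip the store read
        channels = self._load(user_id)
        self._entries[user_id] = (now, channels)
        return channels

    def invalidate(self, user_id: str) -> None:
        """Call on opt-out so the next send sees the change immediately."""
        self._entries.pop(user_id, None)
```

The TTL bounds how stale a preference can be if an invalidation is missed, which is the number to quote when the interviewer asks how long an opted-out user might still receive messages.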
A second angle
For “Design an alert notification system,” the same primitives apply, but the constraints shift toward urgency, escalation, and reliability under incident conditions. Instead of marketing-style fanout, you may need priority queues, dedupe windows, on-call schedules, escalation policies, and acknowledgement tracking. The design should support “notify primary engineer by push/SMS, wait 5 minutes, escalate to secondary if unacknowledged,” which makes workflow state more important than template management. Rate limiting is still needed, but critical alerts may bypass quiet hours while suppressing repeated alerts for the same incident using an incident_id dedupe key. The interviewer may push on how the system behaves when a provider is down; a strong answer routes to alternate channels and exposes alert delivery health as its own monitored dependency.
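The escalation rule and the incident_id dedupe can be sketched together as a small state machine. This is an in-memory illustration with assumed names (Incident, Escalator, tick); a real system would persist incident state and drive tick from a scheduler or delayed queue.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Incident:
    incident_id: str
    opened_at: float
    acknowledged: bool = False
    notified: List[str] = field(default_factory=list)


class Escalator:
    """Tracks incidents and decides who to page next (illustrative names)."""

    def __init__(self, chain: List[str], wait_seconds: float) -> None:
        self.chain = chain                # e.g. ["primary", "secondary"]
        self.wait_seconds = wait_seconds  # delay between escalation levels
        self.incidents: Dict[str, Incident] = {}

    def raise_alert(self, incident_id: str, now: float) -> Optional[str]:
        """Page the first responder once; repeats for the same incident are suppressed."""
        if incident_id in self.incidents:
            return None
        incident = Incident(incident_id, opened_at=now)
        self.incidents[incident_id] = incident
        incident.notified.append(self.chain[0])
        return self.chain[0]

    def acknowledge(self, incident_id: str) -> None:
        self.incidents[incident_id].acknowledged = True

    def tick(self, incident_id: str, now: float) -> Optional[str]:
        """Return the next responder to page if the wait elapsed without an ack."""
        incident = self.incidents[incident_id]
        if incident.acknowledged:
            return None
        level = len(incident.notified)
        if level >= len(self.chain):
            return None                   # chain exhausted
        if now - incident.opened_at < level * self.wait_seconds:
            return None                   # still inside the wait window
        incident.notified.append(self.chain[level])
        return self.chain[level]
```

With wait_seconds=300 this reproduces “notify primary, wait 5 minutes, escalate to secondary if unacknowledged”; the acknowledged flag is the workflow state that a template-driven marketing pipeline never needs.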
Common pitfalls
Pitfall: Promising exactly-once delivery for external channels.
SMS, email, and push providers do not give true end-to-end exactly-once semantics, and network timeouts can leave delivery status ambiguous. A better answer is at-least-once processing with idempotency at the notification/job level, dedupe windows, provider message IDs, and user-visible tolerance for rare duplicates.
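Idempotency at the job level can be sketched with a set-if-absent store and a dedupe window. This is an in-memory stand-in, not a Redis client: DedupeStore, claim, and send_once are hypothetical names, and in a real deployment the claim would be a single atomic operation in a shared store.

```python
import time
from typing import Callable, Dict


class DedupeStore:
    """In-memory stand-in for an atomic set-if-absent with expiry."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._expires_at: Dict[str, float] = {}
        self._clock = clock

    def claim(self, key: str, ttl_seconds: float) -> bool:
        """Return True only for the first caller within the TTL window."""
        now = self._clock()
        expiry = self._expires_at.get(key)
        if expiry is not None and expiry > now:
            return False
        self._expires_at[key] = now + ttl_seconds
        return True


def send_once(store: DedupeStore, idempotency_key: str, channel: str,
              send: Callable[[], None], ttl_seconds: float = 3600.0) -> bool:
    """Call `send` unless this key/channel pair was already claimed."""
    if not store.claim(f"{idempotency_key}:{channel}", ttl_seconds):
        return False  # duplicate suppressed
    send()
    return True
```

Claiming before sending suppresses duplicates but can lose a message if the worker crashes between the claim and the send; claiming after sending makes the opposite trade. Saying which one you chose, and for which notification class, is the explicit tradeoff the interviewer is looking for.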
Pitfall: Treating every notification as the same priority.
A design that puts order-critical updates, coupons, and internal alerts through one FIFO queue will fail during spikes. Separate by priority and category, reserve capacity for transactional notifications, and allow marketing traffic to be delayed or dropped under load.
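Separating traffic by category can be sketched with one queue per class and a bounded marketing queue. The category names, their order, and the PriorityDispatcher class are assumptions for the example; strict priority is the simplest policy to show and can starve the lowest class, so a real scheduler might use weighted shares instead.

```python
from collections import deque
from typing import Deque, Dict, List

# Assumed categories, highest priority first.
CATEGORIES = ("transactional", "alert", "marketing")


class PriorityDispatcher:
    """One queue per category; marketing is bounded and shed under load."""

    def __init__(self, marketing_limit: int) -> None:
        self.queues: Dict[str, Deque[str]] = {c: deque() for c in CATEGORIES}
        self.marketing_limit = marketing_limit
        self.dropped = 0  # count of shed marketing jobs, worth exporting as a metric

    def enqueue(self, category: str, job: str) -> bool:
        """Accept the job, or drop it if it is marketing and that queue is full."""
        if (category == "marketing"
                and len(self.queues["marketing"]) >= self.marketing_limit):
            self.dropped += 1
            return False
        self.queues[category].append(job)
        return True

    def drain(self, budget: int) -> List[str]:
        """Take up to `budget` jobs, always serving higher categories first."""
        out: List[str] = []
        for category in CATEGORIES:
            queue = self.queues[category]
            while queue and len(out) < budget:
                out.append(queue.popleft())
        return out
```

During a spike the budget is spent on transactional and alert jobs first, so marketing is delayed or dropped rather than blocking order updates behind it.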
Pitfall: Spending all the time on boxes and arrows but not on failure modes.
Interviewers expect you to discuss provider outages, poison messages, duplicate sends, stale device tokens, queue backlog, and retry storms. Land better by walking one failure path end to end: provider returns 429, worker classifies it as retryable, token bucket reduces send rate, jobs back off with jitter, and persistent failures go to DLQ.
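That failure path can be sketched end to end, leaving out the token-bucket throttle. The status-code classification, the function names, and the injected sleep are assumptions for the example; a real worker would usually re-enqueue the job with a delay instead of sleeping in place.

```python
import random
from typing import Callable, List

# Assumed classification: these provider statuses are treated as transient.
RETRYABLE_STATUS = {429, 500, 502, 503, 504}


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Full-jitter exponential backoff: uniform in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))


def deliver(job: str, send: Callable[[str], int], dlq: List[str],
            max_attempts: int = 5,
            sleep: Callable[[float], None] = lambda seconds: None) -> bool:
    """Try a send, retrying transient failures; park the job in the DLQ on failure."""
    for attempt in range(max_attempts):
        status = send(job)
        if 200 <= status < 300:
            return True
        if status not in RETRYABLE_STATUS:
            break  # permanent failure: retrying would not help
        if attempt + 1 < max_attempts:
            sleep(backoff_delay(attempt))  # jitter spreads retries out
    dlq.append(job)
    return False
```

Jitter matters because many workers hitting the same degraded provider would otherwise retry in lockstep, which is the retry storm the pitfall warns about.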
Connections
This topic often pivots into distributed queues, idempotency, rate limiting, workflow orchestration, and multi-tenant service design. You may also be asked to extend the design with a cron scheduler for delayed campaigns, a template service with localization, or a real-time WebSocket/in-app notification feed.
Further reading
- Designing Data-Intensive Applications: excellent grounding for queues, logs, replication, idempotency, and reliability tradeoffs.
- Stripe API Idempotency: practical reference for using idempotency keys in externally visible APIs.
- The Tail at Scale: useful for reasoning about p99 latency, retries, hedging, and distributed-service behavior.
Practice questions
- Design an alert notification system · DoorDash · Software Engineer · Onsite · easy
- Design a multi-channel notification system · DoorDash · Software Engineer · Take-home Project · medium
- Design a notification system · DoorDash · Software Engineer · Technical Screen · hard
- Design cron scheduler and reward/review system · DoorDash · Software Engineer · Onsite · hard
- Design notification and project architecture · DoorDash · Software Engineer · Technical Screen · hard
Related concepts
- Multi-Channel Notifications And Watchlists · System Design
- Real-Time Messaging And Collaboration Systems · System Design
- Messaging, Event Pipelines, and Delivery Semantics · System Design
- Slack-Like Messaging Systems · System Design
- Notifications And Lifecycle Engagement
- Notifications And Push Notification Analytics · Analytics & Experimentation