Multi-Channel Notification Systems

What's being tested

This probes whether you can design a reliable distributed notification platform that fans out messages across push, SMS, email, and in-app channels without spamming users or losing critical alerts. DoorDash cares because notifications sit on high-value workflows: order status, Dasher assignment, delivery issues, promotions, support escalations, and operational alerts. The interviewer is looking for API design, queue-based architecture, delivery guarantees, idempotency, rate limiting, user preferences, provider failures, observability, and graceful degradation. A strong answer separates the product event from channel delivery and makes tradeoffs explicit instead of promising “exactly once” delivery everywhere.

Core knowledge

Functional requirements should distinguish notification types: transactional order updates, time-sensitive alerts, marketing messages, and internal operational alerts. Each class has different latency, retry, opt-out, and compliance behavior; for example, “order delivered” may target sub-second push, while marketing email can tolerate minutes.
Non-functional requirements should be quantified early: peak events/sec, recipients per event, target p99 latency, retention period, and acceptable duplicate rate. A simple sizing sketch is: if DoorDash emits 20k notification events/sec and each fans out to 1.5 channels on average, downstream workers process 30k channel jobs/sec before retries.
API design usually starts with POST /notifications accepting event_type, recipient_id or audience, template_id, idempotency_key, priority, metadata, and optional scheduled_at. Keep the API asynchronous: return 202 Accepted with notification_id, then expose GET /notifications/{id} for status.
Data model should separate notification intent from delivery attempt. A notifications table stores the logical message; notification_deliveries stores channel-specific attempts such as PUSH, SMS, EMAIL, status, provider response, retry count, and timestamps. This prevents one failed SMS from corrupting the whole notification state.
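The split between intent and attempts can be sketched with two record types. This is a minimal illustration, not a schema from the source: the class names, field names, and string status values are assumptions, and timestamps are omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Delivery:
    """One channel-specific attempt record (hypothetical shape)."""
    delivery_id: str
    channel: str                      # e.g. "PUSH", "SMS", "EMAIL"
    status: str = "PENDING"           # assumed values: PENDING, SENT, FAILED
    retry_count: int = 0
    provider_response: Optional[str] = None


@dataclass
class Notification:
    """The logical message, independent of how each channel fares."""
    notification_id: str
    recipient_id: str
    template_id: str
    idempotency_key: str
    deliveries: List[Delivery] = field(default_factory=list)

    def status_by_channel(self) -> Dict[str, str]:
        # Each channel keeps its own status, so one failure stays local.
        return {d.channel: d.status for d in self.deliveries}
```

Because status lives on each delivery row, marking the SMS attempt as failed leaves the push attempt and the parent notification untouched.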
Message queues are the central scaling primitive. Use Kafka, Amazon SQS, RabbitMQ, or similar to decouple producers from channel workers. A common flow is API service → validation/preferences → durable event topic → fanout service → per-channel queues → provider adapters.
Delivery guarantees are typically at-least-once, not exactly-once. Workers may retry after timeout, provider ambiguity, or crash, so consumers must be idempotent. Use a unique idempotency_key, dedupe table, or Redis SETNX with TTL to suppress duplicate logical sends.
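Duplicate suppression can be sketched as a claim-then-send check. The store below is an in-memory stand-in that mimics "set if absent, with a TTL"; it is not a Redis client, and the names `InMemoryDedupeStore`, `claim`, and `send_once` are hypothetical.

```python
import time


class InMemoryDedupeStore:
    """Hypothetical stand-in for a set-if-absent store with expiry."""

    def __init__(self, clock=time.monotonic):
        self._expires_at = {}
        self._clock = clock

    def claim(self, key, ttl_seconds):
        """Return True only for the first caller to claim key within the TTL."""
        now = self._clock()
        expiry = self._expires_at.get(key)
        if expiry is not None and expiry > now:
            return False  # duplicate inside the dedupe window
        self._expires_at[key] = now + ttl_seconds
        return True


def send_once(store, idempotency_key, channel, send_fn, ttl_seconds=86400):
    """Send only if this (idempotency_key, channel) pair is unclaimed."""
    if not store.claim(f"{idempotency_key}:{channel}", ttl_seconds):
        return "suppressed"
    send_fn()
    return "sent"
```

One design caveat: claiming before sending means a crash between the claim and the send drops the message for the length of the TTL, so a production design might record the claim together with the delivery outcome instead.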
Ordering matters only for some categories. Order state notifications like “picked up” before “delivered” may need per-order ordering by partitioning on order_id in Kafka. Global ordering is expensive and usually unnecessary; prefer local ordering where user experience depends on it.
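Per-order ordering rests on a deterministic key-to-partition mapping. The sketch below shows the idea only; `partition_for` is a hypothetical helper, and Kafka clients apply their own key hashing rather than SHA-256.

```python
import hashlib


def partition_for(order_id: str, num_partitions: int) -> int:
    """Map an order_id to a stable partition so its events stay in order."""
    digest = hashlib.sha256(order_id.encode("utf-8")).digest()
    # Use the first 8 bytes as an integer; any stable hash would do here.
    return int.from_bytes(digest[:8], "big") % num_partitions
```

Every event for the same order lands on the same partition and is consumed in order, while different orders spread across partitions and carry no ordering relationship with each other.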
Retry strategy should combine exponential backoff, jitter, max attempts, and a dead-letter queue. Example: retry after $\min(\text{base} \times 2^n + \text{jitter}, \text{maxDelay})$, stop after 5 attempts for push/email, and route to DLQ for inspection. Avoid retry storms when providers degrade.
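The delay formula translates directly into code. This is a sketch under assumed defaults: the 1-second base, 60-second cap, and 1-second maximum jitter are illustrative values, and the function names are hypothetical.

```python
import random


def retry_delay(attempt, base=1.0, max_delay=60.0, max_jitter=1.0, rng=random.random):
    """Seconds to wait before retry number `attempt` (0-based)."""
    jitter = rng() * max_jitter  # random spread so workers do not retry in lockstep
    return min(base * (2 ** attempt) + jitter, max_delay)


def retry_schedule(max_attempts=5, **kwargs):
    """Waits between sends: max_attempts sends imply max_attempts - 1 waits."""
    return [retry_delay(n, **kwargs) for n in range(max_attempts - 1)]
```

With jitter disabled, five attempts produce waits of 1, 2, 4, and 8 seconds; after the fifth failed send the job would go to the DLQ rather than back onto the retry queue.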
Rate limiting must exist at multiple layers: per-user anti-spam limits, per-tenant limits, provider quota limits, and global system protection. Token bucket is a standard choice: refill rate $r$ tokens/sec, capacity $b$ burst tokens. For SMS providers, enforce strict provider-specific throughput.
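A token bucket with refill rate $r$ and capacity $b$ fits in a few lines. The sketch below is single-process and takes the current time as an argument to keep it deterministic; the class and method names are assumptions, and a shared limiter would need atomic updates in a store such as Redis.

```python
class TokenBucket:
    """Token bucket: refills at `rate` tokens/sec up to `capacity` burst tokens."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full so an initial burst is allowed
        self.last = now

    def allow(self, now, cost=1.0):
        """Return True and spend `cost` tokens if enough are available."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

For example, a bucket with `rate=2.0` and `capacity=3.0` admits a burst of three sends, rejects the fourth, and admits one more after half a second. One bucket per user, per tenant, and per provider gives the layered limits described above.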
Preferences and compliance are first-class backend concerns. Store channel opt-ins, quiet hours, locale, device tokens, unsubscribed categories, and legal constraints. Transactional messages may bypass some marketing preferences, but the system should encode this explicitly rather than relying on caller judgment.
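Encoding the bypass rules explicitly might look like the sketch below. The policy table, category names, default quiet hours, and the rule that channel opt-in is always respected are assumptions for illustration, not requirements stated by the source.

```python
from dataclasses import dataclass, field


@dataclass
class Preferences:
    opted_in_channels: set
    unsubscribed_categories: set = field(default_factory=set)
    quiet_start_hour: int = 22  # local time, hypothetical default
    quiet_end_hour: int = 7


# Explicit per-category policy, so callers cannot decide bypasses ad hoc.
CATEGORY_POLICY = {
    "transactional": {"respects_unsubscribe": False, "respects_quiet_hours": False},
    "marketing": {"respects_unsubscribe": True, "respects_quiet_hours": True},
}


def in_quiet_hours(prefs, local_hour):
    start, end = prefs.quiet_start_hour, prefs.quiet_end_hour
    if start == end:
        return False
    if start < end:
        return start <= local_hour < end
    return local_hour >= start or local_hour < end  # window wraps past midnight


def resolve_channels(prefs, category, requested, local_hour):
    """Return the requested channels that policy and preferences allow."""
    policy = CATEGORY_POLICY[category]
    if policy["respects_unsubscribe"] and category in prefs.unsubscribed_categories:
        return []
    if policy["respects_quiet_hours"] and in_quiet_hours(prefs, local_hour):
        return []
    # Channel opt-in is always honored in this sketch, even for transactional sends.
    return [c for c in requested if c in prefs.opted_in_channels]
```

Keeping the policy in one table means a new category cannot be sent until someone decides, in one reviewable place, which preferences it respects.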
Provider abstraction prevents vendor lock-in and isolates failures. Channel adapters wrap APNs, Firebase Cloud Messaging, Twilio, SendGrid, or internal email services behind a common interface: send(message) -> provider_message_id/status. Still preserve provider-specific error codes for debugging and retry classification.
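The common interface can be sketched with a structural type and a test double. `SendResult`, `ChannelAdapter`, `FakeSmsAdapter`, and the status strings are hypothetical names; a real adapter would call the vendor's SDK or HTTP API inside `send`.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass(frozen=True)
class SendResult:
    status: str                                # assumed: "sent", "retryable_error", "permanent_error"
    provider_message_id: Optional[str] = None
    provider_error_code: Optional[str] = None  # raw provider code, kept for debugging


class ChannelAdapter(Protocol):
    channel: str

    def send(self, message: dict) -> SendResult: ...


class FakeSmsAdapter:
    """Test double that never contacts a real SMS provider."""
    channel = "SMS"

    def __init__(self):
        self._count = 0

    def send(self, message: dict) -> SendResult:
        if not message.get("to"):
            return SendResult("permanent_error", provider_error_code="missing_recipient")
        self._count += 1
        return SendResult("sent", provider_message_id=f"fake-sms-{self._count}")


def deliver(adapter: ChannelAdapter, message: dict) -> SendResult:
    # Workers depend only on the interface, so providers can be swapped.
    return adapter.send(message)
```

A fake adapter like this is also one way to exercise retry and outage handling without sending real messages.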
Observability needs metrics at every stage: accepted requests, queue lag, fanout rate, provider success rate, retry count, duplicate suppressions, p50/p95/p99 latency, and DLQ volume. Add structured logs with notification_id, delivery_id, recipient_id, channel, and provider_message_id for traceability.

Worked example

For the prompt “Design a multi-channel notification system,” a strong candidate starts by asking: “Are we supporting transactional, marketing, and alert notifications? What channels are required? What scale and latency targets should I design for? Do we need user preferences and scheduled sends?” Then declare assumptions: multi-tenant service, push/email/SMS/in-app, at-least-once delivery, 50k channel sends/sec peak, and transactional notifications prioritized over marketing.

Organize the answer around four pillars: ingestion API, durable fanout pipeline, channel delivery workers, and control-plane services like templates, preferences, rate limits, and observability. The core architecture could be: the Notification API validates the request, writes a notification record, and publishes to Kafka; a fanout worker resolves recipients, preferences, templates, and channels, then emits delivery jobs to per-channel queues. Channel workers call providers such as FCM, APNs, Twilio, and SendGrid, persist delivery attempts, and retry transient failures with backoff.

One important tradeoff to flag is latency versus preference/template correctness. You can precompute user channel preferences for speed, but then opt-out changes may be stale; for critical compliance-sensitive channels, read the latest preference or use a cache with short TTL and invalidation. Close by saying: “If I had more time, I’d go deeper on multi-region failover, scheduled notifications, DLQ replay tooling, and how to test provider outages without sending real messages.”

A second angle

For the prompt “Design an alert notification system,” the same primitives apply, but the constraints shift toward urgency, escalation, and reliability under incident conditions. Instead of marketing-style fanout, you may need priority queues, dedupe windows, on-call schedules, escalation policies, and acknowledgement tracking. The design should support “notify primary engineer by push/SMS, wait 5 minutes, escalate to secondary if unacknowledged,” which makes workflow state more important than template management. Rate limiting is still needed, but critical alerts may bypass quiet hours while suppressing repeated alerts for the same incident using an incident_id dedupe key. The interviewer may push on how the system behaves when a provider is down; a strong answer routes to alternate channels and exposes alert delivery health as its own monitored dependency.
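The escalation rule quoted above can be modeled as a small piece of workflow state. This is a sketch under assumptions: `EscalationStep` and `current_step` are hypothetical names, and staying on the last step once the policy is exhausted is one possible choice among several.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class EscalationStep:
    target: str
    channels: Tuple[str, ...]
    wait_seconds: int  # how long to wait for an acknowledgement before escalating


def current_step(steps: Sequence[EscalationStep], elapsed_seconds: float,
                 acknowledged: bool) -> Optional[EscalationStep]:
    """Return the step that should be active, or None once acknowledged."""
    if acknowledged or not steps:
        return None
    deadline = 0
    for step in steps:
        deadline += step.wait_seconds
        if elapsed_seconds < deadline:
            return step
    return steps[-1]  # policy exhausted: keep paging the final target
```

With a primary step and a secondary step of 300 seconds each, an unacknowledged alert targets the primary until the five-minute mark and the secondary afterwards, and an acknowledgement at any point stops further paging.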

Common pitfalls

Pitfall: Promising exactly-once delivery for external channels.

SMS, email, and push providers do not give true end-to-end exactly-once semantics, and network timeouts can leave delivery status ambiguous. A better answer is at-least-once processing with idempotency at the notification/job level, dedupe windows, provider message IDs, and user-visible tolerance for rare duplicates.

Pitfall: Treating every notification as the same priority.

A design that puts order-critical updates, coupons, and internal alerts through one FIFO queue will fail during spikes. Separate by priority and category, reserve capacity for transactional notifications, and allow marketing traffic to be delayed or dropped under load.

Pitfall: Spending all the time on boxes and arrows but not on failure modes.

Interviewers expect you to discuss provider outages, poison messages, duplicate sends, stale device tokens, queue backlog, and retry storms. The answer lands better if you walk one failure path end to end: provider returns 429, worker classifies it as retryable, token bucket reduces send rate, jobs back off with jitter, and persistent failures go to DLQ.
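The classification step in that failure path can be sketched as below. Treating HTTP 429 and 5xx responses as retryable and other 4xx responses as permanent is a common convention assumed here; real providers often need per-error-code rules, and the function names are hypothetical.

```python
def classify(status_code: int) -> str:
    """Map a provider HTTP status to an outcome class (assumed convention)."""
    if 200 <= status_code < 300:
        return "success"
    if status_code == 429 or 500 <= status_code < 600:
        return "retryable"
    return "permanent"


def next_step(status_code: int, attempts_made: int, max_attempts: int = 5) -> str:
    """Decide what the worker does after an attempt (attempts_made is 1-based)."""
    outcome = classify(status_code)
    if outcome == "success":
        return "done"
    if outcome == "permanent":
        return "fail"   # e.g. an invalid recipient: retrying will not help
    if attempts_made >= max_attempts:
        return "dlq"    # persistent transient failure: park for inspection
    return "retry"      # schedule again with backoff and jitter
```

A 429 on the first attempt returns "retry", while the same response on the fifth attempt returns "dlq".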

Connections

This topic often pivots into distributed queues, idempotency, rate limiting, workflow orchestration, and multi-tenant service design. You may also be asked to extend the design with a cron scheduler for delayed campaigns, a template service with localization, or a real-time WebSocket/in-app notification feed.
