Multi-Channel Notifications And Watchlists

What's being tested

These prompts test whether you can design a high-scale distributed notification workflow that turns product events into reliable, personalized deliveries across email, push, SMS, and in-app surfaces. For Airbnb, this matters because notifications often sit on the critical path of booking, host response, fraud, cancellations, price drops, and availability alerts; missed, duplicated, or late messages directly harm trust. The interviewer is probing for system decomposition, event-driven architecture, data modeling, idempotency, rate limiting, preference handling, retry semantics, and operational thinking around `p99` latency and provider failures. Strong answers balance correctness and reliability without over-engineering every component in pursuit of an exactly-once fantasy.

Core knowledge

Event-driven architecture is the default shape: product services emit events such as `ReservationConfirmed`, `ListingAvailable`, or `MessageReceived` to a durable log like `Kafka`, `Pulsar`, or `Kinesis`; notification workers consume, fan out, template, and deliver. This decouples product write paths from slow external providers.
Notification intent should be modeled separately from delivery attempt. An intent says “user `U` should receive notification `N` for event `E`”; attempts track channel-specific sends like email, push, and SMS. This separation makes retries, auditing, and multi-channel fallback much easier.
Idempotency is non-negotiable. Use a deterministic key such as `hash(user_id, event_id, notification_type, channel)` and enforce uniqueness in `Postgres`, `DynamoDB`, or `Redis` before sending. Stripe-style `Idempotency-Key` semantics prevent duplicate messages when consumers retry after timeouts.
At-least-once delivery is realistic; exactly-once delivery is usually an illusion built from at-least-once delivery plus deduplication at the consumer or presentation layer. Message brokers and workers may redeliver, so design consumers to be idempotent. Even when the broker guarantees ordering, that guarantee holds only within a partition (and therefore per partition key), not globally.
Fanout strategy depends on scale. For small audiences, perform fanout-on-write: create delivery rows immediately. For huge campaigns or broad watchlist matches, use batch workers or topic-based fanout. A rough capacity estimate is $QPS = \frac{\text{daily notifications}}{86{,}400} \times \text{peak factor}$ where peak factor is often `5x` to `20x`.
User preferences and policy checks should be central, not scattered across product teams. Store per-user, per-channel, per-notification-type settings: `marketing_email=false`, `booking_sms=true`, `quiet_hours=22:00-08:00`, `locale=en-US`. Transactional notifications may bypass some marketing preferences but still need channel validity and legal constraints.
Rate limiting applies at multiple levels: per user, per notification type, per channel, and per provider. Use token buckets in `Redis` for “no more than 3 promotional pushes per day” and provider-level throttles for `SendGrid`, `Twilio`, `APNs`, or `FCM` quotas. Decide whether excess traffic is dropped, delayed, or downgraded.
Retry policy should distinguish transient from permanent failures. HTTP `429`, `500`, and network timeouts get exponential backoff with jitter; invalid phone numbers, hard email bounces, and revoked push tokens should mark the channel unavailable instead of being retried. After a bounded number of attempts, move the message to a dead-letter queue.
Template rendering needs versioning and localization. Store templates by `template_id`, `version`, `locale`, and channel, with required variables validated before enqueueing. Rendering can happen before queueing for auditability or at send time for fresher data; the tradeoff is reproducibility versus freshness.
Watchlist matching often involves date ranges and availability intervals. A rental watchlist might store `(user_id, location_id, start_date, end_date, guests, price_max)`. Matching can be implemented with inverted indexes by location/date bucket, `Postgres` `GiST` indexes over range types, `Elasticsearch` filters, or precomputed daily buckets when query patterns are simple.
Time zones and DST are common traps. Store instants in UTC, but interpret user-facing dates and quiet hours in the listing or user time zone. A “check-in date” is not the same as a UTC timestamp; date-range availability should use local calendar semantics to avoid off-by-one errors near DST transitions.
Observability should cover the full funnel: `events_received`, `intents_created`, `eligible_after_preferences`, `queued`, `sent_to_provider`, `provider_accepted`, `delivered`, `opened`, `failed`, and `deduped`. Track `p50/p95/p99` enqueue-to-send latency, DLQ size, retry counts, and provider error rates by channel.
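
The per-user rate limit described above can be sketched as a token bucket. This is a minimal in-process illustration, not a definitive implementation: the names `TokenBucket` and `NotificationRateLimiter` are hypothetical, a plain dictionary stands in for the `Redis` state a production system would use, and the caller passes `now` explicitly so the behavior is deterministic.

```python
from typing import Dict, Tuple


class TokenBucket:
    """Token bucket with explicit timestamps (seconds) supplied by the caller."""

    def __init__(self, capacity: float, refill_per_second: float, now: float) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = capacity      # start full so a new user can burst
        self.updated_at = now

    def try_acquire(self, now: float, cost: float = 1.0) -> bool:
        # Refill in proportion to elapsed time, never above capacity.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class NotificationRateLimiter:
    """One bucket per (user_id, notification_type, channel); hypothetical helper."""

    def __init__(self, capacity: float, refill_per_second: float) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self._buckets: Dict[Tuple[str, str, str], TokenBucket] = {}

    def allow(self, user_id: str, notification_type: str, channel: str, now: float) -> bool:
        key = (user_id, notification_type, channel)
        bucket = self._buckets.get(key)
        if bucket is None:
            bucket = TokenBucket(self.capacity, self.refill_per_second, now)
            self._buckets[key] = bucket
        return bucket.try_acquire(now)


# Roughly "3 promotional pushes per day": burst of 3, refilled at 3 tokens per 86,400 s.
limiter = NotificationRateLimiter(capacity=3, refill_per_second=3 / 86_400)
```

A token bucket smooths traffic and does not strictly cap a calendar day: a user who spends the initial burst can accrue up to three more tokens over the following 24 hours. If the product requirement is a hard daily cap, a fixed-window counter is the closer fit. Whether a rejected send is dropped, delayed, or downgraded to another channel remains a separate policy decision.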

Worked example

For the prompt “Design rental watchlist and notification system,” a strong candidate starts by asking clarifying questions: are users watching exact listings or flexible criteria, how fresh must alerts be, how many active watchlists exist, and are notifications transactional or promotional? Then they declare assumptions, for example: 50 million active watchlists, availability changes are event-driven, users can specify location, dates, guests, price, and preferred channels, and alerts should usually arrive within a few minutes.

The answer can be organized around four pillars: data model, matching pipeline, notification pipeline, and correctness/operations. The data model includes `Watchlist(user_id, criteria, date_range, timezone, status)` and an availability source emitting `ListingAvailabilityChanged` events. The matching pipeline consumes availability changes, looks up candidate watchlists using location and date indexes, filters by guests/price/rules, and emits `WatchlistMatched` intents. The notification pipeline checks preferences, deduplicates by `(watchlist_id, listing_id, available_date_range, notification_type)`, renders templates, and sends through push/email/SMS workers.
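
The matching and dedupe steps above can be sketched as follows. This is a minimal sketch under stated assumptions: the field names on `Watchlist` and `AvailabilityChange` are illustrative, a dictionary keyed by `location_id` stands in for the location index, an in-memory set stands in for a durable dedupe store, and a match is assumed to mean that the available window fully covers the requested stay.

```python
import hashlib
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class Watchlist:
    watchlist_id: str
    user_id: str
    location_id: str
    start_date: date   # first night wanted (inclusive)
    end_date: date     # checkout date (exclusive)
    guests: int
    price_max: float


@dataclass(frozen=True)
class AvailabilityChange:
    listing_id: str
    location_id: str
    available_from: date   # inclusive
    available_to: date     # exclusive
    max_guests: int
    nightly_price: float


def matches(w: Watchlist, change: AvailabilityChange) -> bool:
    # Assumption: the listing must be free for the entire requested stay.
    return (
        w.location_id == change.location_id
        and change.available_from <= w.start_date
        and w.end_date <= change.available_to
        and w.guests <= change.max_guests
        and change.nightly_price <= w.price_max
    )


def dedupe_key(w: Watchlist, change: AvailabilityChange, notification_type: str) -> str:
    # Mirrors (watchlist_id, listing_id, available_date_range, notification_type).
    raw = "|".join([
        w.watchlist_id,
        change.listing_id,
        change.available_from.isoformat(),
        change.available_to.isoformat(),
        notification_type,
    ])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def match_change(
    index: Dict[str, List[Watchlist]],
    change: AvailabilityChange,
    seen_keys: Set[str],
    notification_type: str = "watchlist_match",
) -> List[Tuple[str, str, str, str]]:
    """Return (user_id, watchlist_id, listing_id, dedupe_key) for each new match."""
    intents = []
    for w in index.get(change.location_id, []):   # candidate lookup by location
        if not matches(w, change):
            continue
        key = dedupe_key(w, change, notification_type)
        if key in seen_keys:                      # already emitted for this window
            continue
        seen_keys.add(key)
        intents.append((w.user_id, w.watchlist_id, change.listing_id, key))
    return intents
```

Because the dedupe key is derived only from the event's content, replaying the same `ListingAvailabilityChanged` event from the log produces the same key and emits no second intent.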

A key tradeoff to flag is event-driven matching versus periodic batch scans. Event-driven matching gives lower latency and lower average work, but it is more sensitive to missed events and requires replay/backfill from the event log; batch scans are simpler and safer for reconciliation but can be expensive and less timely. A polished close would say: “If I had more time, I’d add replay tooling, DLQ reprocessing, multi-region failover, and a reconciliation job that verifies watchlists against current availability to catch missed events.”

A second angle

For the prompt “Design a multi-channel notification system,” the watchlist-specific matching layer becomes less important, and the generic delivery platform becomes the center of the design. The core abstraction is a `NotificationRequest` API used by many product services, with fields such as `recipient_id`, `notification_type`, `event_id`, `priority`, `template_variables`, and `allowed_channels`. The interviewer is likely to push harder on channel fallback, provider integrations, preference enforcement, and global rate limits. You can reuse the same principles: durable queues, idempotent consumers, centralized preferences, per-channel delivery attempts, retries with DLQs, and full-funnel observability. The main constraint shift is from “how do we find who should be notified?” to “how do we reliably deliver many different notification types without spamming or duplicating users?”
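
One way to picture the `NotificationRequest` abstraction and channel fallback is the sketch below. The field list comes from the paragraph above; everything else is an assumption for illustration, including the field types, the `DeliveryAttempt` record, the `deliver_with_fallback` helper, and the convention that `allowed_channels` is ordered by preference and that a sender raises an exception on failure.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Set


@dataclass(frozen=True)
class NotificationRequest:
    recipient_id: str
    notification_type: str
    event_id: str
    priority: str
    template_variables: Dict[str, str]
    allowed_channels: List[str]   # assumed to be ordered by preference


@dataclass
class DeliveryAttempt:
    channel: str
    accepted: bool
    error: Optional[str] = None


# A sender wraps one provider integration and raises on failure (assumption).
Sender = Callable[[NotificationRequest], None]


def deliver_with_fallback(
    request: NotificationRequest,
    senders: Dict[str, Sender],
    user_enabled_channels: Set[str],
) -> List[DeliveryAttempt]:
    """Try channels in order; stop at the first one a provider accepts."""
    attempts: List[DeliveryAttempt] = []
    for channel in request.allowed_channels:
        # Preference enforcement: skip channels the user has disabled.
        if channel not in user_enabled_channels or channel not in senders:
            continue
        try:
            senders[channel](request)
        except Exception as exc:
            attempts.append(DeliveryAttempt(channel, False, str(exc)))
            continue   # fall back to the next allowed channel
        attempts.append(DeliveryAttempt(channel, True))
        break
    return attempts
```

Returning every attempt, not only the final outcome, keeps the per-channel audit trail that the intent/attempt split is meant to provide.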

Common pitfalls

Pitfall: Treating notifications as a synchronous API call from the product service.

A tempting answer is “booking service calls email/SMS providers directly after a booking.” That couples user-facing latency and availability to third-party providers, makes retries unsafe, and spreads preference logic across services. A better answer persists an intent, publishes to a durable queue, and lets dedicated workers handle delivery asynchronously.
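
Once delivery runs in dedicated workers, the retry decision can live in one place. The sketch below shows one possible shape for it: transient failures get exponential backoff with full jitter, permanent failures stop immediately, and exhausted retries go to a dead-letter queue. The constants, the function names, and the inclusion of `502`, `503`, and `504` alongside `429` and `500` are assumptions made for illustration.

```python
import random
from typing import Optional, Tuple

TRANSIENT_STATUS = {429, 500, 502, 503, 504}   # assumed set of retryable codes
MAX_ATTEMPTS = 5        # hypothetical bound before dead-lettering
BASE_DELAY_S = 1.0
MAX_DELAY_S = 300.0


def backoff_delay(attempt: int, rng: random.Random) -> float:
    """Full jitter: uniform between 0 and min(cap, base * 2**attempt) seconds."""
    ceiling = min(MAX_DELAY_S, BASE_DELAY_S * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def next_action(status: Optional[int], attempt: int, rng: random.Random) -> Tuple[str, float]:
    """Decide what to do after a send.

    status  -- provider HTTP status, or None for a network timeout
    attempt -- number of attempts made so far (1 for the first send)
    Returns (action, delay_seconds).
    """
    if status is not None and 200 <= status < 300:
        return ("done", 0.0)
    transient = status is None or status in TRANSIENT_STATUS
    if not transient:
        # For example an invalid phone number: retrying cannot succeed.
        return ("permanent_failure", 0.0)
    if attempt >= MAX_ATTEMPTS:
        return ("dead_letter", 0.0)
    return ("retry", backoff_delay(attempt, rng))
```

Jitter matters most during a provider outage: without it, every worker that failed at the same moment retries at the same moment, and the recovering provider sees a synchronized spike.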

Pitfall: Claiming exactly-once delivery without explaining idempotency.

Interviewers often probe this by asking what happens if a worker sends an SMS but crashes before committing the offset. The correct framing is that external side effects are not truly exactly-once; you approximate correctness with idempotency keys, delivery records, provider request IDs when supported, and dedupe windows.
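
The crash scenario above can be made concrete with a delivery record guarded by a uniqueness constraint. This is a minimal sketch, assuming an in-memory `sqlite3` table as a stand-in for `Postgres` or `DynamoDB`; the table name `delivery_attempts` and the helpers `idempotency_key` and `send_once` are hypothetical.

```python
import hashlib
import sqlite3
from typing import Callable


def idempotency_key(user_id: str, event_id: str, notification_type: str, channel: str) -> str:
    # Deterministic: the same logical send always maps to the same key.
    raw = "|".join([user_id, event_id, notification_type, channel])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def open_store() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE delivery_attempts ("
        " idempotency_key TEXT PRIMARY KEY,"
        " status TEXT NOT NULL)"
    )
    return conn


def send_once(conn: sqlite3.Connection, key: str, send: Callable[[str], None]) -> bool:
    """Claim the key, then send. Returns True only if this call performed the send."""
    try:
        with conn:   # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO delivery_attempts (idempotency_key, status) VALUES (?, 'claimed')",
                (key,),
            )
    except sqlite3.IntegrityError:
        return False   # a previous attempt already claimed this key
    # Pass the key to the provider as its request ID when the provider supports one.
    send(key)
    with conn:
        conn.execute(
            "UPDATE delivery_attempts SET status = 'sent' WHERE idempotency_key = ?",
            (key,),
        )
    return True
```

If the worker crashes after `send` but before committing its queue offset, the redelivered message hits the existing row and is skipped. The remaining gap is a crash between the claim and the send, which leaves a `claimed` row with no message delivered; closing it requires a reconciliation pass over stale claims or a provider that itself deduplicates on the request ID.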

Pitfall: Skipping edge cases around user experience and correctness.

A shallow design may cover `Kafka` and workers but ignore quiet hours, unsubscribes, invalid push tokens, DST, duplicate watchlist matches, provider throttling, and delayed retries. The stronger answer explicitly separates eligibility, dedupe, send attempts, and user-visible audit trails, then names the failure modes each layer handles.
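
Quiet hours and DST are among the easiest of these edge cases to get wrong, so a small sketch may help. It assumes a fixed quiet window of 22:00 to 08:00 in the user's time zone and uses Python's standard `zoneinfo` module, which needs time zone data to be available on the system; the function names are hypothetical.

```python
from datetime import datetime, time, timedelta, timezone
from zoneinfo import ZoneInfo

QUIET_START = time(22, 0)   # assumed user setting
QUIET_END = time(8, 0)


def in_quiet_hours(instant_utc: datetime, tz_name: str,
                   start: time = QUIET_START, end: time = QUIET_END) -> bool:
    """True if the UTC instant falls inside the user's local quiet window."""
    local = instant_utc.astimezone(ZoneInfo(tz_name)).time()
    if start <= end:
        return start <= local < end
    return local >= start or local < end   # window wraps past midnight


def next_send_time(instant_utc: datetime, tz_name: str,
                   start: time = QUIET_START, end: time = QUIET_END) -> datetime:
    """Return the instant itself, or the next local end of quiet hours, in UTC."""
    if not in_quiet_hours(instant_utc, tz_name, start, end):
        return instant_utc
    local = instant_utc.astimezone(ZoneInfo(tz_name))
    # Work in local wall-clock time so that "08:00" stays 08:00 across DST changes.
    candidate = local.replace(hour=end.hour, minute=end.minute, second=0, microsecond=0)
    if candidate <= local:
        candidate += timedelta(days=1)
    return candidate.astimezone(timezone.utc)
```

The design choice is to compute the wake-up time on the local calendar and convert to UTC last. Adding a fixed number of hours to the UTC instant would be off by an hour on the nights when the clocks change.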

Connections

Interviewers may pivot from here into rate limiter design, feed fanout, event sourcing, distributed job queues, or calendar availability modeling. For Airbnb-style systems, expect follow-ups on multi-region reliability, `Kafka` partitioning strategy, API idempotency, and how to debug a spike in duplicate or delayed notifications.
