Design a real-time messenger
Company: Meta
Role: Software Engineer
Category: System Design
Difficulty: Hard
Interview Round: Onsite
##### Question
Design a real-time messaging system (a Messenger/WhatsApp-style product) that supports **1:1 and group chats**. Walk through requirements, the API surface, the data model, storage and protocol choices, ordering and delivery guarantees, multi-device sync, and the operational concerns of running it at Meta scale.
Address the following:
1. **Requirements.** Clarify functional requirements (1:1 and group chat, multi-device sync, presence/typing indicators, read receipts, media attachments, message search, push notifications) and non-functional requirements (scale, latency, availability, consistency, cost trade-offs).
2. **Scale & latency targets.** Design for **100M MAU, 10M DAU, ~2M concurrent connections per region**, and a **p99 send-to-receive latency under 150 ms** intra-region. Estimate message volume and storage from these numbers.
3. **APIs.** Define the API surface, e.g. `SendMessage`, `Sync`, `Ack` (delivered/read), `FetchHistory`, and `SubscribePresence`. Specify request/response schemas and which run over WebSocket (data plane) vs REST/gRPC (control plane).
4. **Message IDs & ordering.** Choose message identifiers (client message ID for idempotency, server sequence number for canonical order) and define what ordering you guarantee within a conversation and across conversations.
5. **Delivery semantics.** Specify the delivery guarantee (at-least-once transport with server-side deduplication) and explain how you achieve an exactly-once user experience (no duplicate bubbles, no lost messages).
6. **Real-time fanout.** Propose the gateway / WebSocket layer and the pub/sub fanout to recipients, including backpressure handling and large-group fanout.
7. **Persistence, indexing & cold storage.** Choose the message store, search index, media store, and how you tier hot vs cold (archival) storage.
8. **Partitioning & sharding.** Choose shard keys (e.g. by conversation vs by user) and justify the trade-offs for ordering, fanout, and per-user mailboxes.
9. **Multi-device sync & offline.** Handle per-device cursors, reconnect/backfill, gap and out-of-order handling, and offline/slow-network scenarios.
10. **Presence, typing & read receipts.** Design ephemeral presence/typing state and per-user read watermarks, including multi-device aggregation.
11. **Media handling.** Cover pre-signed uploads, transcoding/thumbnails, CDN delivery, and resumable/chunked uploads.
12. **Search.** Describe server-side indexing with ACL filtering, and explain how search degrades under end-to-end encryption.
13. **End-to-end encryption.** Discuss E2EE options (Double Ratchet for 1:1, MLS for groups), what the server can and cannot do under E2EE, and key management.
14. **Abuse, spam & safety.** Cover rate limits/quotas, spam scoring, blocklists/reporting, and attachment scanning.
15. **Retention, TTL & GDPR.** Define retention policies, disappearing messages, hard-delete / right-to-be-forgotten handling, and data residency.
16. **Multi-region & failover.** Describe active-active deployment, conversation home-region affinity, cross-region replication, and failover behavior.
17. **Observability & cost.** Identify the metrics, tracing, logging, and SLOs you would track, and the major cost trade-offs.
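The estimate requested in item 2 can be roughed out as follows. This is only a back-of-envelope sketch: the DAU figure comes from the prompt, while messages per user per day, average stored message size, peak-to-average ratio, and replication factor are illustrative assumptions that a candidate would state and adjust in the interview.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass(frozen=True)
class Assumptions:
    dau: int = 10_000_000             # from the prompt
    msgs_per_user_per_day: int = 40   # assumption, not given in the prompt
    avg_msg_bytes: int = 1_000        # assumption: body + metadata + index overhead
    peak_to_avg: float = 3.0          # assumption: diurnal peak vs daily average
    replication_factor: int = 3       # assumption: copies kept by the message store


def estimate(a: Assumptions) -> dict:
    """Derive message volume and text-message storage from the assumptions."""
    msgs_per_day = a.dau * a.msgs_per_user_per_day
    avg_send_qps = msgs_per_day / SECONDS_PER_DAY
    raw_bytes_per_day = msgs_per_day * a.avg_msg_bytes
    replicated_bytes_per_day = raw_bytes_per_day * a.replication_factor
    return {
        "msgs_per_day": msgs_per_day,
        "avg_send_qps": avg_send_qps,
        "peak_send_qps": avg_send_qps * a.peak_to_avg,
        "raw_gb_per_day": raw_bytes_per_day / 1e9,
        "replicated_tb_per_year": replicated_bytes_per_day * 365 / 1e12,
    }


if __name__ == "__main__":
    for name, value in estimate(Assumptions()).items():
        print(f"{name}: {value:,.0f}")
```

Under these assumed inputs the text-message path is on the order of a few thousand sends per second on average; media bytes and group fanout writes are excluded and would need their own estimates.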
Quick Answer: This Meta software-engineer system-design onsite question asks you to design a real-time messenger supporting 1:1 and group chats at 100M MAU with p99 send-to-receive latency under 150 ms intra-region. It covers APIs, message ordering and idempotency, delivery semantics, multi-device sync, WebSocket fanout, storage/search, E2EE, abuse controls, retention/GDPR, multi-region failover, and observability.
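One way to picture the message ID, delivery, and per-device cursor items is the in-memory sketch below. It assumes a single writer per conversation (for example, the conversation's shard owner) that assigns dense sequence numbers, and it deduplicates retries on the pair (sender ID, client message ID). The class and method names (`ConversationLog`, `DeviceCursor`, `send`, `sync`, `apply`) are hypothetical and chosen for illustration; a real system would back this with a durable, replicated store.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class StoredMessage:
    seq: int              # server-assigned, canonical order within the conversation
    sender_id: str
    client_msg_id: str    # client-generated, used only for idempotency
    body: str


@dataclass
class ConversationLog:
    """Hypothetical single-writer log for one conversation."""
    next_seq: int = 1
    messages: list = field(default_factory=list)
    seen: dict = field(default_factory=dict)  # (sender_id, client_msg_id) -> StoredMessage

    def send(self, sender_id: str, client_msg_id: str, body: str) -> StoredMessage:
        key = (sender_id, client_msg_id)
        existing = self.seen.get(key)
        if existing is not None:
            # Retry of an already-accepted send: return the original, assign no new seq.
            return existing
        msg = StoredMessage(self.next_seq, sender_id, client_msg_id, body)
        self.next_seq += 1
        self.messages.append(msg)
        self.seen[key] = msg
        return msg

    def sync(self, after_seq: int, limit: int = 100) -> list:
        # seq starts at 1 and is dense, so the message with seq s sits at index s - 1.
        return self.messages[after_seq:after_seq + limit]


@dataclass
class DeviceCursor:
    """Per-device position; makes at-least-once delivery look exactly-once to the user."""
    last_seq: int = 0

    def apply(self, batch: list) -> list:
        fresh = []
        for msg in batch:
            if msg.seq <= self.last_seq:
                continue          # duplicate delivery: already rendered
            if msg.seq != self.last_seq + 1:
                break             # gap: caller should re-sync from last_seq
            fresh.append(msg)
            self.last_seq = msg.seq
        return fresh
```

In this sketch a device that reconnects calls `sync(cursor.last_seq)` and feeds the result to `apply`, so a redelivered batch renders nothing new and a gap triggers a backfill rather than out-of-order display.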