Real-Time Messaging And Collaboration Systems

What's being tested

These interviews test whether you can design a low-latency, durable, multi-user collaboration system under realistic constraints: real-time delivery, message history, presence, notifications, search, access control, and tenant isolation. The interviewer is probing how you decompose a Slack-like product into APIs, storage models, fanout paths, realtime connections, and failure-handling semantics, not whether you can name-drop every distributed-systems component. OpenAI cares because many software systems require reliable collaboration, streaming UX, authorization, auditability, and graceful degradation at scale. A strong answer makes tradeoffs explicit: where you choose consistency versus availability, when to fan out on write versus read, and how you preserve user trust when networks, clients, or servers fail.

Core knowledge

Core entities usually include Tenant, User, Workspace, Channel, Membership, Message, Thread, Reaction, Attachment, and ReadReceipt. Model access through membership rows rather than embedding user lists in channels; this supports large channels, private channels, audits, and role changes.
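Message ordering should be scoped, not global. Most systems need a stable order per channel or thread, often via (channel_id, sequence_number) or (channel_id, created_at, message_id). The timestamp variant depends on reasonably synchronized clocks, so a server-assigned sequence is the safer choice when strict per-channel order matters. Avoid promising total global ordering; it is expensive and unnecessary for collaboration UX.
Realtime delivery is commonly handled over WebSockets or Server-Sent Events (SSE) between clients and a connection gateway. WebSockets are bidirectional; SSE is server-to-client only, so with SSE the client sends messages over ordinary HTTP requests. The gateway should be stateless or lightly stateful, storing connection mappings like user_id -> connection_ids in Redis or an internal presence service.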
Durability-first write path usually looks like: authenticate request, authorize channel membership, validate idempotency key, persist message to primary store, publish event to log, then deliver to online recipients. Persist before fanout so a gateway crash does not create “ghost messages” seen by some users but absent from history.
Fanout-on-write pushes each message to recipient inboxes or active connections at send time; it gives fast reads but can be expensive for very large channels. Fanout-on-read stores the message once per channel and clients fetch unread messages; it is cheaper for huge channels but increases read latency and query load. Many designs use a hybrid: fan out on write for small, active channels and fetch on read for large or inactive ones.
Back-of-envelope capacity should drive architecture. If 10M daily users send 50 messages/day, that is 500M messages/day, or about 5,800 writes/sec average; with 10x peak, design for ~60k writes/sec. If average fanout is 20 recipients, delivery events can exceed 1M/sec at peak.
Storage choices should reflect access patterns. Postgres can work for smaller systems or strong relational constraints, but high-scale message history often uses wide-column or log-structured stores like Cassandra, DynamoDB, or sharded MySQL. Common primary query: “latest N messages for channel ordered descending,” so partition by channel_id and bucket by time if channels can become hot.
Hot channels need special handling. A single channel_id partition can overload if a large public channel has high write volume. Mitigations include partitioning by (channel_id, time_bucket), splitting delivery fanout across workers, caching recent messages in Redis, and applying rate limits for abusive integrations or bots.
At-least-once delivery is the practical default for realtime systems. Clients and servers should deduplicate using a stable message_id or client-provided idempotency_key. Exactly-once end-to-end delivery across mobile networks, gateways, queues, and databases is usually not worth promising; aim for durable storage plus idempotent retries.
Presence is ephemeral and should not be stored like messages. Use heartbeats, leases, and TTLs: a user is online if a connection heartbeat was observed within, for example, 30–60 seconds. Presence should tolerate false positives/negatives and degrade independently from message send/read paths.
Notifications are a separate asynchronous path. After message persistence, emit events to workers that compute mentions, mute settings, device tokens, and push/email policies. Do not block message send on APNs, FCM, or email provider latency; retries and dead-letter queues belong in the notification pipeline.
Security and tenancy must be designed into every data access path. Include tenant-scoped IDs, authorization checks on send/read/search, encryption in transit via TLS, encryption at rest, audit logs for admin actions, and data-retention policies. Multi-tenant systems also need noisy-neighbor isolation through quotas, per-tenant rate limits, and shard placement controls.
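
The back-of-envelope capacity figures in the list above can be reproduced with a few lines of arithmetic. This is a minimal sketch; the function name capacity_estimate and its parameter names are illustrative, and the inputs are the ones stated in the capacity item (10M daily users, 50 messages/day, 10x peak, fanout of 20).

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def capacity_estimate(daily_users, msgs_per_user_per_day, peak_factor, avg_fanout):
    """Return average writes/sec, peak writes/sec, and peak delivery events/sec."""
    messages_per_day = daily_users * msgs_per_user_per_day
    avg_writes = messages_per_day / SECONDS_PER_DAY
    peak_writes = avg_writes * peak_factor
    # Each persisted message produces one delivery event per recipient.
    peak_deliveries = peak_writes * avg_fanout
    return {
        "messages_per_day": messages_per_day,
        "avg_writes_per_sec": avg_writes,
        "peak_writes_per_sec": peak_writes,
        "peak_deliveries_per_sec": peak_deliveries,
    }


if __name__ == "__main__":
    est = capacity_estimate(10_000_000, 50, peak_factor=10, avg_fanout=20)
    for name, value in est.items():
        print(f"{name}: {value:,.0f}")
```

Changing one input at a time (for example, doubling the fanout) is a quick way to show an interviewer which assumption dominates the delivery load.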

Worked example

For the prompt “Design a Slack-like messaging platform,” start by clarifying scope in the first 30 seconds: “Are we designing one workspace product with channels, DMs, threads, search, presence, and notifications? What scale should I assume: 10M daily users, 100k concurrent connections, or larger? Do we need enterprise compliance features in v1?” Then state assumptions: text messages first, attachments via object storage, per-channel ordering, at-least-once delivery, and high availability over strict global consistency.

Organize the answer around five pillars: API surface, data model, write/read path, realtime delivery, and operational/security concerns. For APIs, sketch endpoints like POST /channels/{id}/messages, GET /channels/{id}/messages?before=..., POST /channels/{id}/join, and a WebSocket subscription protocol. For data, explain Message(channel_id, message_id, sender_id, sequence, body, created_at) plus Membership(channel_id, user_id, role) and indexes for recent history. For realtime, place a connection gateway behind a load balancer, publish persisted message events to Kafka or a similar log, and have fanout workers deliver to gateway nodes holding recipient connections.
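
The data model and the "recent history" query from the paragraph above can be sketched as follows. The field names follow the Message and Membership tuples given there; the helper latest_messages and its before_sequence cursor are hypothetical names for illustration, and an in-memory list stands in for the real store.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Message:
    channel_id: str
    message_id: str
    sender_id: str
    sequence: int  # per-channel order, assigned by the server
    body: str
    created_at: datetime


@dataclass(frozen=True)
class Membership:
    channel_id: str
    user_id: str
    role: str


def latest_messages(messages, channel_id, limit, before_sequence=None):
    """Newest-first page of one channel's history, optionally before a cursor."""
    rows = [
        m for m in messages
        if m.channel_id == channel_id
        and (before_sequence is None or m.sequence < before_sequence)
    ]
    rows.sort(key=lambda m: m.sequence, reverse=True)
    return rows[:limit]


if __name__ == "__main__":
    now = datetime(2024, 1, 1, tzinfo=timezone.utc)
    store = [Message("c1", f"m{i}", "u1", i, f"hello {i}", now) for i in range(1, 6)]
    page = latest_messages(store, "c1", limit=2)
    older = latest_messages(store, "c1", limit=2, before_sequence=page[-1].sequence)
    print([m.sequence for m in page], [m.sequence for m in older])
```

In a real store the same access pattern would map to a key or index on (channel_id, sequence) so that each page is a bounded range scan rather than a filter over all rows.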

A concrete tradeoff to flag is fanout strategy: for normal channels, fanout-on-write gives low latency; for very large announcement channels, store once and have clients pull or use partitioned broadcast workers. Close by saying that with more time you would deepen search indexing with Elasticsearch or OpenSearch, enterprise retention/eDiscovery, mobile offline sync, and chaos testing for gateway or broker failures.
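
The fanout tradeoff above can be expressed as a small routing decision. This is a sketch under stated assumptions: the threshold of 1,000 members and the names LARGE_CHANNEL_THRESHOLD and plan_delivery are illustrative, not values from any particular system.

```python
LARGE_CHANNEL_THRESHOLD = 1000  # assumed cutoff; tune from measured fanout cost


def plan_delivery(channel_members, online_users, threshold=LARGE_CHANNEL_THRESHOLD):
    """Decide how a persisted message reaches recipients.

    Small channels: push to every online member (fanout-on-write).
    Large channels: store once and let clients pull (fanout-on-read).
    """
    if len(channel_members) > threshold:
        return {"strategy": "fanout_on_read", "push_to": []}
    push_to = sorted(u for u in channel_members if u in online_users)
    return {"strategy": "fanout_on_write", "push_to": push_to}


if __name__ == "__main__":
    team = {"ana", "bo", "cy"}
    print(plan_delivery(team, online_users={"ana", "cy"}))

    announcements = {f"user{i}" for i in range(5000)}
    print(plan_delivery(announcements, online_users={"user1"})["strategy"])
```

A production version would likely also consider channel activity and might send large channels a lightweight "new messages available" signal instead of nothing, but the core decision is the same size-based branch.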

A second angle

For the prompt “Design a multi-tenant Slack-like messenger,” the same architecture applies, but the interviewer expects stronger emphasis on tenant isolation, authorization, and operational controls. Every table and cache key should include tenant_id, and every API should authorize both identity and workspace membership before reading or writing messages. Sharding can be by tenant for isolation, by channel for scale, or a hybrid where large tenants get dedicated shards while small tenants share pooled infrastructure. The harder tradeoff is not just latency; it is preventing noisy neighbors, supporting per-tenant retention/export policies, and ensuring search indexes do not leak documents across tenants. This framing rewards candidates who treat multi-tenancy as a first-class invariant rather than a field added at the end.
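
One common way to limit noisy neighbors is a per-tenant token bucket. The sketch below is an assumption about how such a limit could look, not a prescribed design; the class name TenantRateLimiter and its parameters are hypothetical, and the clock is injectable so the behavior can be tested deterministically.

```python
import time


class TenantRateLimiter:
    """Token bucket per tenant: `rate_per_sec` sustained, `burst` maximum."""

    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self.rate = rate_per_sec
        self.burst = burst
        self.clock = clock
        self._buckets = {}  # tenant_id -> (tokens, last_refill_time)

    def allow(self, tenant_id):
        now = self.clock()
        # A tenant seen for the first time starts with a full bucket.
        tokens, last = self._buckets.get(tenant_id, (float(self.burst), now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self._buckets[tenant_id] = (tokens - 1, now)
            return True
        self._buckets[tenant_id] = (tokens, now)
        return False


if __name__ == "__main__":
    limiter = TenantRateLimiter(rate_per_sec=5, burst=10)
    accepted = sum(limiter.allow("tenant-a") for _ in range(50))
    print("tenant-a accepted:", accepted)
    print("tenant-b still allowed:", limiter.allow("tenant-b"))
```

Because each tenant has its own bucket, one tenant exhausting its quota does not reduce what other tenants can send. In a multi-node deployment the bucket state would need to live in shared storage or be enforced per shard.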

Common pitfalls

Pitfall: Designing only the happy-path WebSocket flow.

A tempting answer is “client sends message over WebSocket, server broadcasts to channel.” That misses durability, offline users, retries, message history, and notification paths. A stronger answer persists the message first, publishes an event, delivers to active connections, and lets reconnecting clients fetch missed messages by cursor.
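
The stronger flow described above (persist, then deliver, then let reconnecting clients catch up by cursor) can be sketched with in-memory stand-ins. InMemoryChat, send, and fetch_since are hypothetical names; a dict of lists replaces the durable store and plain callbacks replace gateway connections.

```python
from collections import defaultdict


class InMemoryChat:
    def __init__(self):
        self.history = defaultdict(list)      # channel_id -> ordered messages (store stand-in)
        self.connections = defaultdict(list)  # channel_id -> callbacks for online clients

    def send(self, channel_id, sender_id, body):
        sequence = len(self.history[channel_id]) + 1
        message = {
            "channel_id": channel_id,
            "sequence": sequence,
            "sender_id": sender_id,
            "body": body,
        }
        # 1. Persist first, so history is the source of truth.
        self.history[channel_id].append(message)
        # 2. Best-effort push to whoever is connected right now.
        for deliver in list(self.connections[channel_id]):
            try:
                deliver(message)
            except Exception:
                pass  # a failed push is repaired by the client's cursor fetch
        return message

    def fetch_since(self, channel_id, last_seen_sequence):
        """Messages a reconnecting client missed, in channel order."""
        return [m for m in self.history[channel_id] if m["sequence"] > last_seen_sequence]


if __name__ == "__main__":
    chat = InMemoryChat()
    inbox = []
    chat.connections["general"].append(inbox.append)
    chat.send("general", "ana", "first")
    chat.connections["general"].clear()  # client goes offline
    chat.send("general", "bo", "second")
    chat.send("general", "bo", "third")
    missed = chat.fetch_since("general", last_seen_sequence=inbox[-1]["sequence"])
    print([m["body"] for m in missed])
```

The point of the ordering is that a crash between steps 1 and 2 loses only a push, never a message: the offline client recovers everything from history using the last sequence it saw.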

Pitfall: Overpromising exactly-once delivery and global ordering.

Candidates often say the system guarantees exactly-once messages in order everywhere. In practice, mobile clients reconnect, gateways crash, queues retry, and multi-region replication reorders events. A better stance is per-channel ordering, idempotent sends, stable message IDs, deduplication on clients, and at-least-once delivery with durable history as the source of truth.
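
Idempotent sends and client-side deduplication, as recommended above, might look like the following sketch. The class names IdempotentMessageStore and ClientTimeline, and the choice to scope the idempotency key by (channel_id, sender_id), are assumptions made for illustration.

```python
class IdempotentMessageStore:
    """Server side: a retried send with the same key returns the original message."""

    def __init__(self):
        self._by_key = {}    # (channel_id, sender_id, idempotency_key) -> message
        self._next_seq = {}  # channel_id -> next per-channel sequence

    def send(self, channel_id, sender_id, idempotency_key, body):
        key = (channel_id, sender_id, idempotency_key)
        existing = self._by_key.get(key)
        if existing is not None:
            return existing  # retry detected: no second write
        sequence = self._next_seq.get(channel_id, 1)
        self._next_seq[channel_id] = sequence + 1
        message = {
            "message_id": f"{channel_id}:{sequence}",
            "sequence": sequence,
            "body": body,
        }
        self._by_key[key] = message
        return message


class ClientTimeline:
    """Client side: drop duplicates from at-least-once delivery, keep channel order."""

    def __init__(self):
        self._seen = set()
        self.messages = []

    def receive(self, message):
        if message["message_id"] in self._seen:
            return False  # duplicate delivery, ignore
        self._seen.add(message["message_id"])
        self.messages.append(message)
        self.messages.sort(key=lambda m: m["sequence"])
        return True


if __name__ == "__main__":
    store = IdempotentMessageStore()
    first = store.send("general", "ana", "key-1", "hello")
    retry = store.send("general", "ana", "key-1", "hello")  # e.g. after a timeout
    second = store.send("general", "ana", "key-2", "world")

    timeline = ClientTimeline()
    for delivered in (second, first, retry):  # out of order, with a duplicate
        timeline.receive(delivered)
    print([m["body"] for m in timeline.messages])
```

Together these give the user-visible effect of "exactly once, in order" per channel without the system having to guarantee exactly-once transport.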

Pitfall: Treating multi-tenant security as a final checklist item.

Saying “we’ll add auth and encryption” near the end is too shallow for a collaboration product. Authorization must appear in the data model, API layer, search indexing, caches, notifications, audit logs, and admin tooling. The interviewer wants to see that tenant boundaries are enforced on every read and write path, not just at login.
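
Enforcing the tenant boundary on every path can be illustrated with tenant-scoped cache keys and a search that filters by tenant and membership before matching text. This is a minimal sketch; cache_key, authorize_read, search, and the set-of-tuples membership representation are hypothetical choices for the example.

```python
class AccessDenied(Exception):
    pass


def cache_key(tenant_id, channel_id):
    """Tenant-scoped key so identical channel IDs in two tenants never collide."""
    return f"tenant:{tenant_id}:channel:{channel_id}:recent"


def authorize_read(memberships, tenant_id, user_id, channel_id):
    """Raise unless the user is a member of this channel within this tenant."""
    if (tenant_id, channel_id, user_id) not in memberships:
        raise AccessDenied(f"{user_id} cannot read {channel_id} in {tenant_id}")


def search(index, memberships, tenant_id, user_id, term):
    """Apply tenant and membership filters first, then match the text."""
    results = []
    for doc in index:
        if doc["tenant_id"] != tenant_id:
            continue
        if (tenant_id, doc["channel_id"], user_id) not in memberships:
            continue
        if term.lower() in doc["body"].lower():
            results.append(doc)
    return results


if __name__ == "__main__":
    memberships = {("acme", "general", "ana"), ("globex", "general", "bo")}
    index = [
        {"tenant_id": "acme", "channel_id": "general", "body": "Launch plan draft"},
        {"tenant_id": "globex", "channel_id": "general", "body": "Launch budget"},
    ]
    print(search(index, memberships, "acme", "ana", "launch"))
    print(cache_key("acme", "general"), cache_key("globex", "general"))
```

In a real search engine the same idea is usually applied as a mandatory filter on tenant_id and permitted channels attached to every query by the server, never supplied by the client.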

Connections

Interviewers may pivot from this topic into event-driven architecture, notification systems, search indexing, rate limiting, multi-region replication, or webhook delivery. The same ideas also connect to collaborative editing, where ordering and conflict resolution become stricter and may introduce CRDTs or operational transforms.
