Slack-Like Messaging Systems
Asked of: Software Engineer

What's being tested
Interviewers are probing whether you can design a real-time multi-tenant messaging system with clear data models, APIs, delivery semantics, storage strategy, and failure handling. A strong answer balances low-latency fanout, durable message history, permissions, search, notifications, and operational concerns without overbuilding every subsystem. OpenAI cares because many products involve collaborative, streaming, user-facing systems where correctness, privacy, latency, and graceful degradation all matter. The interviewer is not looking for “use WebSockets and Kafka” as a slogan; they want to see how you reason through tradeoffs like online vs offline delivery, channel fanout, ordering, tenant isolation, and backpressure.
Core knowledge
- Core entities usually include User, Workspace, Channel, Membership, Message, Thread, Reaction, Attachment, and ReadReceipt. Model workspace-scoped IDs and permissions explicitly; multi-tenant systems fail when access checks are treated as an afterthought instead of part of every read/write path.
- API design should separate durable writes from real-time delivery. Typical endpoints: POST /messages, GET /channels/{id}/messages?before=..., POST /channels/{id}/join, and a persistent connection endpoint like /realtime. POST /messages should return after persistence, not after every recipient receives the message.
- Persistent connections are commonly implemented with WebSockets, Server-Sent Events, or long polling. WebSocket supports bidirectional events for typing indicators and presence; SSE is simpler for server-to-client streams. For mobile and unreliable networks, clients need reconnect tokens, heartbeats, and “resume from sequence number.”
- Message durability belongs in a primary store such as DynamoDB, Cassandra, ScyllaDB, MySQL, or Postgres, depending on scale. A common schema is partition by channel_id and sort by message_ts or monotonically increasing message_id. Hot channels can overload a single partition, so consider bucketed partitions like (channel_id, day) or (channel_id, shard).
- Ordering semantics should be stated precisely. Global total ordering is expensive and usually unnecessary; Slack-like systems commonly need per-channel ordering. Use server-assigned Snowflake-style IDs, ULID, or a sequencer per channel. If clients send messages concurrently, show optimistic local rendering but reconcile against server order.
- Fanout strategy depends on channel size. For small channels, fanout-on-write pushes an event to each online member’s connection server and notification pipeline. For very large channels, fanout-on-read or hybrid fanout avoids writing millions of inbox rows. A useful threshold: direct messages and small groups fan out eagerly; channels with 100k+ members need pull-based consumption and pagination.
- Real-time delivery architecture often uses connection gateways plus an internal event bus. A request service persists the message, publishes MessageCreated(channel_id, seq) to Kafka, Pulsar, Redis Streams, or NATS, and gateway servers subscribed to relevant channels deliver to connected clients. Gateways should be stateless except for ephemeral connection mappings.
- Delivery guarantees should be practical: usually at-least-once delivery with client-side de-duplication by message_id. Exactly-once end-to-end is rarely worth claiming. Clients should maintain last_seen_seq per channel and call a history API to fill gaps after reconnect or missed events.
- Presence and typing indicators are ephemeral, not durable messages. Store presence in Redis with TTLs and heartbeat updates, e.g., presence:user_id -> online until t. Avoid writing every typing event to durable storage; throttle events and treat them as best-effort to reduce load.
- Read receipts and unread counts can be modeled as last_read_message_id per (user_id, channel_id). Unread count can be computed as messages after the marker for small channels, but at scale you may maintain counters or approximate badges. Be careful with edits, deletes, hidden messages, and per-user visibility.
- Search indexing is a separate read path. Persist messages first, then asynchronously index into Elasticsearch, OpenSearch, or a dedicated search service. Search documents should include workspace, channel, sender, timestamp, permissions metadata, and tokenized content; results must be filtered by current membership and retention policy.
- Security and compliance include authentication, authorization, tenant isolation, audit logs, encryption, retention, and deletion. Use workspace-scoped authorization checks on every message fetch and publish path. Encrypt in transit with TLS; encrypt at rest with managed keys, and discuss enterprise features like legal holds only at a high level unless prompted.
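To make the bucketed-partition idea from the storage bullet concrete, here is a minimal sketch of a (channel_id, day) partition key and the order in which a before= history page might scan partitions. The function names and the choice of UTC days are assumptions for illustration, not any particular database's API.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterator, Tuple

# Hypothetical key shape: (channel_id, UTC day as YYYY-MM-DD).
PartitionKey = Tuple[str, str]


def bucket_key(channel_id: str, sent_at: datetime) -> PartitionKey:
    """Return the partition a message lands in; sent_at must be timezone-aware."""
    if sent_at.tzinfo is None:
        raise ValueError("sent_at must be timezone-aware")
    day = sent_at.astimezone(timezone.utc).date()
    return (channel_id, day.isoformat())


def buckets_newest_first(
    channel_id: str, before: datetime, max_days: int
) -> Iterator[PartitionKey]:
    """Yield the partitions a history page would scan, newest day first."""
    if before.tzinfo is None:
        raise ValueError("before must be timezone-aware")
    start = before.astimezone(timezone.utc).date()
    for offset in range(max_days):
        yield (channel_id, (start - timedelta(days=offset)).isoformat())
```

Day buckets bound partition size over time but do not help a channel that is hot within a single day; that case is what the (channel_id, shard) variant is for.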
Worked example
For “Design a Slack-like messaging platform”, start by clarifying scope: “Are we designing team chat with workspaces, channels, DMs, message history, search, notifications, and presence? What scale should I assume: 10M daily users, 100k messages/sec peak, and p99 send-to-display under 500ms for online users?” Then declare your assumptions: per-channel ordering is required, offline users can catch up via history, and message persistence is the source of truth. Organize the answer around four pillars: data model, write/read APIs, real-time delivery, and storage/indexing/notifications. For the write path, say the client calls POST /messages, the message service validates membership, assigns message_id and channel sequence, writes to the message store, then publishes an event to an internal bus. For the read path, online clients receive events over WebSocket, while reconnecting clients use GET /messages?after_seq=... to fill gaps. For storage, use a channel-partitioned message table, but call out hot partitions for giant channels and propose bucketing or hybrid fanout. A concrete tradeoff to flag: fanout-on-write gives lower latency for small groups but explodes for large public channels, so use a hybrid strategy based on member count and online subscriber count. Close by saying: “If I had more time, I’d drill into search indexing, retention/deletion semantics, notification ranking, and operational metrics like p99 delivery latency, reconnect gap rate, and message send error rate.”
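The write path described above can be sketched in a few lines. This is an in-memory illustration under stated assumptions: MessageService, its dict-based store, and the publish callback are hypothetical stand-ins for a real message store and event bus, and the per-channel sequence here is only safe in a single process.

```python
import uuid
from collections import defaultdict
from typing import Callable, Dict, List, Set, Tuple


class MessageService:
    """Hypothetical service: validate, sequence, persist, then publish."""

    def __init__(
        self,
        memberships: Set[Tuple[str, str]],   # (user_id, channel_id) pairs
        publish: Callable[[dict], None],     # stand-in for the event bus
    ) -> None:
        self._memberships = memberships
        self._publish = publish
        self._log: Dict[str, List[dict]] = defaultdict(list)

    def _require_member(self, user_id: str, channel_id: str) -> None:
        if (user_id, channel_id) not in self._memberships:
            raise PermissionError(f"{user_id} is not a member of {channel_id}")

    def post_message(self, user_id: str, channel_id: str, text: str) -> dict:
        self._require_member(user_id, channel_id)
        log = self._log[channel_id]
        msg = {
            "message_id": uuid.uuid4().hex,
            "channel_id": channel_id,
            "seq": len(log) + 1,  # a real sequencer needs an atomic counter
            "sender_id": user_id,
            "text": text,
        }
        log.append(msg)  # persistence is the source of truth
        try:
            self._publish(
                {"type": "MessageCreated", "channel_id": channel_id, "seq": msg["seq"]}
            )
        except Exception:
            # Real-time delivery is best-effort; clients fill gaps from history.
            pass
        return msg  # acknowledge after persistence, not after delivery

    def history(self, user_id: str, channel_id: str, after_seq: int = 0) -> List[dict]:
        self._require_member(user_id, channel_id)
        return [m for m in self._log[channel_id] if m["seq"] > after_seq]
```

The point to draw out in an interview is the ordering of steps: the acknowledgement depends only on the durable write, so a bus outage degrades latency for online users without losing messages.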
A second angle
For “Design an AI chatbot with browser storage”, the same messaging concepts apply, but the constraints shift toward client-side state, streaming, and privacy. Instead of multi-user channels and workspace permissions, the core entities are local conversations, messages, model responses, and session metadata stored in browser storage such as IndexedDB. Real-time delivery becomes token streaming from a backend relay using SSE or WebSocket, with the client appending partial assistant messages as chunks arrive. The main design decision is whether conversation history is purely local or synced to a server; browser-only storage improves privacy but complicates cross-device continuity, backup, and quota handling. You should also discuss not exposing provider API keys in the browser, using a stateless relay, and handling refresh/reconnect without duplicating assistant responses.
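One way to handle refresh and reconnect without duplicating assistant output is to index each streamed chunk and have the client ignore indexes it has already applied. The sketch below assumes a hypothetical chunk shape (response_id, index, text, optional done) and a relay that can replay chunks; neither is a real provider's wire format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AssistantMessage:
    """Client-side accumulator for one streamed assistant response."""

    response_id: str
    parts: List[str] = field(default_factory=list)
    done: bool = False

    @property
    def next_index(self) -> int:
        # The index to request when resuming after a reconnect.
        return len(self.parts)

    @property
    def text(self) -> str:
        return "".join(self.parts)

    def apply(self, chunk: dict) -> bool:
        """Append a chunk; return False if it was ignored as stale or foreign."""
        if chunk["response_id"] != self.response_id or self.done:
            return False
        if chunk["index"] < self.next_index:
            return False  # replayed after reconnect, already rendered
        if chunk["index"] > self.next_index:
            raise ValueError(
                f"gap: expected chunk {self.next_index}, got {chunk['index']}"
            )
        self.parts.append(chunk["text"])
        self.done = bool(chunk.get("done", False))
        return True
```

Persisting parts to IndexedDB as they arrive would let a page refresh restore the partial message and resume from next_index, at the cost of more frequent local writes.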
Common pitfalls
Pitfall: Jumping straight to Kafka and WebSockets without defining guarantees.
A weak answer lists technologies before explaining semantics. A better answer says, “We provide durable persistence before acknowledgement, at-least-once real-time delivery, client de-duplication by message_id, and history replay after reconnect,” then chooses tools that support those properties.
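Those guarantees imply specific client behavior, sketched below: de-duplicate by message_id and replay history when a sequence gap appears. ChannelClient and the fetch_history(channel_id, after_seq, up_to_seq) callback are hypothetical names; the sketch assumes per-channel sequences start at 1 and that history returns events in order.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass(frozen=True)
class Event:
    channel_id: str
    seq: int          # per-channel, server-assigned
    message_id: str
    text: str


# Hypothetical history call: events with after_seq < seq <= up_to_seq.
FetchHistory = Callable[[str, int, int], List[Event]]


class ChannelClient:
    def __init__(self, fetch_history: FetchHistory) -> None:
        self._fetch_history = fetch_history
        self.last_seen_seq: Dict[str, int] = {}
        self._seen_ids: Set[str] = set()   # unbounded here; cap it in practice
        self.rendered: List[Event] = []

    def on_event(self, ev: Event) -> None:
        last = self.last_seen_seq.get(ev.channel_id, 0)
        if ev.seq > last + 1:
            # Gap detected: replay the missing range before rendering ev.
            for missed in self._fetch_history(ev.channel_id, last, ev.seq - 1):
                self._apply(missed)
        self._apply(ev)

    def _apply(self, ev: Event) -> None:
        if ev.message_id in self._seen_ids:
            return  # at-least-once delivery means duplicates are expected
        self._seen_ids.add(ev.message_id)
        self.rendered.append(ev)
        last = self.last_seen_seq.get(ev.channel_id, 0)
        self.last_seen_seq[ev.channel_id] = max(last, ev.seq)
```

The sketch renders the triggering event even if the history call returns nothing; a stricter client might hold it back and retry the fetch instead.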
Pitfall: Treating all channels the same size.
Designs that fan out every message to every member work for DMs and small teams but collapse for huge announcement channels. Segment the problem: small channels get eager push, large channels get subscription-based delivery for online users and pull-based history for everyone else.
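A rough sketch of that segmentation follows. The plan_delivery function, the DeliveryPlan shape, and the idea of a separate "subscribed" set (users whose clients currently have the channel open) are assumptions for illustration; the 100,000 default mirrors the threshold mentioned earlier and would be tuned in practice.

```python
from dataclasses import dataclass
from typing import FrozenSet, Set


@dataclass(frozen=True)
class DeliveryPlan:
    strategy: str             # "eager_push" or "subscribe_and_pull"
    push_to: FrozenSet[str]   # users who get a real-time event
    notify: FrozenSet[str]    # users handed to the notification pipeline


def plan_delivery(
    members: Set[str],
    online: Set[str],
    subscribed: Set[str],
    large_channel_threshold: int = 100_000,
) -> DeliveryPlan:
    """Pick a fanout strategy from channel size and who is connected."""
    if len(members) < large_channel_threshold:
        # Small channel: push to every online member, notify the rest.
        online_members = members & online
        return DeliveryPlan(
            "eager_push",
            frozenset(online_members),
            frozenset(members - online_members),
        )
    # Large channel: push only to active subscribers; everyone else pulls
    # history on open. No per-member writes happen in this branch.
    return DeliveryPlan(
        "subscribe_and_pull", frozenset(members & subscribed), frozenset()
    )
```

A production version would likely still notify explicit mentions in large channels; that is left out here to keep the branching visible.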
Pitfall: Ignoring authorization on read paths.
Many candidates remember to check membership on POST /messages but forget search, history pagination, attachments, notifications, and WebSocket subscriptions. Strong answers make authorization a cross-cutting invariant: every event and query is scoped by workspace, channel membership, retention policy, and user visibility.
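Treating authorization as a cross-cutting invariant usually means one check that every read path calls. The sketch below assumes a hypothetical Authorizer backed by in-memory membership maps and shows it applied to search results and real-time subscriptions; retention policy and per-user visibility are omitted for brevity.

```python
from typing import Dict, Iterable, List, Set, Tuple


class Authorizer:
    """Hypothetical single source of truth for read access."""

    def __init__(
        self,
        workspace_members: Dict[str, Set[str]],
        channel_members: Dict[Tuple[str, str], Set[str]],  # (workspace, channel)
    ) -> None:
        self._workspace_members = workspace_members
        self._channel_members = channel_members

    def can_read(self, user_id: str, workspace_id: str, channel_id: str) -> bool:
        in_workspace = user_id in self._workspace_members.get(workspace_id, set())
        in_channel = user_id in self._channel_members.get(
            (workspace_id, channel_id), set()
        )
        return in_workspace and in_channel


def filter_search_hits(
    authz: Authorizer, user_id: str, hits: Iterable[dict]
) -> List[dict]:
    """Drop hits the user cannot read now, even if the index returned them."""
    return [
        h for h in hits
        if authz.can_read(user_id, h["workspace_id"], h["channel_id"])
    ]


def authorize_subscription(
    authz: Authorizer, user_id: str, workspace_id: str, channel_id: str
) -> None:
    """Gate a real-time subscription with the same check as history reads."""
    if not authz.can_read(user_id, workspace_id, channel_id):
        raise PermissionError(f"{user_id} may not subscribe to {channel_id}")
```

Filtering at query time matters because the search index is updated asynchronously and may still hold documents from channels the user has since left.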
Connections
Interviewers may pivot from this into notification systems, search indexing, distributed ID generation, rate limiting, or multi-tenant authorization. They may also ask you to zoom into client behavior: offline sync, local caching, optimistic UI, retry logic, and streaming responses for AI chat interfaces.
Further reading
- The Log: What every software engineer should know about real-time data’s unifying abstraction. Jay Kreps’ classic explanation of logs as the backbone of event-driven systems.
- Designing Data-Intensive Applications. Martin Kleppmann’s book covers replication, partitioning, consistency, and stream processing tradeoffs directly relevant to messaging systems.
- Slack Engineering Blog. Practical posts on operating large-scale collaboration infrastructure, reliability, and client/server performance.