PracHub

Design a Slack-Like Messaging System

Last updated: Jun 17, 2026

Quick Overview

This System Design question evaluates a candidate's ability to architect a scalable, durable real-time messaging system, testing skills in distributed systems, data modeling, API design, persistence versus low-latency delivery, ordering guarantees, missed-message recovery, notifications, and operational scaling.

Company: OpenAI

Role: Software Engineer

Category: System Design

Difficulty: medium

Interview Round: Technical Screen

Hints

  • Where to start (pin down the source of truth): Before sketching boxes, decide one thing: when a send is accepted, what is the authoritative record of it, and is the real-time push part of that record or separate from it? If the socket is just one way a client learns about a message it could also fetch another way, what does that imply about the order in which you persist versus deliver? Get this ordering rule straight and recovery, ordering, and failure handling become consequences of it rather than separate problems.
  • One key, two jobs: Look for a single property you can attach to each message that simultaneously (a) defines order within its conversation and (b) lets a reconnecting client say exactly where it left off. If two senders' clocks disagree, or two messages land at the same instant, can the value you picked still answer both questions unambiguously? Pin down what property the value must have to survive those cases, what scope it should be unique/monotonic over (global, or something narrower), and who is responsible for assigning it, including what that centralization costs.
  • When fanout stops being free: Walk the numbers for one message posted to a 100K-member channel versus a 5-person DM. Naive "deliver to every member the instant it arrives" is fine for one and ruinous for the other. What distinguishes the members you must reach right now from the ones who will find out later anyway? Could you serve those two populations with two different mechanisms, and what dimension would you partition the store and workers along to keep each conversation's traffic together?
  • Notifications aren't just a copy of delivery: A "you have a new message" decision depends on state the message event itself doesn't carry: is the recipient online, is this conversation focused, muted, a mention, a DM? If you tried to drive notifications off the same code path that pushes messages to sockets, where would that state have to live? What does that suggest about whether notifications belong on the delivery path or somewhere else that consumes the same events?
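
The notification dimension depends on per-recipient state that the message event does not carry. As a minimal sketch of what a separate, state-aware decision could look like, the function below maps recipient state to an action. The `RecipientState` fields, the action names, and the policy itself are all assumptions made for illustration; the question does not prescribe any of them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecipientState:
    """Per-recipient state a notification consumer would need (hypothetical)."""
    online: bool   # at least one device is connected
    focused: bool  # this conversation is open and in view
    muted: bool    # the recipient muted this conversation


def notification_action(state: RecipientState, is_dm: bool, is_mention: bool) -> str:
    """Illustrative policy only: returns 'none', 'badge', 'in_app', or 'push'."""
    if state.focused:
        return "none"      # the user is already looking at the message
    if state.muted and not is_mention:
        return "none"      # muted conversations only break through on mentions
    if not (is_dm or is_mention):
        return "badge"     # ordinary channel traffic: bump the unread count only
    # DMs and mentions interrupt: in-app if connected, mobile push otherwise.
    return "in_app" if state.online else "push"
```

Because the inputs are recipient state rather than message content, a function like this sits more naturally in a consumer of the message event stream than in the socket fanout path.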

Apr 26, 2026, 12:00 AM

Design a Slack-like team messaging system focused on sending and receiving messages in real time. Your design should support workspaces with channels and direct messages, deliver new messages to online clients with low latency, and let a client that was disconnected (or freshly launched) recover everything it missed without dropping or duplicating messages.

Walk through the core data model, the send/receive APIs, the real-time delivery path, missed-message recovery, per-conversation ordering, notifications, cold start, and how the design scales. Be explicit about where durability lives versus where the real-time path is just an optimization.
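
One way to read the durability-versus-delivery boundary is as an ordering rule on the send path: persist first, then attempt the real-time push, and never let a push failure fail the send. The sketch below illustrates that rule with in-memory stand-ins. `MessageLog`, `FlakyBus`, and `send_message` are hypothetical names, and a real store and bus would replace them.

```python
from dataclasses import dataclass, field


@dataclass
class MessageLog:
    """In-memory stand-in for the durable message store (hypothetical)."""
    rows: list = field(default_factory=list)

    def append(self, conversation_id: str, body: str) -> dict:
        row = {"conversation_id": conversation_id, "body": body}
        self.rows.append(row)
        return row


class FlakyBus:
    """Stand-in for the real-time path; here every publish fails."""

    def publish(self, event: dict) -> None:
        raise ConnectionError("real-time path unavailable")


def send_message(log: MessageLog, bus, conversation_id: str, body: str) -> dict:
    # 1. Durability first: the send is acknowledged only after this append.
    row = log.append(conversation_id, body)
    # 2. Delivery second, best effort: a failure here must not fail the send,
    #    because clients can always recover the message from the log.
    try:
        bus.publish(row)
    except ConnectionError:
        pass
    return {"status": "accepted", "message": row}
```

Even with a bus that always fails, the send is accepted and the message is in the log, which is the property the durability constraint asks for.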

Constraints & Assumptions

  • Scale target: on the order of 10M daily active users, 10M+ workspaces, and large public channels with up to ~100K members. Peak send traffic on the order of 100K messages/sec across the fleet.
  • Latency: new messages should reach an online recipient in well under 1 second end to end.
  • Durability: an accepted message must never be lost. Once the server acknowledges a send, the message is recoverable even if every WebSocket drops.
  • Ordering: ordering is required within a single conversation (channel or DM). Global ordering across the whole product is not required.
  • Clients: web, desktop, and mobile; a single user may be connected from several devices at once, and devices go offline frequently. Assume read/write traffic is read-heavy (history loads, scrollback) relative to writes.
  • Out of scope: voice/video calls, file storage internals, search ranking, message editing/threading (mention them only as extensions).
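
To make the scale figures above concrete, here is a rough back-of-envelope sketch. The peak rate and channel size come from the constraints. The online share, the average message size, and the five-person group are assumptions chosen only for illustration.

```python
# Figures taken from the constraints above.
PEAK_SENDS_PER_SEC = 100_000
LARGE_CHANNEL_MEMBERS = 100_000
SECONDS_PER_DAY = 86_400

# ASSUMPTIONS for illustration only; not stated in the problem.
ONLINE_PERCENT = 10         # share of a large channel connected at any instant
AVG_MESSAGE_BYTES = 1_000   # message body plus metadata
SMALL_GROUP_MEMBERS = 5     # a small conversation for comparison


def live_pushes(members: int, online_percent: int = ONLINE_PERCENT) -> int:
    """Socket pushes needed to reach the online members of one conversation."""
    return members * online_percent // 100


def daily_storage_bytes(sends_per_sec: int, avg_bytes: int = AVG_MESSAGE_BYTES) -> int:
    """Upper bound on new message bytes per day if the rate held all day."""
    return sends_per_sec * SECONDS_PER_DAY * avg_bytes


large = live_pushes(LARGE_CHANNEL_MEMBERS)        # 10,000 pushes per message
small = live_pushes(SMALL_GROUP_MEMBERS, 100)     # worst case: all 5 online
per_day = daily_storage_bytes(PEAK_SENDS_PER_SEC) # 8.64e12 bytes
```

Under these assumptions, one post to the largest channel costs about 10,000 socket pushes versus at most 5 for the small group, and a full day at the peak rate would add roughly 8.64 TB of message data. The real numbers depend on the actual online share and message size.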

The Problem

Produce a complete design that addresses the following. Treat each as a dimension the interviewer will probe, and make the durability-versus-delivery boundary explicit throughout.

  1. Core entities — users, workspaces, channels, direct messages, memberships, and messages, plus the keys/IDs that tie them together.
  2. Send & receive APIs — the write path for sending a message and the read paths for live delivery and history.
  3. Real-time delivery — persistent connections (e.g. WebSockets), a gateway that tracks connections, and how a new message reaches connected members.
  4. Missed-message recovery — how a client that was disconnected catches up on reconnect without gaps or duplicates.
  5. Ordering guarantees — what ordering you promise and the mechanism that enforces it.
  6. Notifications — when to push vs. suppress, and the latency/correctness/UX tradeoffs.
  7. Cold start — what happens when a user opens the app after being offline (or on a new device) so it loads fast without downloading everything.
  8. Storage, queues, fanout, and scaling — data stores, the event bus, fanout strategy, and the bottlenecks at the scale above.
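
As one possible shape for the core entities and ordering mechanism listed above, the sketch below keys messages by conversation and assigns a per-conversation sequence number at write time. It is a minimal in-memory illustration. `Message`, `ConversationStore`, `client_msg_id`, and the choice of a dense counter are assumptions, not a prescribed schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    conversation_id: str  # partition key: keeps one conversation together
    seq: int              # ordering key: monotonic within the conversation
    sender_id: str
    client_msg_id: str    # idempotency key chosen by the client (assumed)
    body: str


class ConversationStore:
    """In-memory stand-in for a store partitioned by conversation."""

    def __init__(self) -> None:
        self._next_seq = defaultdict(int)
        self._by_conv = defaultdict(list)
        self._seen = {}  # (conversation, sender, client_msg_id) -> Message

    def append(self, conversation_id, sender_id, client_msg_id, body) -> Message:
        key = (conversation_id, sender_id, client_msg_id)
        if key in self._seen:
            # A retried send returns the original row and consumes no new seq.
            return self._seen[key]
        self._next_seq[conversation_id] += 1
        msg = Message(conversation_id, self._next_seq[conversation_id],
                      sender_id, client_msg_id, body)
        self._by_conv[conversation_id].append(msg)
        self._seen[key] = msg
        return msg

    def history(self, conversation_id, after_seq=0, limit=50):
        """Read path for scrollback and recovery: messages with seq > after_seq."""
        newer = [m for m in self._by_conv[conversation_id] if m.seq > after_seq]
        return newer[:limit]
```

The same `seq` serves as both the ordering key and the recovery cursor, and scoping it to a conversation avoids any need for global ordering. The cost is that each conversation has a single point that assigns sequence numbers.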

Clarifying Questions to Ask

  • What ordering semantics are required — strict per-conversation order, or is best-effort with client-side reordering acceptable?
  • What is the read/write ratio, and how large can a single channel get (10s vs. 100K members)?
  • Is at-least-once delivery with client-side dedup (so each message is displayed exactly once) acceptable, or does the server need to provide stronger delivery guarantees?
  • How many simultaneous devices per user, and must all devices stay consistent (read cursors, unread counts)?
  • What are the latency and durability SLAs, and is any cross-region/geo-replication requirement in scope?
  • Are threads, edits, reactions, and presence in scope now, or future extensions?
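
If at-least-once delivery with client-side dedup is acceptable, the client needs only a per-conversation cursor to drop duplicates and notice gaps. The sketch below assumes sequence numbers are dense (no holes) within a conversation. `ConversationView` and its return values are hypothetical.

```python
class ConversationView:
    """Client-side state for one conversation (hypothetical shape)."""

    def __init__(self, last_seq: int = 0) -> None:
        self.last_seq = last_seq
        self.messages = []  # (seq, body) pairs in display order

    def on_push(self, seq: int, body: str) -> str:
        if seq <= self.last_seq:
            return "duplicate"  # already displayed; safe to drop
        if seq > self.last_seq + 1:
            # Something in between never arrived. Do not display out of order;
            # the caller should fetch history after last_seq instead.
            return "gap"
        self.messages.append((seq, body))
        self.last_seq = seq
        return "applied"
```

On a `"gap"` result the client would pull history after `last_seq` from the durable store, which is what makes the socket an optimization rather than the source of truth.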

What a Strong Answer Covers

  • A clear data model with a justified partition key and an ordering key, and reasoning for why those choices fit the access patterns.
  • An explicit, defended position on the relationship between durability and delivery — which one must happen first, and why that ordering is load-bearing.
  • A send path and a separate live-delivery path, with a connection-tracking tier that knows where each user/device is connected.
  • A correct, gap-free missed-message recovery scheme, including how the client efficiently discovers which of many conversations changed before pulling each one.
  • A concrete ordering mechanism, plus an honest discussion of the bottleneck or failure mode that mechanism introduces and how to mitigate it.
  • A fanout strategy that distinguishes small conversations from very large channels, and reasons about cost as a function of membership and connectivity.
  • Notifications modeled as a separate, user-state-aware concern, with stated latency / correctness / UX tradeoffs.
  • A cold-start flow that prioritizes what the user sees first and defers the rest, rather than downloading everything.
  • Failure handling: how duplicates are suppressed, how gaps are detected and repaired, how retries stay idempotent, and what happens when each component fails.
  • A sense of scale: where the bottlenecks are and how the chosen partitioning/caching addresses them.
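
For the recovery point above, one way a reconnecting client could discover which conversations changed is to compare its cursors against the server's latest sequence number per conversation. This is a sketch under the assumption that both sides track a per-conversation sequence number. `changed_conversations` and the dictionary shapes are hypothetical.

```python
def changed_conversations(client_cursors: dict, server_heads: dict) -> dict:
    """Return {conversation_id: after_seq} for conversations with unseen messages.

    client_cursors: last seq the client has applied, per conversation.
    server_heads:   latest seq the server has persisted, per conversation
                    the user belongs to.
    """
    return {
        conv: client_cursors.get(conv, 0)
        for conv, head in server_heads.items()
        if head > client_cursors.get(conv, 0)
    }
```

For example, `changed_conversations({"a": 5, "b": 9}, {"a": 7, "b": 9, "c": 2})` returns `{"a": 5, "c": 0}`: conversation `b` is skipped entirely, and the client pulls history only for `a` and the newly joined `c`.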

Follow-up Questions

  • How do you keep multiple devices for the same user consistent on read cursors and unread counts?
  • How would you add message edits and deletes while preserving the ordering and recovery model based on sequence numbers?
  • How do you guarantee a client never permanently misses a message if it disconnects in the middle of a fanout — and how does it detect a sequence gap?
  • How would you extend this to support threaded replies, or reactions, without breaking per-conversation ordering?
  • How would you handle a single extremely hot channel (e.g. a company-wide announcement channel) where the single-sequencer approach becomes a bottleneck?
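
For the multi-device follow-up, one possible approach is to keep a single server-side read cursor per user and conversation that only moves forward, and to derive unread counts from it. The sketch below assumes dense per-conversation sequence numbers and counts every message after the cursor as unread. `ReadState` is a hypothetical name.

```python
class ReadState:
    """Server-side read cursors shared by all of a user's devices (sketch)."""

    def __init__(self) -> None:
        self._read = {}  # (user_id, conversation_id) -> highest seq read

    def mark_read(self, user_id: str, conversation_id: str, seq: int) -> int:
        key = (user_id, conversation_id)
        # Taking the max makes updates idempotent and order-insensitive, so a
        # stale update from a lagging device cannot move the cursor backwards.
        self._read[key] = max(self._read.get(key, 0), seq)
        return self._read[key]

    def unread_count(self, user_id: str, conversation_id: str, head_seq: int) -> int:
        read = self._read.get((user_id, conversation_id), 0)
        return max(0, head_seq - read)
```

Because the merge is a max, devices can report reads in any order and still converge on the same cursor and unread count.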
