Design a team messaging platform similar to Slack that supports multiple organizations (multi-tenancy) and is deployed globally.
Functional requirements
- Users can belong to one or more workspaces (tenants/organizations).
- Each workspace has multiple channels (public and private) and direct messages (DMs).
- Users can:
  - Send and receive real-time text messages in channels and DMs.
  - See message history in channels and DMs.
  - See basic presence (online/away) for other users in the same workspace.
- Messages must be delivered with low latency (e.g., p95 < 200 ms) for active users.
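To make the latency target concrete, the following is a minimal sketch of how a p95 check could be computed from measured delivery latencies. It assumes the nearest-rank percentile definition; the function names and the 200 ms default are illustrative and not part of the requirements.

```python
def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile (integer pct in 1..100) by nearest rank."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be an integer between 1 and 100")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100 gives the 1-based nearest rank.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


def meets_latency_target(samples_ms, target_ms=200.0, pct=95):
    """True if the pct-th percentile latency is strictly below target_ms."""
    return percentile_nearest_rank(samples_ms, pct) < target_ms
```

With 20 samples, the nearest rank for p95 is 19, so a single slow outlier does not break the target but two do.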
Non-functional & multi-tenant requirements
- The service must support millions of users across tens of thousands of workspaces.
- Multi-tenancy:
  - Strict data isolation between workspaces: users in one workspace must never see data from another workspace.
  - Different workspaces can have different configurations and limits (e.g., message retention, file size limits).
  - The system should defend against noisy neighbors (one tenant over-consuming shared resources).
- Global deployment:
  - Users are geographically distributed (e.g., Americas, Europe, Asia).
  - Users should connect to a nearby region for good latency.
  - Many large organizations have employees in multiple regions in the same workspace.
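As an illustration of the noisy-neighbor and per-workspace-limits requirements above, here is a minimal sketch of a per-tenant token-bucket rate limiter. Token bucket is one common choice, not the only one; the class names, default rates, and the explicit `now` parameter (passed in to keep the example deterministic) are all assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    rate_per_sec: float   # tokens added per second
    capacity: float       # maximum burst size
    tokens: float         # tokens currently available
    last_refill: float    # timestamp of the last refill, in seconds

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_sec)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class TenantRateLimiter:
    """One bucket per workspace, so one tenant cannot drain another's budget."""

    def __init__(self, default_rate: float = 5.0, default_burst: float = 10.0):
        self._default = (default_rate, default_burst)
        self._overrides = {}  # workspace_id -> (rate, burst)
        self._buckets = {}    # workspace_id -> TokenBucket

    def set_limit(self, workspace_id: str, rate_per_sec: float, burst: float) -> None:
        # Per-workspace configuration, e.g. a higher limit for a large tenant.
        self._overrides[workspace_id] = (rate_per_sec, burst)
        self._buckets.pop(workspace_id, None)  # rebuild the bucket on next use

    def allow(self, workspace_id: str, now: float) -> bool:
        bucket = self._buckets.get(workspace_id)
        if bucket is None:
            rate, burst = self._overrides.get(workspace_id, self._default)
            bucket = TokenBucket(rate, burst, tokens=burst, last_refill=now)
            self._buckets[workspace_id] = bucket
        return bucket.allow(now)
```

In a real deployment the bucket state would likely live in a shared store rather than process memory; this sketch keeps it local to show only the isolation property.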
Design tasks
Describe a design that covers at least the following aspects:
- API and high-level architecture
  - Key services (e.g., gateway/API layer, auth, workspace/channel management, messaging, presence, search, notification).
  - How clients (web/desktop/mobile) connect to the system for real-time messaging (e.g., WebSockets, long polling).
- Data model and storage
  - Core entities: Workspace (Tenant), User, Membership, Channel, Message.
  - What storage technologies you would use for:
    - Metadata (users, workspaces, channels, memberships).
    - Messages and their history.
  - How you would partition/shard data to scale to many tenants and users.
- Multi-tenant architecture
  - How you will represent tenant boundaries in the data model and APIs (e.g., tenant_id everywhere).
  - Options for physically storing tenant data: fully shared DB with a tenant_id column, separate DB per tenant, or a hybrid; discuss pros/cons.
  - How you enforce security and isolation across all layers (auth, services, storage).
  - Handling noisy neighbors (rate limiting, quotas, priority or dedicated resources for large tenants).
- Global deployment and replication
  - How you would deploy the system into multiple regions.
  - How users get routed to the closest region (e.g., DNS, anycast, global load balancers).
  - How data for a single global workspace is handled when users are in multiple regions:
    - Where is the source of truth for messages of a workspace?
    - How are messages replicated across regions (e.g., asynchronous replication, regional caches)?
    - What consistency guarantees do you provide (e.g., eventual consistency across regions vs strong consistency within a region)?
  - Strategies for regional failover and disaster recovery.
- Scalability and performance
  - How you would scale:
    - WebSocket / real-time connections.
    - Message fan-out to many subscribers in a busy channel.
    - Message storage and retrieval.
  - Caching strategies and indexing for recent history vs deep history.
- Other considerations (at a high level)
  - Search and message indexing across channels in a workspace.
  - File attachments (storage and access controls) if you have time.
  - Security (encryption in transit/at rest, per-tenant encryption keys, audit logging).
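As a starting point for the data-model and multi-tenancy tasks above, the sketch below shows one possible way to carry the tenant boundary through every record and every read, and to derive a shard from the workspace ID. It is an in-memory illustration under stated assumptions: `MessageStore`, `shard_for`, the field names, and the hash-based sharding are hypothetical choices, and the User, Workspace, and Channel entities and channel-level ACLs are omitted for brevity.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    workspace_id: str  # tenant boundary carried on every record
    channel_id: str
    message_id: int    # per-channel sequence number
    sender_id: str
    text: str


def shard_for(workspace_id: str, num_shards: int) -> int:
    """Map a workspace to a shard with a stable hash (assumed scheme)."""
    digest = hashlib.sha256(workspace_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class MessageStore:
    """In-memory stand-in for a sharded store keyed by (workspace, channel)."""

    def __init__(self, num_shards: int = 4):
        self._num_shards = num_shards
        self._shards = [dict() for _ in range(num_shards)]
        self._members = set()  # (workspace_id, user_id) membership pairs

    def add_member(self, workspace_id: str, user_id: str) -> None:
        self._members.add((workspace_id, user_id))

    def _require_member(self, workspace_id: str, user_id: str) -> None:
        # Every read and write is checked against the caller's tenant.
        if (workspace_id, user_id) not in self._members:
            raise PermissionError(f"{user_id} is not a member of {workspace_id}")

    def _channel_log(self, workspace_id: str, channel_id: str) -> list:
        shard = self._shards[shard_for(workspace_id, self._num_shards)]
        return shard.setdefault((workspace_id, channel_id), [])

    def append(self, workspace_id, channel_id, sender_id, text) -> Message:
        self._require_member(workspace_id, sender_id)
        log = self._channel_log(workspace_id, channel_id)
        msg = Message(workspace_id, channel_id, len(log) + 1, sender_id, text)
        log.append(msg)
        return msg

    def history(self, workspace_id, channel_id, reader_id, limit=50) -> list:
        self._require_member(workspace_id, reader_id)
        if limit <= 0:
            return []
        return list(self._channel_log(workspace_id, channel_id)[-limit:])
```

Sharding by workspace keeps a tenant's data together, which simplifies isolation, but a very large workspace can then become a hot shard; a composite key such as (workspace, channel) is one alternative worth weighing in the trade-off discussion.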
Explain the trade-offs you are making (e.g., consistency vs availability, shared vs isolated tenant storage) and justify your choices in terms of reliability, cost, and operational complexity.