Design a Slack-like multi-tenant global messaging system
Company: OpenAI
Role: Software Engineer
Category: System Design
Difficulty: medium
Interview Round: Onsite
Design a team messaging platform similar to Slack that supports **multiple organizations (multi-tenancy)** and is **deployed globally**.
### Functional requirements
- Users can belong to one or more **workspaces** (tenants/organizations).
- Each workspace has multiple **channels** (public and private) and **direct messages (DMs)**.
- Users can:
  - Send and receive real-time text messages in channels and DMs.
  - See message history in channels and DMs.
  - See basic presence (online/away) for other users in the same workspace.
- Messages must be delivered with low latency (e.g., p95 < 200 ms) for active users.
### Non-functional & multi-tenant requirements
- The service must support **millions of users** across **tens of thousands of workspaces**.
- **Multi-tenancy**:
  - Strict data isolation between workspaces: users in one workspace must never see data from another workspace.
  - Different workspaces can have different configurations and limits (e.g., message retention, file size limits).
  - The system should defend against noisy neighbors (one tenant over-consuming shared resources).
- **Global deployment**:
  - Users are geographically distributed (e.g., Americas, Europe, Asia).
  - Users should connect to a nearby region for good latency.
  - Many large organizations have employees in multiple regions in the **same workspace**.
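
The noisy-neighbor requirement above is often approached with per-tenant rate limiting. The following is a minimal sketch under stated assumptions, not a prescribed solution: it uses a token bucket keyed by workspace ID, and the names `TokenBucket`, `TenantRateLimiter`, `rate`, and `burst` are hypothetical. A production system would likely keep these counters in a shared store rather than in process memory.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class TokenBucket:
    rate: float        # tokens refilled per second
    burst: float       # maximum tokens the bucket can hold
    tokens: float      # tokens currently available
    updated_at: float  # timestamp of the last refill

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill in proportion to elapsed time, capped at the burst size.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class TenantRateLimiter:
    """Hypothetical per-workspace limiter with per-tenant overrides."""

    def __init__(self, default_rate: float, default_burst: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._default: Tuple[float, float] = (default_rate, default_burst)
        self._overrides: Dict[str, Tuple[float, float]] = {}
        self._buckets: Dict[str, TokenBucket] = {}
        self._clock = clock

    def set_limit(self, workspace_id: str, rate: float, burst: float) -> None:
        # Larger tenants can be given a higher quota; reset their bucket.
        self._overrides[workspace_id] = (rate, burst)
        self._buckets.pop(workspace_id, None)

    def allow(self, workspace_id: str) -> bool:
        now = self._clock()
        bucket = self._buckets.get(workspace_id)
        if bucket is None:
            rate, burst = self._overrides.get(workspace_id, self._default)
            bucket = TokenBucket(rate, burst, tokens=burst, updated_at=now)
            self._buckets[workspace_id] = bucket
        return bucket.allow(now)
```

Because each workspace has its own bucket, a burst from one tenant exhausts only that tenant's tokens, and `set_limit` covers the requirement that different workspaces can have different limits.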
### Design tasks
Describe a design that covers at least the following aspects:
1. **API and high-level architecture**
   - Key services (e.g., gateway/API layer, auth, workspace/channel management, messaging, presence, search, notification).
   - How clients (web/desktop/mobile) connect to the system for real-time messaging (e.g., WebSockets, long polling).
2. **Data model and storage**
   - Core entities: `Workspace (Tenant)`, `User`, `Membership`, `Channel`, `Message`.
   - What storage technologies you would use for:
     - Metadata (users, workspaces, channels, memberships).
     - Messages and their history.
   - How you would **partition/shard** data to scale to many tenants and users.
3. **Multi-tenant architecture**
   - How you will represent tenant boundaries in the data model and APIs (e.g., `tenant_id` everywhere).
   - Options for physically storing tenant data: fully shared DB with a `tenant_id` column, separate DB per tenant, or a hybrid; discuss pros/cons.
   - How you enforce security and isolation across all layers (auth, services, storage).
   - Handling noisy neighbors (rate limiting, quotas, priority or dedicated resources for large tenants).
4. **Global deployment and replication**
   - How you would deploy the system into multiple regions.
   - How users get routed to the closest region (e.g., DNS, anycast, global load balancers).
   - How data for a single global workspace is handled when users are in multiple regions:
     - Where is the **source of truth** for messages of a workspace?
     - How are messages replicated across regions (e.g., asynchronous replication, regional caches)?
     - What consistency guarantees do you provide (e.g., eventual consistency across regions vs strong consistency within a region)?
   - Strategies for regional failover and disaster recovery.
5. **Scalability and performance**
   - How you would scale:
     - WebSocket / real-time connections.
     - Message fan-out to many subscribers in a busy channel.
     - Message storage and retrieval.
   - Caching strategies and indexing for recent history vs deep history.
6. **Other considerations** (at a high level)
   - Search and message indexing across channels in a workspace.
   - File attachments (storage and access controls) if you have time.
   - Security (encryption in transit/at rest, per-tenant encryption keys, audit logging).
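
To make tasks 2 and 3 more concrete, here is one possible sketch of a tenant-scoped message model with hash-based shard routing. It is an illustration under assumptions, not the expected answer: `NUM_SHARDS`, `shard_for`, and `MessageStore` are hypothetical names, the store is an in-memory stand-in for a real database, and `message_id` is assumed to increase within a channel.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Tuple

NUM_SHARDS = 64  # hypothetical shard count


@dataclass(frozen=True)
class Message:
    workspace_id: str  # tenant boundary, carried on every record
    channel_id: str
    message_id: int    # assumed to increase within a channel
    sender_id: str
    body: str


def shard_for(workspace_id: str, channel_id: str,
              num_shards: int = NUM_SHARDS) -> int:
    """Map a (workspace, channel) pair to a shard using a stable hash."""
    key = f"{workspace_id}:{channel_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class MessageStore:
    """In-memory stand-in for a sharded store; all access is tenant-scoped."""

    def __init__(self, num_shards: int = NUM_SHARDS) -> None:
        self._num_shards = num_shards
        self._shards: List[Dict[Tuple[str, str], List[Message]]] = [
            {} for _ in range(num_shards)
        ]

    def _shard(self, workspace_id: str, channel_id: str):
        return self._shards[shard_for(workspace_id, channel_id, self._num_shards)]

    def append(self, msg: Message) -> None:
        shard = self._shard(msg.workspace_id, msg.channel_id)
        shard.setdefault((msg.workspace_id, msg.channel_id), []).append(msg)

    def history(self, workspace_id: str, channel_id: str,
                limit: int = 50) -> List[Message]:
        # The workspace ID is part of the lookup key, so a query can never
        # return another tenant's rows even if both land on the same shard.
        shard = self._shard(workspace_id, channel_id)
        return shard.get((workspace_id, channel_id), [])[-limit:]
```

Sharding on the (workspace, channel) pair spreads a very large tenant across shards, at the cost of cross-shard queries for workspace-wide operations such as search. Sharding on the workspace alone keeps a tenant's data together but risks hot shards for large tenants.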
Explain the trade-offs you are making (e.g., consistency vs availability, shared vs isolated tenant storage) and justify your choices in terms of reliability, cost, and operational complexity.
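
As one illustration of the consistency-versus-availability trade-off named above, a design might pin each workspace to a single home region that orders and persists all of its writes. The sketch below assumes that approach; `HomeRegionRouter`, `WriteRoute`, and the region names are hypothetical, and other answers (for example, multi-primary replication) are equally valid.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class WriteRoute:
    accept_region: str  # region the client is connected to
    commit_region: str  # region that orders and persists the write
    forwarded: bool     # True when the write must cross regions


class HomeRegionRouter:
    """Hypothetical router: each workspace has one home region for writes."""

    def __init__(self, home_regions: Dict[str, str], default_region: str) -> None:
        self._home = dict(home_regions)
        self._default = default_region

    def route_write(self, workspace_id: str, client_region: str) -> WriteRoute:
        home = self._home.get(workspace_id, self._default)
        return WriteRoute(
            accept_region=client_region,
            commit_region=home,
            forwarded=(home != client_region),
        )


# Example: a workspace homed in a hypothetical "eu-west" region.
router = HomeRegionRouter({"ws_acme": "eu-west"}, default_region="us-east")
route = router.route_write("ws_acme", client_region="ap-south")
print(route.commit_region, route.forwarded)
```

With this choice, ordering within a workspace stays simple because one region decides it, but users far from the home region pay an extra cross-region round trip on writes, and an outage of the home region blocks writes for that workspace until failover completes.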
Quick Answer: This question evaluates a candidate's ability to design large-scale, multi-tenant real-time messaging systems, testing competencies in distributed systems, data modeling, multi-tenancy, replication, latency optimization, and operational isolation.