Design a Real-Time Chat System
Company: Uber
Role: Software Engineer
Category: System Design
Difficulty: medium
Interview Round: Onsite
## Design a Real-Time Chat System
Design the backend for a real-time messaging application that supports one-to-one and small-group chat (think a WhatsApp- or Slack-style messenger). Users can send text messages to another user or to a group, see messages appear in near real time, see delivery and read state, and read their full message history when they come back online, including any messages that arrived while they were disconnected.
The interview deliberately drills into the **communication layer**: what network protocol and transport you use between client and server to push messages in real time, and why you chose it over the alternatives. Be prepared to justify the transport choice, how you maintain the live connection, and how a sender's message reaches an offline recipient who later reconnects.
```hint Where to start
Separate the two hard sub-problems: (1) the *real-time delivery* path — how a message gets pushed to an online recipient with low latency — and (2) the *durability / catch-up* path — how messages are stored so an offline recipient can fetch what they missed. They are answered by different components.
```
```hint Transport choice
The interviewer wants the trade-off between a long-lived bidirectional connection (e.g. **WebSocket**), a one-way server-to-client stream (Server-Sent Events, SSE), and request/response polling (short polling, long polling). Anchor your answer in: server-initiated push, latency, connection overhead, and how each behaves through proxies/load balancers and on mobile radios.
```
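To put rough numbers on the latency and overhead trade-off, here is a back-of-the-envelope sketch for short polling. The function name `polling_cost` is illustrative, and the model assumes messages arrive uniformly at random within a poll interval, so the average wait is half the interval.

```python
def polling_cost(connections: int, poll_interval_s: float) -> tuple[float, float]:
    """Return (requests per second, average added delay in seconds) for short polling."""
    # Every connected client issues one request per interval, whether or not
    # there is anything new to fetch.
    requests_per_s = connections / poll_interval_s
    # Assumption: a message lands, on average, halfway through an interval.
    avg_added_delay_s = poll_interval_s / 2
    return requests_per_s, avg_added_delay_s


# 1M concurrent clients, as in the upper end of the stated assumptions.
for interval in (0.5, 2.0, 10.0):
    rps, delay = polling_cost(1_000_000, interval)
    print(f"poll every {interval}s: {rps:,.0f} req/s, about {delay:.2f}s average added delay")
```

Under this model, tightening the interval to approach a low-hundreds-of-milliseconds target drives request volume into the millions per second, most of it empty responses. A persistent connection avoids that cost by sending only when there is a message to deliver, plus periodic heartbeats.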
```hint Offline delivery
An online recipient is reached over their live connection; an offline one is not. You need a per-user durable inbox or a message log keyed by conversation plus a "last delivered/last read" cursor, so a reconnecting client can pull the gap. Think about which datastore gives you cheap append + range-read by conversation and time.
```
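The log-plus-cursor idea can be sketched in memory as follows. This is a minimal illustration, not a storage design: `ConversationLog`, `fetch_gap`, and `ack` are hypothetical names, and a real system would back this with a datastore that supports append and range reads keyed by conversation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    seq: int      # per-conversation sequence number, assigned on append
    sender: str
    body: str


class ConversationLog:
    """In-memory stand-in for an append-only log keyed by conversation."""

    def __init__(self) -> None:
        self._logs: dict[str, list[Message]] = defaultdict(list)
        # (user_id, conversation_id) -> highest seq the user has acknowledged
        self._cursors: dict[tuple[str, str], int] = defaultdict(int)

    def append(self, conversation_id: str, sender: str, body: str) -> Message:
        log = self._logs[conversation_id]
        msg = Message(seq=len(log) + 1, sender=sender, body=body)
        log.append(msg)
        return msg

    def fetch_gap(self, user_id: str, conversation_id: str) -> list[Message]:
        """Return every message after the user's cursor, in sequence order."""
        cursor = self._cursors[(user_id, conversation_id)]
        # seq values run 1..len(log), so the cursor doubles as the slice start.
        return self._logs[conversation_id][cursor:]

    def ack(self, user_id: str, conversation_id: str, seq: int) -> None:
        """Advance the cursor; it never moves backwards."""
        key = (user_id, conversation_id)
        self._cursors[key] = max(self._cursors[key], seq)


log = ConversationLog()
log.append("alice:bob", "alice", "hi")
log.append("alice:bob", "alice", "are you there?")

missed = log.fetch_gap("bob", "alice:bob")      # bob reconnects and pulls the gap
log.ack("bob", "alice:bob", missed[-1].seq)     # bob confirms receipt
log.append("alice:bob", "alice", "ping")
later = log.fetch_gap("bob", "alice:bob")       # only the message sent after the ack
```

Keeping the cursor on the server means a reconnecting client only needs to say who it is; the alternative, where the client sends its own last-seen sequence number, works equally well with the same log.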
### Constraints & Assumptions
State and defend your own numbers; reasonable working assumptions for this exercise:
- ~50M daily active users; a few hundred thousand to ~1M concurrent live connections at peak.
- Each user sends on the order of tens of messages per day; system peak on the order of ~100k messages/second.
- Messages are small (text, a few hundred bytes); media is uploaded out-of-band to blob storage and only a reference travels through chat.
- Target end-to-end delivery latency for online users in the low hundreds of milliseconds (p99).
- Messages must be durable and ordered within a conversation; no message may be silently lost.
- Groups are small-to-medium (up to a few hundred members), not broadcast-scale channels.
- Single-region reasoning is acceptable as a baseline; note what changes for multi-region.
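As a sanity check on these numbers, the arithmetic can be worked through in a few lines. Two values below are assumptions chosen to make the ranges concrete: "tens of messages per day" is taken as 20, and "a few hundred bytes" as 300.

```python
DAU = 50_000_000
MSGS_PER_USER_PER_DAY = 20     # assumption: "tens of messages" taken as 20
AVG_MESSAGE_BYTES = 300        # assumption: "a few hundred bytes" taken as 300
PEAK_MSGS_PER_S = 100_000
SECONDS_PER_DAY = 86_400

msgs_per_day = DAU * MSGS_PER_USER_PER_DAY
avg_msgs_per_s = msgs_per_day / SECONDS_PER_DAY
peak_to_avg = PEAK_MSGS_PER_S / avg_msgs_per_s
raw_bytes_per_day = msgs_per_day * AVG_MESSAGE_BYTES

print(f"messages per day:   {msgs_per_day:,}")
print(f"average write rate: {avg_msgs_per_s:,.0f} msg/s")
print(f"peak / average:     {peak_to_avg:.2f}x")
print(f"raw payload per day: {raw_bytes_per_day / 1e9:,.0f} GB")
```

With these inputs the average write rate is roughly 11.6k messages per second, so the stated 100k/s peak is a little under 9x the average, and raw message payload is about 300 GB per day before replication, indexes, or metadata. These are write-side figures; group fan-out multiplies the number of deliveries.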
### Clarifying Questions to Ask
- What is the read/write balance and the target latency — is this optimized for live delivery, history fetch, or both equally?
- Do we need 1:1 only, or also group chat, and how large can a group get? (Fan-out cost scales with group size.)
- What delivery semantics are required — at-least-once with client-side dedup, or exactly-once? Do we need ordering guarantees per conversation?
- Which presence/receipt features are in scope: online/last-seen, "delivered", "read", and typing indicators?
- Is end-to-end encryption a requirement, or is transport-layer (TLS) encryption with server-side storage acceptable?
- What is the device model — one device per user, or multiple devices that must all stay in sync?
### Follow-up Questions
- Walk through exactly what happens, component by component, when user A sends a message to user B while B is offline and then B reconnects 10 minutes later.
- The interviewer pushed on the transport: defend WebSocket over HTTP long polling and over Server-Sent Events for this workload. When would you actually prefer SSE or long polling?
- How do you guarantee per-conversation ordering and deduplicate retried sends so a flaky mobile client never shows a message twice or out of order?
- How does the design change for a group with 200 members — do you fan out on write or on read, and what is the cost of each?
- How do you scale the stateful connection tier and route a message to the specific server that currently holds the recipient's socket?
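For the ordering and deduplication follow-up, one common approach is to have the client attach a unique ID to each send and have the server assign a per-conversation sequence number. The sketch below is a minimal in-memory illustration under that assumption; `Conversation`, `send`, and `client_msg_id` are hypothetical names, and a real deployment would expire old IDs rather than keep them forever.

```python
from dataclasses import dataclass, field


@dataclass
class Conversation:
    next_seq: int = 1
    # client_msg_id -> seq already assigned to that send
    seen: dict[str, int] = field(default_factory=dict)
    # (seq, body) pairs in the order the server accepted them
    messages: list[tuple[int, str]] = field(default_factory=list)

    def send(self, client_msg_id: str, body: str) -> int:
        """Idempotent append: a retried send returns the original seq."""
        if client_msg_id in self.seen:
            return self.seen[client_msg_id]
        seq = self.next_seq
        self.next_seq += 1
        self.seen[client_msg_id] = seq
        self.messages.append((seq, body))
        return seq


conv = Conversation()
first = conv.send("device1-0001", "on my way")
retry = conv.send("device1-0001", "on my way")   # same ID, e.g. the ack was lost
second = conv.send("device1-0002", "here")
```

Because the retry maps back to the sequence number already assigned, the message is stored once, and recipients can render strictly by `seq` to get a consistent order within the conversation.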
Quick Answer: This system design question tests a candidate's ability to architect a scalable real-time messaging platform, focusing on transport protocol selection and the trade-offs between live delivery and durable offline catch-up. It evaluates practical knowledge of distributed systems concepts including WebSocket vs. polling strategies, stateful connection management, and message persistence at scale.