Design a Real-Time Chess Service
Company: OpenAI
Role: Backend Engineer
Category: System Design
Difficulty: medium
Interview Round: Technical Screen
Design an online chess service that supports **real-time, two-player games**.
Users should be able to create or join a game, make legal moves, and see their opponent's moves with low latency. Games end by **checkmate, timeout (flag fall), resignation, draw** (agreement, stalemate, threefold repetition, fifty-move rule, or insufficient material), or **abandonment**. The server is the authority on rules, move ordering, and the clock — clients are never trusted.
Your design should address, in particular:
1. **State representation & persistence** — how you model and store a game's position and history.
2. **Move validation & ordering** — how the server rejects illegal/out-of-turn moves and keeps the two players' views totally consistent.
3. **Clock accuracy** — how each player's remaining time is tracked authoritatively, including a timeout that occurs while a player is simply thinking.
4. **Failure recovery** — what happens when a client disconnects, crashes, or reconnects from a different device, and when a server-side game owner fails.
5. **Scaling** — how the service handles millions of concurrent games while preserving per-game correctness.
The system must feel real-time: a move should typically reach the opponent in well under a few hundred milliseconds. State any additional assumptions you make.
```hint Where to start
Before drawing boxes, characterize the workload of a *single* game: how many messages per minute, how strict is the ordering, how latency-sensitive is each one? Then multiply by the concurrency target. The shape of the answer — what's small per game vs. what's enormous across games — should tell you where the real engineering problem is, and which of your design choices that problem ought to drive.
```
```hint The key simplifier
For each turn of one game, what is the smallest unit you'd be willing to let process commands one at a time? If you pin that responsibility somewhere specific, validation, turn order, and clock math might stop needing distributed locks or cross-node transactions. What's the unit, and where does it live?
```
```hint State representation
You probably don't want one board encoding for everything — the cheapest *durable write*, the fastest *in-memory legality check*, and the most convenient *snapshot/resync* format may not be the same thing. Sketch what each layer wants. Then ask: among the draw rules, is there one that depends on *history* rather than the current position — and if so, what minimal extra bookkeeping covers it?
```
```hint Clocks
Storing a per-second countdown is wasteful and drifts under jitter. What's the minimal set of fields from which you could *derive* a player's live remaining time on demand? Once you have that, two cases get interesting: nobody ever moves and a clock simply runs out, and the owning process dies mid-turn — what has to be true about the timestamp you stored for a fresh owner to recompute the clock correctly?
```
```hint Reliability of a move
Decide your durability boundary: at what exact point is a move safe enough to tell the mover "it counted"? Then think about the retry that follows a crash — what could the client attach to a move so the server can recognize a duplicate and apply it no more than once?
```
### Constraints & Assumptions
State your own; a reasonable set:
- **Concurrency**: design for a peak of ~$1\text{M}$ concurrent games $\Rightarrow$ ~$2\text{M}$ live socket connections.
- **Per-game traffic**: a blitz game is ~40 moves over a few minutes $\approx 0.1$ moves/sec/game — latency-sensitive, not throughput-bound.
- **Latency target**: p99 move round-trip (mover → server → opponent) under ~200 ms within a region.
- **Correctness (strong, per game)**: moves are totally ordered; no illegal or out-of-turn move is ever accepted; an acknowledged move is never lost; clocks never silently go negative.
- **Time controls**: support base time + optional per-move increment (e.g. `5+3` = 5 min + 3 s/move) and delay variants; the **server** is the single source of truth for remaining time.
- **Out of scope** (leave room, but don't design): anti-cheat / engine detection, rating math (Elo/Glicko), spectators, chat, tournaments, puzzles.
### Clarifying Questions to Ask
- What time controls must we support (bullet/blitz/rapid/classical, increment vs. delay), and is **lag compensation** a product requirement?
- How strict is the latency SLA, and is play single-region or must we handle two players in **different regions**?
- On a disconnect, does the absent player's clock keep running, and how long is the **grace period** before abandonment is declared?
- Must a player be able to connect from **two devices at once** (e.g. phone + laptop), and do we need spectator/replay support now or later?
- What scale are we targeting (peak concurrent games), and what durability guarantee does an accepted move need?
- Are draws by threefold repetition / fifty-move **auto-applied** by the server or **claimed** by a player?
### What a Strong Answer Covers
- A clear **functional / non-functional requirements** split and back-of-the-envelope sizing, with a defensible conclusion about *where the real bottleneck is* — and design choices that follow from it rather than generic boxes.
- A coherent **per-game ordering strategy** that makes validation, turn order, and clock math race-free, with a clear argument for why it avoids distributed locks or cross-node transactions (any defensible mechanism, not one specific blessed answer).
- A concrete **state/persistence model** that distinguishes the durable source of truth from fast in-memory and resync representations, and that correctly handles every termination condition — including whichever draw rule needs more than the current position.
- **Server-authoritative move validation** with a precise, well-ordered step sequence, and a clear position on how legality and the clock interact at the instant a flag could fall.
- An **authoritative clock model** that derives live time rather than ticking, handles a flag fall when no move arrives, and — critically — survives a process crash so a fresh owner recomputes the right remaining time; bonus for reasoning rigorously about which time source is trustworthy across a crash.
- A credible **reconnect & recovery** story covering gap detection, catch-up vs. fresh snapshot, safe move retries, and how a failed server-side owner is rebuilt (state *and* clock), with honesty about where the recovered clock could be off.
- A **scaling strategy** that keeps each game correct and ordered while millions run in parallel — partitioning, connection fan-out, hot/cold storage, and regional placement — plus the cross-region and spectator follow-ups.
- Explicit **edge cases and tradeoffs**, stated as costs the candidate has reasoned about rather than glossed (e.g. the price of the durability boundary, the limits of their ordering choice, residual time-source risk, lag compensation as a policy decision).
### Follow-up Questions
- How do you keep a player from being charged clock time for the **network round-trip** of delivering their move? Sketch a concrete lag-compensation policy.
- A worker holding 50k games crashes. Walk through exactly how a new owner rebuilds each game's position **and** the active player's remaining clock — and where the recovered clock could be off.
- Two players are in different regions. Where do you place the game's authoritative owner, and how do you keep *both* players' move latency acceptable?
- How would you extend the design to support **spectators** (thousands watching one popular game) without slowing the players' move path?
Quick Answer: This question evaluates proficiency in designing real-time, authoritative backend systems with emphasis on state representation and persistence, strict move validation and ordering, authoritative clock management, failure recovery, and preserving per-game correctness at scale under low-latency constraints.