How do I approach System Design interview questions?

System Design questions require understanding of core concepts and practice. PracHub provides solutions with explanations to help you master system design interviews.

What difficulty level is this interview question?

This is a medium difficulty System Design question, commonly asked during Technical Screen rounds at OpenAI.

What role is this question designed for?

This question is commonly asked for Backend Engineer candidates at OpenAI during technical interviews.

Design a Real-Time Chess Service | OpenAI Interview Question

Q: Design a Real-Time Chess Service

This question evaluates proficiency in designing real-time, authoritative backend systems with emphasis on state representation and persistence, strict move validation and ordering, authoritative clock management, failure recovery, and preserving per-game correctness at scale under low-latency constraints.

Design an online chess service that supports real-time, two-player games.

Users should be able to create or join a game, make legal moves, and see their opponent's moves with low latency. Games end by checkmate, timeout (flag fall), resignation, draw (agreement, stalemate, threefold repetition, fifty-move rule, or insufficient material), or abandonment. The server is the authority on rules, move ordering, and the clock — clients are never trusted.

Your design should address, in particular:

State representation & persistence — how you model and store a game's position and history.
Move validation & ordering — how the server rejects illegal/out-of-turn moves and keeps the two players' views totally consistent.
Clock accuracy — how each player's remaining time is tracked authoritatively, including a timeout that occurs while a player is simply thinking.
Failure recovery — what happens when a client disconnects, crashes, or reconnects from a different device, and when a server-side game owner fails.
Scaling — how the service handles millions of concurrent games while preserving per-game correctness.

The system must feel real-time: a move should typically reach the opponent in well under a few hundred milliseconds. State any additional assumptions you make.

Constraints & Assumptions

State your own; a reasonable set:

Concurrency : design for a peak of ~ $1\text{M}$ concurrent games $\Rightarrow$ ~ $2\text{M}$ live socket connections.
Per-game traffic : a blitz game is ~40 moves over a few minutes $\approx 0.1$ moves/sec/game — latency-sensitive, not throughput-bound.
Latency target : p99 move round-trip (mover → server → opponent) under ~200 ms within a region.
Correctness (strong, per game) : moves are totally ordered; no illegal or out-of-turn move is ever accepted; an acknowledged move is never lost; clocks never silently go negative.
Time controls : support base time + optional per-move increment (e.g. 5+3 = 5 min + 3 s/move) and delay variants; the server is the single source of truth for remaining time.
Out of scope (leave room, but don't design): anti-cheat / engine detection, rating math (Elo/Glicko), spectators, chat, tournaments, puzzles.

Clarifying Questions to Ask

What time controls must we support (bullet/blitz/rapid/classical, increment vs. delay), and is lag compensation a product requirement?
How strict is the latency SLA, and is play single-region or must we handle two players in different regions ?
On a disconnect, does the absent player's clock keep running, and how long is the grace period before abandonment is declared?
Must a player be able to connect from two devices at once (e.g. phone + laptop), and do we need spectator/replay support now or later?
What scale are we targeting (peak concurrent games), and what durability guarantee does an accepted move need?
Are draws by threefold repetition / fifty-move auto-applied by the server or claimed by a player?

What a Strong Answer Covers

A clear functional / non-functional requirements split and back-of-the-envelope sizing, with a defensible conclusion about where the real bottleneck is — and design choices that follow from it rather than generic boxes.
A coherent per-game ordering strategy that makes validation, turn order, and clock math race-free, with a clear argument for why it avoids distributed locks or cross-node transactions (any defensible mechanism, not one specific blessed answer).
A concrete state/persistence model that distinguishes the durable source of truth from fast in-memory and resync representations, and that correctly handles every termination condition — including whichever draw rule needs more than the current position.
Server-authoritative move validation with a precise, well-ordered step sequence, and a clear position on how legality and the clock interact at the instant a flag could fall.
An authoritative clock model that derives live time rather than ticking, handles a flag fall when no move arrives, and — critically — survives a process crash so a fresh owner recomputes the right remaining time; bonus for reasoning rigorously about which time source is trustworthy across a crash.
A credible reconnect & recovery story covering gap detection, catch-up vs. fresh snapshot, safe move retries, and how a failed server-side owner is rebuilt (state and clock), with honesty about where the recovered clock could be off.
A scaling strategy that keeps each game correct and ordered while millions run in parallel — partitioning, connection fan-out, hot/cold storage, and regional placement — plus the cross-region and spectator follow-ups.
Explicit edge cases and tradeoffs , stated as costs the candidate has reasoned about rather than glossed (e.g. the price of the durability boundary, the limits of their ordering choice, residual time-source risk, lag compensation as a policy decision).

Follow-up Questions

How do you keep a player from being charged clock time for the network round-trip of delivering their move? Sketch a concrete lag-compensation policy.
A worker holding 50k games crashes. Walk through exactly how a new owner rebuilds each game's position and the active player's remaining clock — and where the recovered clock could be off.
Two players are in different regions. Where do you place the game's authoritative owner, and how do you keep both players' move latency acceptable?
How would you extend the design to support spectators (thousands watching one popular game) without slowing the players' move path?

Design an online chess service that supports real-time, two-player games.

Your design should address, in particular:

State representation & persistence — how you model and store a game's position and history.
Move validation & ordering — how the server rejects illegal/out-of-turn moves and keeps the two players' views totally consistent.
Clock accuracy — how each player's remaining time is tracked authoritatively, including a timeout that occurs while a player is simply thinking.
Failure recovery — what happens when a client disconnects, crashes, or reconnects from a different device, and when a server-side game owner fails.
Scaling — how the service handles millions of concurrent games while preserving per-game correctness.

The system must feel real-time: a move should typically reach the opponent in well under a few hundred milliseconds. State any additional assumptions you make.

Constraints & Assumptions

State your own; a reasonable set:

Concurrency : design for a peak of ~ $1\text{M}$ concurrent games $\Rightarrow$ ~ $2\text{M}$ live socket connections.
Per-game traffic : a blitz game is ~40 moves over a few minutes $\approx 0.1$ moves/sec/game — latency-sensitive, not throughput-bound.
Latency target : p99 move round-trip (mover → server → opponent) under ~200 ms within a region.
Correctness (strong, per game) : moves are totally ordered; no illegal or out-of-turn move is ever accepted; an acknowledged move is never lost; clocks never silently go negative.
Time controls : support base time + optional per-move increment (e.g. 5+3 = 5 min + 3 s/move) and delay variants; the server is the single source of truth for remaining time.
Out of scope (leave room, but don't design): anti-cheat / engine detection, rating math (Elo/Glicko), spectators, chat, tournaments, puzzles.

Clarifying Questions to Ask

What time controls must we support (bullet/blitz/rapid/classical, increment vs. delay), and is lag compensation a product requirement?
How strict is the latency SLA, and is play single-region or must we handle two players in different regions ?
On a disconnect, does the absent player's clock keep running, and how long is the grace period before abandonment is declared?
Must a player be able to connect from two devices at once (e.g. phone + laptop), and do we need spectator/replay support now or later?
What scale are we targeting (peak concurrent games), and what durability guarantee does an accepted move need?
Are draws by threefold repetition / fifty-move auto-applied by the server or claimed by a player?

What a Strong Answer Covers

A clear functional / non-functional requirements split and back-of-the-envelope sizing, with a defensible conclusion about where the real bottleneck is — and design choices that follow from it rather than generic boxes.
A coherent per-game ordering strategy that makes validation, turn order, and clock math race-free, with a clear argument for why it avoids distributed locks or cross-node transactions (any defensible mechanism, not one specific blessed answer).
A concrete state/persistence model that distinguishes the durable source of truth from fast in-memory and resync representations, and that correctly handles every termination condition — including whichever draw rule needs more than the current position.
Server-authoritative move validation with a precise, well-ordered step sequence, and a clear position on how legality and the clock interact at the instant a flag could fall.
An authoritative clock model that derives live time rather than ticking, handles a flag fall when no move arrives, and — critically — survives a process crash so a fresh owner recomputes the right remaining time; bonus for reasoning rigorously about which time source is trustworthy across a crash.
A credible reconnect & recovery story covering gap detection, catch-up vs. fresh snapshot, safe move retries, and how a failed server-side owner is rebuilt (state and clock), with honesty about where the recovered clock could be off.
A scaling strategy that keeps each game correct and ordered while millions run in parallel — partitioning, connection fan-out, hot/cold storage, and regional placement — plus the cross-region and spectator follow-ups.
Explicit edge cases and tradeoffs , stated as costs the candidate has reasoned about rather than glossed (e.g. the price of the durability boundary, the limits of their ordering choice, residual time-source risk, lag compensation as a policy decision).

Follow-up Questions

How do you keep a player from being charged clock time for the network round-trip of delivering their move? Sketch a concrete lag-compensation policy.
A worker holding 50k games crashes. Walk through exactly how a new owner rebuilds each game's position and the active player's remaining clock — and where the recovered clock could be off.
Two players are in different regions. Where do you place the game's authoritative owner, and how do you keep both players' move latency acceptable?
How would you extend the design to support spectators (thousands watching one popular game) without slowing the players' move path?

Design a Real-Time Chess Service

Quick Overview

Constraints & Assumptions

Clarifying Questions to Ask

What a Strong Answer Covers

Follow-up Questions

Solution

Submit Your Answer to Earn 20XP

Design a Real-Time Chess Service

Quick Overview

Constraints & Assumptions

Clarifying Questions to Ask

What a Strong Answer Covers

Follow-up Questions

Solution

Submit Your Answer to Earn 20XP