PracHub
QuestionsPremiumCoachesLearningGuidesInterview Prep
|Home/System Design/OpenAI

Design a Real-Time Chess Service

Last updated: Jun 17, 2026

Quick Overview

This question evaluates proficiency in designing real-time, authoritative backend systems with emphasis on state representation and persistence, strict move validation and ordering, authoritative clock management, failure recovery, and preserving per-game correctness at scale under low-latency constraints.

  • medium
  • OpenAI
  • System Design
  • Backend Engineer

Design a Real-Time Chess Service

Company: OpenAI

Role: Backend Engineer

Category: System Design

Difficulty: medium

Interview Round: Technical Screen

Design an online chess service that supports **real-time, two-player games**. Users should be able to create or join a game, make legal moves, and see their opponent's moves with low latency. Games end by **checkmate, timeout (flag fall), resignation, draw** (agreement, stalemate, threefold repetition, fifty-move rule, or insufficient material), or **abandonment**. The server is the authority on rules, move ordering, and the clock — clients are never trusted. Your design should address, in particular: 1. **State representation & persistence** — how you model and store a game's position and history. 2. **Move validation & ordering** — how the server rejects illegal/out-of-turn moves and keeps the two players' views totally consistent. 3. **Clock accuracy** — how each player's remaining time is tracked authoritatively, including a timeout that occurs while a player is simply thinking. 4. **Failure recovery** — what happens when a client disconnects, crashes, or reconnects from a different device, and when a server-side game owner fails. 5. **Scaling** — how the service handles millions of concurrent games while preserving per-game correctness. The system must feel real-time: a move should typically reach the opponent in well under a few hundred milliseconds. State any additional assumptions you make. ```hint Where to start Before drawing boxes, characterize the workload of a *single* game: how many messages per minute, how strict is the ordering, how latency-sensitive is each one? Then multiply by the concurrency target. The shape of the answer — what's small per game vs. what's enormous across games — should tell you where the real engineering problem is, and which of your design choices that problem ought to drive. ``` ```hint The key simplifier For each turn of one game, what is the smallest unit you'd be willing to let process commands one at a time? If you pin that responsibility somewhere specific, validation, turn order, and clock math might stop needing distributed locks or cross-node transactions. What's the unit, and where does it live? ``` ```hint State representation You probably don't want one board encoding for everything — the cheapest *durable write*, the fastest *in-memory legality check*, and the most convenient *snapshot/resync* format may not be the same thing. Sketch what each layer wants. Then ask: among the draw rules, is there one that depends on *history* rather than the current position — and if so, what minimal extra bookkeeping covers it? ``` ```hint Clocks Storing a per-second countdown is wasteful and drifts under jitter. What's the minimal set of fields from which you could *derive* a player's live remaining time on demand? Once you have that, two cases get interesting: nobody ever moves and a clock simply runs out, and the owning process dies mid-turn — what has to be true about the timestamp you stored for a fresh owner to recompute the clock correctly? ``` ```hint Reliability of a move Decide your durability boundary: at what exact point is a move safe enough to tell the mover "it counted"? Then think about the retry that follows a crash — what could the client attach to a move so the server can recognize a duplicate and apply it no more than once? ``` ### Constraints & Assumptions State your own; a reasonable set: - **Concurrency**: design for a peak of ~$1\text{M}$ concurrent games $\Rightarrow$ ~$2\text{M}$ live socket connections. - **Per-game traffic**: a blitz game is ~40 moves over a few minutes $\approx 0.1$ moves/sec/game — latency-sensitive, not throughput-bound. - **Latency target**: p99 move round-trip (mover → server → opponent) under ~200 ms within a region. - **Correctness (strong, per game)**: moves are totally ordered; no illegal or out-of-turn move is ever accepted; an acknowledged move is never lost; clocks never silently go negative. - **Time controls**: support base time + optional per-move increment (e.g. `5+3` = 5 min + 3 s/move) and delay variants; the **server** is the single source of truth for remaining time. - **Out of scope** (leave room, but don't design): anti-cheat / engine detection, rating math (Elo/Glicko), spectators, chat, tournaments, puzzles. ### Clarifying Questions to Ask - What time controls must we support (bullet/blitz/rapid/classical, increment vs. delay), and is **lag compensation** a product requirement? - How strict is the latency SLA, and is play single-region or must we handle two players in **different regions**? - On a disconnect, does the absent player's clock keep running, and how long is the **grace period** before abandonment is declared? - Must a player be able to connect from **two devices at once** (e.g. phone + laptop), and do we need spectator/replay support now or later? - What scale are we targeting (peak concurrent games), and what durability guarantee does an accepted move need? - Are draws by threefold repetition / fifty-move **auto-applied** by the server or **claimed** by a player? ### What a Strong Answer Covers - A clear **functional / non-functional requirements** split and back-of-the-envelope sizing, with a defensible conclusion about *where the real bottleneck is* — and design choices that follow from it rather than generic boxes. - A coherent **per-game ordering strategy** that makes validation, turn order, and clock math race-free, with a clear argument for why it avoids distributed locks or cross-node transactions (any defensible mechanism, not one specific blessed answer). - A concrete **state/persistence model** that distinguishes the durable source of truth from fast in-memory and resync representations, and that correctly handles every termination condition — including whichever draw rule needs more than the current position. - **Server-authoritative move validation** with a precise, well-ordered step sequence, and a clear position on how legality and the clock interact at the instant a flag could fall. - An **authoritative clock model** that derives live time rather than ticking, handles a flag fall when no move arrives, and — critically — survives a process crash so a fresh owner recomputes the right remaining time; bonus for reasoning rigorously about which time source is trustworthy across a crash. - A credible **reconnect & recovery** story covering gap detection, catch-up vs. fresh snapshot, safe move retries, and how a failed server-side owner is rebuilt (state *and* clock), with honesty about where the recovered clock could be off. - A **scaling strategy** that keeps each game correct and ordered while millions run in parallel — partitioning, connection fan-out, hot/cold storage, and regional placement — plus the cross-region and spectator follow-ups. - Explicit **edge cases and tradeoffs**, stated as costs the candidate has reasoned about rather than glossed (e.g. the price of the durability boundary, the limits of their ordering choice, residual time-source risk, lag compensation as a policy decision). ### Follow-up Questions - How do you keep a player from being charged clock time for the **network round-trip** of delivering their move? Sketch a concrete lag-compensation policy. - A worker holding 50k games crashes. Walk through exactly how a new owner rebuilds each game's position **and** the active player's remaining clock — and where the recovered clock could be off. - Two players are in different regions. Where do you place the game's authoritative owner, and how do you keep *both* players' move latency acceptable? - How would you extend the design to support **spectators** (thousands watching one popular game) without slowing the players' move path?

Quick Answer: This question evaluates proficiency in designing real-time, authoritative backend systems with emphasis on state representation and persistence, strict move validation and ordering, authoritative clock management, failure recovery, and preserving per-game correctness at scale under low-latency constraints.

Related Interview Questions

  • Design Video Generation Orchestration - OpenAI (medium)
  • Design CI/CD Build Caching - OpenAI
  • Design an Instagram-like Feed System - OpenAI (medium)
  • Design Online Chess Matchmaking - OpenAI (hard)
  • Design Android MVVM API Architecture - OpenAI (medium)
OpenAI logo
OpenAI
Apr 22, 2026, 12:00 AM
Backend Engineer
Technical Screen
System Design
14
0

Design an online chess service that supports real-time, two-player games.

Users should be able to create or join a game, make legal moves, and see their opponent's moves with low latency. Games end by checkmate, timeout (flag fall), resignation, draw (agreement, stalemate, threefold repetition, fifty-move rule, or insufficient material), or abandonment. The server is the authority on rules, move ordering, and the clock — clients are never trusted.

Your design should address, in particular:

  1. State representation & persistence — how you model and store a game's position and history.
  2. Move validation & ordering — how the server rejects illegal/out-of-turn moves and keeps the two players' views totally consistent.
  3. Clock accuracy — how each player's remaining time is tracked authoritatively, including a timeout that occurs while a player is simply thinking.
  4. Failure recovery — what happens when a client disconnects, crashes, or reconnects from a different device, and when a server-side game owner fails.
  5. Scaling — how the service handles millions of concurrent games while preserving per-game correctness.

The system must feel real-time: a move should typically reach the opponent in well under a few hundred milliseconds. State any additional assumptions you make.

Constraints & Assumptions

State your own; a reasonable set:

  • Concurrency : design for a peak of ~ 1M1\text{M}1M concurrent games ⇒\Rightarrow⇒ ~ 2M2\text{M}2M live socket connections.
  • Per-game traffic : a blitz game is ~40 moves over a few minutes ≈0.1\approx 0.1≈0.1 moves/sec/game — latency-sensitive, not throughput-bound.
  • Latency target : p99 move round-trip (mover → server → opponent) under ~200 ms within a region.
  • Correctness (strong, per game) : moves are totally ordered; no illegal or out-of-turn move is ever accepted; an acknowledged move is never lost; clocks never silently go negative.
  • Time controls : support base time + optional per-move increment (e.g. 5+3 = 5 min + 3 s/move) and delay variants; the server is the single source of truth for remaining time.
  • Out of scope (leave room, but don't design): anti-cheat / engine detection, rating math (Elo/Glicko), spectators, chat, tournaments, puzzles.

Clarifying Questions to Ask

  • What time controls must we support (bullet/blitz/rapid/classical, increment vs. delay), and is lag compensation a product requirement?
  • How strict is the latency SLA, and is play single-region or must we handle two players in different regions ?
  • On a disconnect, does the absent player's clock keep running, and how long is the grace period before abandonment is declared?
  • Must a player be able to connect from two devices at once (e.g. phone + laptop), and do we need spectator/replay support now or later?
  • What scale are we targeting (peak concurrent games), and what durability guarantee does an accepted move need?
  • Are draws by threefold repetition / fifty-move auto-applied by the server or claimed by a player?

What a Strong Answer Covers

  • A clear functional / non-functional requirements split and back-of-the-envelope sizing, with a defensible conclusion about where the real bottleneck is — and design choices that follow from it rather than generic boxes.
  • A coherent per-game ordering strategy that makes validation, turn order, and clock math race-free, with a clear argument for why it avoids distributed locks or cross-node transactions (any defensible mechanism, not one specific blessed answer).
  • A concrete state/persistence model that distinguishes the durable source of truth from fast in-memory and resync representations, and that correctly handles every termination condition — including whichever draw rule needs more than the current position.
  • Server-authoritative move validation with a precise, well-ordered step sequence, and a clear position on how legality and the clock interact at the instant a flag could fall.
  • An authoritative clock model that derives live time rather than ticking, handles a flag fall when no move arrives, and — critically — survives a process crash so a fresh owner recomputes the right remaining time; bonus for reasoning rigorously about which time source is trustworthy across a crash.
  • A credible reconnect & recovery story covering gap detection, catch-up vs. fresh snapshot, safe move retries, and how a failed server-side owner is rebuilt (state and clock), with honesty about where the recovered clock could be off.
  • A scaling strategy that keeps each game correct and ordered while millions run in parallel — partitioning, connection fan-out, hot/cold storage, and regional placement — plus the cross-region and spectator follow-ups.
  • Explicit edge cases and tradeoffs , stated as costs the candidate has reasoned about rather than glossed (e.g. the price of the durability boundary, the limits of their ordering choice, residual time-source risk, lag compensation as a policy decision).

Follow-up Questions

  • How do you keep a player from being charged clock time for the network round-trip of delivering their move? Sketch a concrete lag-compensation policy.
  • A worker holding 50k games crashes. Walk through exactly how a new owner rebuilds each game's position and the active player's remaining clock — and where the recovered clock could be off.
  • Two players are in different regions. Where do you place the game's authoritative owner, and how do you keep both players' move latency acceptable?
  • How would you extend the design to support spectators (thousands watching one popular game) without slowing the players' move path?

Solution

Show

Submit Your Answer to Earn 20XP

Sign in to leave a comment

Loading comments...

Browse More Questions

More System Design•More OpenAI•More Backend Engineer•OpenAI Backend Engineer•OpenAI System Design•Backend Engineer System Design
PracHub

Master your tech interviews with 8,000+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.