Scaling a Collision-Free Rover Fleet to a Million Robots
Company: Shopify
Role: Software Engineer
Category: System Design
Interview Round: Technical Screen
In an earlier round you built a single-process simulator that moves a handful of rovers across a shared grid by applying `L`/`R`/`M` commands, where no two rovers may occupy the same cell. Now design the backend service that runs this simulation at scale.
You operate a fleet of **1,000,000+ active rovers** moving on a large shared 2D grid (a warehouse-style floor map). Each rover streams in movement commands (`L` = turn left, `R` = turn right, `M` = move forward one cell) that the service must apply to keep an authoritative, **collision-free** view of every rover's position and heading. Clients (dashboards, routing/planning services, and the rovers themselves) need to read current positions and be told whether a requested move succeeded or was blocked.
Design the architecture: how commands are ingested and applied, how rover and grid state are stored and sharded, how you keep movement collision-free across machines, and how the system stays correct and available under failures.
### Constraints & Assumptions
State your own where the prompt is silent; reasonable defaults:
- **Scale:** 1M–5M active rovers. Grid on the order of $10^4 \times 10^4$ cells (~$10^8$ cells), large enough that rovers are sparse on average but can cluster (aisles, charging stations, doorways).
- **Throughput:** each rover issues up to a few commands/second → peak on the order of **1–5M command-applications/second**, bursty around shift changes and congestion points.
- **Latency:** apply a command and acknowledge success/blocked within ~tens of milliseconds (p99). Position reads should be near-real-time.
- **Invariant:** at most one rover per cell at any instant; an `M` into an occupied or off-grid cell is rejected (rover stays put, ack says "blocked").
- **Durability:** a rover's authoritative position must survive process/host failure; the system must not "lose" or "teleport" rovers.
- A rover applies its own commands in order; different rovers are independent except when they contend for the same cell.
### Clarifying Questions to Ask
- Is the grid static, or do obstacles / walls / no-go zones change at runtime?
- Are commands the source of truth, or is there a higher-level path planner I must coordinate with (i.e., can I reject/replan, or only accept/block)?
- What is the read/write ratio, and who reads — only dashboards, or do rovers poll neighbors' positions to plan locally?
- When an `M` is blocked, is the desired behavior "drop and continue with the next command" (as in the coding round) or "retry until clear"?
- What are the durability/consistency expectations — is a few-hundred-ms-stale position read acceptable for dashboards while writes stay strongly consistent?
- Multi-region, or single datacenter (does grid geography map to physical sites)?
### Part 1 — Ingestion and state model
Design how commands flow into the system and where rover/grid state lives. Specify the data model for a rover and for cell occupancy, the partitioning/sharding scheme, and how you sustain millions of command-applications per second without losing acknowledgements.
```hint Where to start
Separate the durable log of incoming commands from the materialized "current state." A partitioned append-only stream (keyed for ordering) in front of stateful workers lets you absorb bursts and replay.
```
```hint Sharding axis
You have two natural keys: rover id and grid region. Think about which one each piece of state should be sharded by, given that the collision invariant is about *cells*, not rovers.
```
#### What This Part Should Cover
```premium-lock What This Part Should Cover
```
### Part 2 — Collision-free movement across machines
The "one rover per cell" invariant is global, but state is spread over many machines. Design how a move is decided and committed so two rovers can never end up in the same cell, including when the two contending rovers (or the source and destination cells) live on different shards. Address moves that cross a partition boundary.
```hint Make the cell the authority
Give exactly one owner the right to grant a cell. If a single shard owns a contiguous grid region, most moves are local and serialize cheaply within that owner; only boundary-crossing moves need cross-shard coordination.
```
```hint Cross-boundary moves
A move from region A into region B touches two owners. Reach for a small, well-understood commit protocol (reserve destination, then release source) and think about ordering reservations to avoid deadlock and about what happens if step two fails.
```
#### Clarifying Questions for this Part
- Is it acceptable for a blocked move to occasionally be retried later, or must every command get a single definitive accept/block decision?
- Can a rover be momentarily "in flight" between two cells, or must occupancy flip atomically?
#### What This Part Should Cover
```premium-lock What This Part Should Cover
```
### Part 3 — Consistency, fault tolerance, and observability
Design for failure. Cover exactly-once application of `M` (so a retried command does not move a rover twice), recovery of a shard owner after a crash without losing or duplicating rovers, and the signals you'd monitor to catch correctness or capacity problems.
```hint Idempotency
Attach a monotonic sequence number or command id per rover and persist the last-applied id with the rover state, so replays after a crash are no-ops.
```
#### What This Part Should Cover
```premium-lock What This Part Should Cover
```
### What a Strong Answer Covers
```premium-lock What a Strong Answer Covers
```
### Follow-up Questions
- A single doorway cell becomes a severe hotspot — thousands of rovers per minute funnel through it. How does your Part 2 design behave, and how would you mitigate the contention without violating the invariant?
- The grid is re-partitioned at runtime (you split a hot region into two shards). How do you migrate ownership and in-flight reservations without dropping the invariant or stalling movement?
- Suppose a small amount of position staleness for *reads* is acceptable but writes must stay strongly consistent. How would you serve 100k+ dashboard/query reads per second cheaply without overloading the authoritative shards?
- How would your design change if rovers could request multi-cell moves or reserve a short path ahead (lookahead) instead of one cell at a time?
Quick Answer: This system design question evaluates a candidate's ability to scale a stateful, invariant-preserving service across distributed machines, using a shared-grid rover simulation as the concrete scenario. It tests sharding strategy, cross-shard coordination, and fault-tolerant state management, commonly probed in senior backend interviews to assess distributed systems reasoning beyond basic CRUD scaling.