This question evaluates a candidate's ability to design real-time distributed systems for geospatial aggregation, covering skills in spatial partitioning, high-throughput streaming ingestion, low-latency cell counting, WebSocket-based client updates, and top-K ranking within a fault-tolerant multi-region architecture.
Design a real-time driver heatmap for a ride-hailing platform. Partition the city into grid cells (e.g., geohash), ingest driver pings, and stream updates to clients over WebSockets. Compute and publish the top-K hottest cells per region with low latency. Detail the data model, aggregation strategy (edge vs. server), update frequency, scalability, fault tolerance, and how clients subscribe/unsubscribe.
Quick Answer: This question evaluates a candidate's ability to design real-time distributed systems for geospatial aggregation, covering skills in spatial partitioning, high-throughput streaming ingestion, low-latency cell counting, WebSocket-based client updates, and top-K ranking within a fault-tolerant multi-region architecture.
System Design: Real-Time Driver Heatmap and Top-K Hottest Cells
Context
You are building a real-time driver heatmap for a ride-hailing platform that visualizes driver density across a city. The city is partitioned into spatial grid cells (e.g., geohash / H3 / S2). Driver apps send periodic GPS location pings, and the backend must aggregate these into live per-cell driver counts and stream updates to rider and ops clients over WebSockets.
In addition to the heatmap itself, the system must compute and publish the top-K hottest cells per region with low latency.
Constraints & Assumptions
Design for city-scale load and anchor your design to these numbers:
Active drivers:
hundreds of thousands (low-single-digit hundreds of thousands; e.g., ~300k).
Ping cadence:1–5 location pings per driver per minute.
End-to-end latency:sub-second to low-seconds
(ping → visible update on clients).
Deployment:multi-region
.
Workload shape:
write-heavy ingest from drivers; read-heavy, fan-out-heavy consumption from many rider/ops clients.
Accuracy class:
this is a
visualization
surface, not billing — small transient inaccuracy is acceptable, but
persistent
drift is not.
State any additional assumptions you make explicitly before you start designing.
Clarifying Questions to Ask
Before diving in, confirm the scope with the interviewer:
What does
"active driver"
mean precisely — pinged within some TTL, currently online, or available-for-dispatch? Does each driver count in exactly one cell?
What is the
read:write ratio
and the expected number of concurrent
viewers
(rider apps + ops dashboards) per region?
How
fresh
must the heatmap be — is a sub-second tail required, or is low-seconds acceptable, and what is the latency budget broken down across ingest vs. fan-out?
What
zoom levels / cell resolutions
must the client support, and does the viewport change frequently (panning/zooming)?
How large is
K
, and how is a
"region"
defined (administrative boundary vs. a coarser grid resolution)?
What are the
consistency requirements
— is convergent/eventually-correct counting acceptable, or must counts be exact at every instant?
What a Strong Answer Covers
The interviewer is looking for these signals (these are dimensions to address, not the answers):
A
capacity estimate
: pings/s, count-changing events/s, in-memory state size, and connection/fan-out volume — and noticing which of these is the real bottleneck.
A clear
data model
for pings, per-driver presence, per-cell counts, regional top-K, and subscriptions.
A principled choice of
spatial index
with justified tradeoffs.
A separation of the
write path
(ingest → aggregate) from the
read/fan-out path
(stream → clients).
A coherent story for
where aggregation happens
(edge vs. server) and
why
.
Scalability
reasoning: partitioning keys, hot-cell/hot-partition handling, and WebSocket fan-out that scales with connections.
Fault tolerance & correctness invariants
that survive retries, reconnects, and node failures (idempotency, ordering, replay).
Sensible
tradeoffs
stated explicitly, with the candidate able to defend each choice.
Part 1 — Spatial partitioning & data model
Partition the city into fixed-size grid cells (e.g., geohash / H3 / S2) and define the data model: the entities and state you maintain for pings, per-driver presence, per-cell counts, regional top-K, and subscriptions.
Part 2 — Ingest & aggregation (edge vs. server)
Ingest driver pings and compute the count of active drivers per cell in near real time. Address where the work happens (edge vs. server) and how you turn raw pings into per-cell counts. Define what "active" means (e.g., a TTL).
Part 3 — Streaming to clients & subscription management
Stream cell updates to clients over WebSockets, including subscribe / unsubscribe flows. Cover update frequency (ping cadence and how/when updates are pushed) and client subscription management — how clients subscribe, unsubscribe, and handle viewport changes and reconnection.
Part 4 — Top-K hottest cells per region
Compute and publish the top-K hottest cells per region with low latency, updating as counts change.
Follow-up Questions
Be ready for the interviewer to push deeper:
A downtown rush-hour cell becomes a hot partition.
How do you shard a single hot cell across multiple sub-partitions so a driver's
+1
(enter) and later
−1
(move-out/expiry) still reconcile correctly? What property must the sub-key have, and what goes wrong if you pick it naively (e.g., round-robin)?
What is your correctness story under at-least-once delivery and reconnects?
State the invariants that prevent double-counting a move, and explain whether counts are atomic at every instant or merely convergent.
How does this change at 10–100x scale
(a 1M-driver mega-region, or 1M concurrent viewers)? What is the
first
thing that breaks, and is it on the ingest path or the fan-out path?
GPS jitter at a cell boundary
makes a stationary driver flip-flop
±1
. How do you stop the resulting churn without hurting freshness?