Design driver heat map and discuss consensus
Real-Time Driver Heat Map with Top-K Busiest Cells, Plus Paxos vs. Raft
Context
You are designing a real-time heat map for a ride-hailing platform. Driver apps continuously send location updates. Rider apps (and internal tools) need to visualize where drivers cluster and fetch the top-K busiest cells within a map region/zoom, with low latency.
Requirements
- Real-time ingestion and streaming via WebSocket.
- Partition the map into deterministic cell IDs (e.g., S2/Geohash) that align with map zoom levels.
- Track driver density per cell over a recent time window (e.g., last 1–5 minutes) and serve the top-K busiest cells for a requested viewport.
- Reasonable scale assumptions (tune as needed):
  - 1–5 million active drivers globally.
  - Each driver sends a location every 2–5 seconds.
  - End-to-end latency: P95 <= 1–2 seconds from driver update to client-visible heat change.
- High availability across regions; horizontal scalability.
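The scale assumptions above imply an ingest rate that is worth computing before any sizing decision. A minimal back-of-envelope sketch, assuming every active driver reports exactly once per interval (the function name `updates_per_second` is illustrative, not part of the prompt):

```python
def updates_per_second(active_drivers: int, interval_s: float) -> float:
    """Average location updates per second if each driver reports once per interval."""
    return active_drivers / interval_s


# Low end: fewest drivers at the slowest cadence. High end: most drivers at the fastest.
low = updates_per_second(1_000_000, 5)
high = updates_per_second(5_000_000, 2)
print(f"{low:,.0f} to {high:,.0f} updates/s")  # prints "200,000 to 2,500,000 updates/s"
```

These are global averages; regional peaks and reconnect storms would push individual ingestion nodes above their average share.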
Deliverables
- Design the system: ingestion, partitioning, storage, algorithms to maintain and query top-K per region/viewport, and how WebSocket streaming is used.
- Explain Paxos and Raft consensus algorithms and highlight key differences.
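One possible starting point for the first deliverable is a single-node, in-memory version of the density tracker and viewport top-K query. The sketch below makes several assumptions that are not in the prompt: a plain latitude/longitude grid (`cell_id`) stands in for S2 or Geohash, a driver counts toward the cell of its most recent update only while that update is inside the window, and the names `HeatMap`, `update`, `expire`, and `top_k` are hypothetical.

```python
import heapq
from collections import Counter


def cell_id(lat: float, lng: float, level: int) -> tuple[int, int, int]:
    """Hypothetical grid cell: 2**level columns across 360 degrees of longitude."""
    size = 360.0 / (1 << level)
    return (level, int((lng + 180.0) // size), int((lat + 90.0) // size))


class HeatMap:
    def __init__(self, level: int, window_s: float):
        self.level = level
        self.window_s = window_s
        self.last_seen = {}       # driver_id -> (timestamp, cell)
        self.counts = Counter()   # cell -> number of drivers currently counted in it

    def _decrement(self, cell):
        self.counts[cell] -= 1
        if self.counts[cell] <= 0:
            del self.counts[cell]

    def update(self, driver_id, lat, lng, ts):
        """Move the driver's contribution from its previous cell to its new one."""
        cell = cell_id(lat, lng, self.level)
        prev = self.last_seen.get(driver_id)
        if prev is not None:
            self._decrement(prev[1])
        self.last_seen[driver_id] = (ts, cell)
        self.counts[cell] += 1

    def expire(self, now):
        """Drop drivers whose latest update is older than the window (linear scan)."""
        stale = [d for d, (ts, _) in self.last_seen.items() if now - ts > self.window_s]
        for d in stale:
            _, cell = self.last_seen.pop(d)
            self._decrement(cell)

    def top_k(self, south, west, north, east, k):
        """Return the k busiest (cell, count) pairs whose cell lies inside the viewport."""
        _, x0, y0 = cell_id(south, west, self.level)
        _, x1, y1 = cell_id(north, east, self.level)
        in_view = (
            (c, n) for c, n in self.counts.items()
            if x0 <= c[1] <= x1 and y0 <= c[2] <= y1
        )
        return heapq.nlargest(k, in_view, key=lambda item: item[1])


hm = HeatMap(level=12, window_s=120)
hm.update("d1", 37.7749, -122.4194, ts=0)
hm.update("d2", 37.7750, -122.4195, ts=1)
hm.update("d3", 37.8044, -122.2712, ts=2)
top = hm.top_k(37.70, -122.52, 37.85, -122.20, k=1)
print(top)
```

This version ignores viewports that cross the antimeridian, scans every driver on `expire`, and scans every occupied cell on `top_k`. A production design would likely shard by cell ID, expire with time buckets, and precompute coarser zoom levels, which are the trade-offs a full answer should discuss.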
Constraints & Assumptions
- Preserve the scope, facts, inputs, and requested outputs from the prompt above.
- If the prompt leaves a detail unspecified, state a reasonable assumption before relying on it.
- Keep the answer interview-ready: concise enough to present, but concrete enough to implement or evaluate.
Clarifying Questions to Ask
- Clarify users, core use cases, read/write patterns, scale, latency, availability, and data retention.
- State explicit assumptions before making sizing or architecture decisions.
- Prioritize the functional path first, then address reliability, security, observability, and rollout.
What a Strong Answer Covers
- A scoped requirements summary with concrete non-goals and success metrics.
- API, data model, architecture, consistency, capacity, and operations.
- Reasoned trade-offs among simple and scalable designs, including bottlenecks and failure modes.
- A validation, monitoring, migration, and launch plan appropriate for the risk level.
Follow-up Questions
- What breaks first at 10x traffic or data volume?
- How would you degrade gracefully during dependency failures?
- What metrics and alerts would prove the design is healthy after launch?