Design a Real-Time Driver Heatmap (500k events/sec)
You are designing a real-time heatmap service for a ride‑hailing app. Mobile clients send frequent location pings. The backend partitions the map into fixed grid cells (cell_id) and must maintain per-cell counts and top‑K busiest cells.
Assume a fixed grid (for example, S2 or geohash at a level that produces 200–500m cells), mobile pings include device_id, lat/lon, event_time, and you can append server_receive_time.
Functional Requirements
-
Ingest 500k location events/sec from mobile clients.
-
Maintain per-cell counts with 1-second cadence.
-
Compute and publish every second:
-
Global top‑K busiest cells.
-
Per‑area top‑K busiest cells (e.g., city/region or map tiles).
-
Deliver updates to connected users via WebSockets.
Non-Functional Requirements
-
Ingestion: load balancing, backpressure, and deduplication.
-
State store: in‑memory for low latency and durable for recovery (exactly-once if feasible).
-
Windowing and late data handling using event time and watermarks.
-
Scalable top‑K computation across partitions.
-
Subscription management for per‑area updates (e.g., viewport/city/tile).
-
Fault tolerance across AZs/regions; graceful degradation.
-
Cost controls and autoscaling.
What to Deliver
Describe the end‑to‑end system design and choices for:
-
Ingestion pipeline and load shedding/backpressure.
-
Data partitioning strategy.
-
Stream processor architecture and state stores (in‑memory + durable).
-
Windowing, watermarks, and late data/corrections.
-
Global and per‑area top‑K algorithms at scale.
-
WebSocket delivery and subscription management.
-
Fault tolerance, recovery, and idempotency.
-
Cost levers and operational guardrails.