System Design: Real-Time Delivery Operations Dashboard
Context
Design a real-time operations dashboard for a two-sided delivery platform. Internal ops staff use the dashboard to monitor live driver locations, order statuses, ETAs, and aggregate metrics. Assume a peak of up to 100k concurrent drivers, each sending 1–2 location updates per second; mealtime traffic spikes of 2–3×; and a multi-region deployment.
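The scale assumptions above translate into a rough ingest budget. The Python sketch below multiplies them out. Two inputs are illustrative assumptions and are not part of the brief: the mealtime multiplier is applied on top of the per-driver update rate, and each serialized location update is taken to be about 100 bytes.

```python
# Back-of-envelope ingest estimate for driver location updates.
# Assumptions (not from the brief): the mealtime multiplier applies on top of
# the per-driver update rate, and each serialized update is about 100 bytes.
from dataclasses import dataclass


@dataclass(frozen=True)
class LoadScenario:
    drivers: int              # concurrent drivers
    updates_per_sec: float    # location updates per driver per second
    spike_multiplier: float   # mealtime traffic multiplier
    bytes_per_update: int     # hypothetical serialized payload size

    def events_per_sec(self) -> float:
        return self.drivers * self.updates_per_sec * self.spike_multiplier

    def megabytes_per_sec(self) -> float:
        return self.events_per_sec() * self.bytes_per_update / 1_000_000


baseline = LoadScenario(drivers=100_000, updates_per_sec=1.0,
                        spike_multiplier=1.0, bytes_per_update=100)
worst_case = LoadScenario(drivers=100_000, updates_per_sec=2.0,
                          spike_multiplier=3.0, bytes_per_update=100)

for name, s in [("baseline", baseline), ("worst case", worst_case)]:
    print(f"{name}: {s.events_per_sec():,.0f} events/s, {s.megabytes_per_sec():.0f} MB/s")
# baseline: 100,000 events/s, 10 MB/s
# worst case: 600,000 events/s, 60 MB/s
```

Under these assumptions the byte rate is modest; the event rate is the figure more likely to drive partition counts and consumer sizing.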
Requirements
Describe your design across these dimensions:
- Data sources: Mobile apps (driver, consumer) and backend services (orders, dispatch/assignment, ETA service).
- Ingestion and transport: Event streaming vs. webhooks; protocols (gRPC/HTTP); serialization; schema evolution.
- Update frequency and latency targets: Target update rates for locations and status changes; end-to-end latency SLOs/SLAs.
- Storage models: Hot (in-memory), warm, and cold stores; data retention and TTLs.
- Indexing and data access: Keys and geospatial indexing for fast viewport queries; secondary indexes for driver/order lookups.
- Frontend APIs: Real-time push (WebSockets/SSE) and snapshot REST APIs; filtering by region/viewport.
- Caching: Server- and client-side caching; tile/viewport-level aggregation.
- Deduplication and out-of-order handling: Idempotency keys, sequence numbers, watermarking/windows, and last-write-wins rules.
- Consistency vs. availability: Which views require strong vs. eventual consistency; fallback behavior during network partitions.
- Partitioning and scaling: Topic/table sharding keys; handling traffic spikes and hotspots.
- Fault tolerance and backpressure: Replay, retries, batching, load shedding, and graceful degradation.
- Monitoring and alerting: SLOs, lag metrics, error budgets, and traces; dashboards and alerts.
- Cost controls: Downsampling, compression, TTLs, autoscaling, and multi-tier storage.
- Performance: How you would reduce the time complexity of computations and the end-to-end latency of real-time views.
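As one illustration of the deduplication and out-of-order rules listed above, the following Python sketch keeps the latest known position per driver. It assumes each update carries a hypothetical event_id idempotency key and a per-driver seq counter that the driver app increments; the class and field names are invented for this example, and a real deployment would keep this state in a shared hot store rather than in a single process.

```python
# Minimal sketch: latest-position store with dedup and out-of-order handling.
# Assumptions: event_id (idempotency key) and seq (per-driver counter set by
# the driver app) are hypothetical fields, not part of any real API.
from __future__ import annotations

from collections import OrderedDict
from dataclasses import dataclass


@dataclass(frozen=True)
class LocationUpdate:
    event_id: str   # idempotency key; redelivered copies share it
    driver_id: str
    seq: int        # per-driver sequence number, increases with each update
    lat: float
    lng: float


class LatestLocationStore:
    def __init__(self, max_seen_ids: int = 100_000) -> None:
        self._latest: dict[str, LocationUpdate] = {}
        # Bounded, insertion-ordered record of recently seen event IDs.
        self._seen: OrderedDict[str, None] = OrderedDict()
        self._max_seen = max_seen_ids

    def apply(self, u: LocationUpdate) -> str:
        # 1. Idempotency: drop exact redeliveries of a recent event.
        if u.event_id in self._seen:
            return "duplicate"
        self._seen[u.event_id] = None
        if len(self._seen) > self._max_seen:
            self._seen.popitem(last=False)  # evict the oldest ID
        # 2. Last-write-wins by sequence number, not arrival order, so a
        #    late-arriving older update cannot overwrite a newer position.
        current = self._latest.get(u.driver_id)
        if current is not None and u.seq <= current.seq:
            return "stale"
        self._latest[u.driver_id] = u
        return "applied"

    def get(self, driver_id: str) -> LocationUpdate | None:
        return self._latest.get(driver_id)


store = LatestLocationStore()
newer = LocationUpdate("e1", "d42", seq=7, lat=37.77, lng=-122.42)
older = LocationUpdate("e0", "d42", seq=6, lat=37.76, lng=-122.41)
print(store.apply(newer))    # applied
print(store.apply(newer))    # duplicate
print(store.apply(older))    # stale (arrived late)
print(store.get("d42").seq)  # 7
```

The bounded ID set only catches recent exact duplicates; an older redelivery whose ID has been evicted is still rejected by the sequence check, so the two rules back each other up.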