Design real-time delivery tracking system

Q: Design real-time delivery tracking system

An open-ended Motive system-design interview question: design a real-time courier delivery-tracking platform covering mobile GPS publishing, streaming ingestion and map matching, route/ETA computation and multi-stop optimization, the order/assignment lifecycle, live and historical map rendering, scale, privacy, and failure recovery. Includes a combined reference solution and evaluation rubric.

Q: What difficulty level is this interview question?

This is a hard difficulty System Design question, commonly asked during Onsite rounds at Motive.

Q: What role is this question designed for?

This question is commonly asked for Software Engineer candidates at Motive during technical interviews.

Question

Design a real-time delivery / courier-tracking platform that gives dispatchers, operations, and customers live visibility into each courier's route, position, and ETA.

The platform must cover the full lifecycle end-to-end: mobile GPS publishing from courier apps, ingestion and stream processing, route and ETA computation, the order/assignment lifecycle, live and historical map rendering, plus the cross-cutting concerns of scale, privacy, and resilience. The system serves three distinct audiences with very different needs:

Couriers — drivers who publish GPS while on shift and receive task assignments and route updates.
Dispatchers / operations — internal users who watch the whole fleet live, plan multi-stop routes, and handle exceptions.
Customers — external users who watch a single delivery (courier position, route, ETA) via a shareable link.

Work through the problem in the parts below.

Constraints & Assumptions

Scale from tens of thousands of concurrent on-shift couriers (assume a peak of ~150k, provision for ~200k) up to millions of deliveries per day (assume ~2M/day, ~45 min active each).
The live map can be loosely fresh: a 1-2 s lag between a GPS fix and the dot moving on screen is acceptable.
Orders and courier assignments must be strict: no double-assignment, no lost or duplicated orders.
Mobile clients run on cellular networks with frequent dropouts and tight battery budgets.
Privacy and access control matter: customers see minimal data, couriers control on/off-shift, and traces have retention limits.
The system must survive mobile-network gaps, the loss of an availability zone, and degradation of individual dependencies.
Treat payments, the consumer marketplace/matching, in-app chat, and proof-of-delivery media pipelines as out of scope except where they touch the data model.

Clarifying Questions to Ask

These are scoping questions to raise up front, before designing any part:

Vehicle profile: commercial trucks/vans (height/weight/turn restrictions) or bikes/pedestrians? This changes the routing engine profile and map-matching tolerances.
Freshness SLO: what is the acceptable end-to-end "GPS fix → visible on map" latency, and the "last-seen freshness" budget for an on-shift courier?
Geographic footprint: single region, multi-region, or global active-active? This drives replication, data residency, and routing-engine deployment.
Multi-stop reality: is dispatch primarily single pickup→dropoff, or genuine multi-stop routes per courier (which pulls in VRP/TSP planning)?
Retention & compliance: how long must precise traces be kept, and which privacy regimes (GDPR/CPRA) apply to redaction and deletion?
ETA accuracy bar: what error tolerance is acceptable for customer-facing ETAs, and is there a separate (tighter) bar for dispatch planning?

Part 1 — Service architecture

Lay out the overall architecture: clients, edge/gateway, core services, the event bus, and the datastores. Justify a streaming / event-driven approach, and explain where (and why) a CQRS split between the write path and read projections is appropriate.

What This Part Should Cover

A clear topology: clients → edge/gateway (TLS, auth, WAF, rate-limit) → core services → event bus → datastores, with a named transactional store separate from the time-series/hot stores.
A defensible justification for event-driven + CQRS rooted in the write/read asymmetry and projection rebuildability, not as buzzwords.
A sensible service decomposition (ingest, location processing, geofence, routing, order, assignment, live-state/fan-out, query/history) with single responsibilities.

Part 2 — Mobile GPS publishing, offline mode & battery

Describe how courier mobile apps collect and publish GPS updates — sampling strategy, batching, compression, sequence numbering, and authentication/anti-spoofing. Then address offline mode and battery constraints: how the client behaves through network dropouts without losing data, and how it keeps GPS power draw sustainable across a full shift.

What This Part Should Cover

Adaptive sampling tied to motion/activity, with batching + compression and a user-controlled on-shift toggle.
Per-device monotonic sequence numbering and its role in idempotency/ordering.
Authentication (device identity) plus server-side plausibility checks as the real anti-spoofing defense (not attestation alone).
A store-and-forward offline buffer with bounded capacity and ordered, deduped replay; graceful UX (last-known + confidence) during gaps.
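
The store-and-forward buffer and per-device sequence numbering described above can be sketched roughly as follows. This is a minimal in-memory illustration, not a production client: the names `Fix`, `OfflineBuffer`, `record`, `next_batch`, and `ack` are hypothetical, and a real app would persist the queue to disk so it survives restarts.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Fix:
    seq_no: int   # per-device monotonic sequence number
    lat: float
    lon: float
    ts: float     # device timestamp in seconds


class OfflineBuffer:
    """Bounded FIFO of unacknowledged fixes; the oldest are dropped when full."""

    def __init__(self, capacity: int) -> None:
        self._queue: deque = deque(maxlen=capacity)
        self._next_seq = 0

    def record(self, lat: float, lon: float, ts: float) -> Fix:
        fix = Fix(self._next_seq, lat, lon, ts)
        self._next_seq += 1
        self._queue.append(fix)  # deque(maxlen=...) evicts the oldest entry
        return fix

    def next_batch(self, max_size: int) -> list:
        # Oldest first, so replay preserves order. Nothing is removed until
        # the server acknowledges it, which makes retries safe.
        return list(self._queue)[:max_size]

    def ack(self, up_to_seq: int) -> None:
        # The server confirmed everything up to and including up_to_seq.
        while self._queue and self._queue[0].seq_no <= up_to_seq:
            self._queue.popleft()
```

Because a batch stays queued until it is acknowledged, the client may resend it after a dropout; the sequence numbers let the server discard the repeats.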

Part 3 — Ingestion & stream processing

Explain how the backend ingests location data and processes it: validation and deduplication, map matching, smoothing, geofencing, and fan-out of updates. Be specific about delivery semantics and ordering.

What This Part Should Cover

A processing topology with idempotent consumers: validate/dedup (monotonic (courier_id, seq_no)), smooth, map-match, geofence, fan-out.
Per-courier ordering via partition key; explicit at-least-once + idempotency-by-key delivery semantics.
Map matching as a probabilistic snap-to-road, smoothing (e.g. Kalman) for jitter, and geofencing with hysteresis.
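
One way to picture at-least-once delivery combined with idempotency by key is a consumer that remembers the highest sequence number it has applied per courier. The sketch below is illustrative only: `DedupFilter` and `accept` are hypothetical names, the state lives in process memory, and it assumes events for one courier arrive in order on a single partition.

```python
class DedupFilter:
    """Drops duplicates and replays using the highest seq_no seen per courier."""

    def __init__(self) -> None:
        self._last_seq: dict = {}

    def accept(self, courier_id: str, seq_no: int) -> bool:
        last = self._last_seq.get(courier_id, -1)
        if seq_no <= last:
            return False  # already applied: a redelivery or client retry
        self._last_seq[courier_id] = seq_no
        return True
```

In a real pipeline this watermark would live in the stream processor's checkpointed state, so a restarted consumer resumes with the same view of what it has already applied.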

Part 4 — Route & ETA computation

Explain how routes and ETAs are computed and updated incrementally for the live leg and for future legs, and how multi-stop route optimization (VRP / TSP with time windows) is solved for dispatch planning.

What This Part Should Cover

Incremental live-leg ETA from current segment speed, plus future-leg estimates that include per-stop service time.
ETA stability (smoothing) and sensible re-route triggers (off-route deviation, ETA drift, closures, inserted stops).
Multi-stop optimization framed as VRPTW (NP-hard) solved with construction + local-search + metaheuristic under a time budget; the single-courier case as TSP-with-time-windows; incremental re-optimization to avoid churning en-route drivers.
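
Incremental re-optimization can be illustrated with a cheapest-insertion step, which adds one stop without reordering the stops already planned. The sketch below makes loud simplifications: it uses straight-line distance on planar coordinates as a stand-in for road-network travel time, it ignores time windows, and `cheapest_insertion` is a hypothetical helper rather than part of any routing library.

```python
import math


def dist(a: tuple, b: tuple) -> float:
    # Straight-line distance; a real system would query the routing engine.
    return math.hypot(a[0] - b[0], a[1] - b[1])


def cheapest_insertion(route: list, stop: tuple) -> tuple:
    """Return (index, added_cost) for inserting stop into route.

    route[0] is the courier's current position and is never displaced;
    the relative order of the existing stops is left unchanged.
    """
    # Baseline option: append the new stop after the last planned stop.
    best_index, best_cost = len(route), dist(route[-1], stop)
    for i in range(len(route) - 1):
        added = (
            dist(route[i], stop)
            + dist(stop, route[i + 1])
            - dist(route[i], route[i + 1])
        )
        if added < best_cost:
            best_index, best_cost = i + 1, added
    return best_index, best_cost
```

A dispatcher-facing planner would typically reject an insertion whose added cost pushes any downstream stop past its time window, and fall back to a fuller re-optimization only in that case.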

Part 5 — Order & assignment lifecycle

Define the APIs and data model for creating orders, assigning/unassigning couriers, and transitioning order state. The central correctness requirement is avoiding double-assignment under concurrent dispatchers (or a human plus an auto-dispatcher).

What This Part Should Cover

An order state machine and the assign/unassign/status-transition APIs.
A double-assignment defense: partial unique index + optimistic concurrency (version) + fencing token, with a clear story for concurrent writers.
Reliable DB↔bus consistency (transactional outbox) and idempotent mutations (Idempotency-Key).
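
The partial unique index plus optimistic-concurrency defense can be demonstrated with SQLite, which supports partial indexes. This is a sketch under stated assumptions: the table layout, the status names `CREATED` and `ASSIGNED`, and the `assign` function are all hypothetical, and fencing tokens and the outbox are left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (
    id      TEXT PRIMARY KEY,
    status  TEXT NOT NULL,
    version INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE assignments (
    order_id   TEXT NOT NULL REFERENCES orders(id),
    courier_id TEXT NOT NULL,
    active     INTEGER NOT NULL DEFAULT 1
);
-- At most one active assignment per order, enforced by the database itself.
CREATE UNIQUE INDEX one_active_assignment
    ON assignments(order_id) WHERE active = 1;
""")


def assign(db: sqlite3.Connection, order_id: str, courier_id: str,
           expected_version: int) -> bool:
    """Assign a courier only if the order is unchanged since it was read."""
    try:
        with db:  # one transaction: commit on success, roll back on error
            cur = db.execute(
                "UPDATE orders SET status = 'ASSIGNED', version = version + 1 "
                "WHERE id = ? AND version = ? AND status = 'CREATED'",
                (order_id, expected_version),
            )
            if cur.rowcount != 1:
                return False  # stale version: another writer got there first
            db.execute(
                "INSERT INTO assignments (order_id, courier_id) VALUES (?, ?)",
                (order_id, courier_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the unique index rejected a second active assignment
```

Two dispatchers who both read version 0 will race on the `UPDATE`; exactly one sees `rowcount == 1`, and the index acts as a second line of defense if application logic is ever bypassed.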

Part 6 — Map rendering & live updates

Explain how live and historical routes are delivered to dispatcher consoles and customer tracking pages. Cover the transport (WebSocket/SSE), vector tiles, polyline simplification, and zoom-aware throttling, plus how historical playback is served.

What This Part Should Cover

Per-entity (courier/order/region) WebSocket/SSE channels gated by access policy, with client-side interpolation between coalesced pushes.
Zoom-aware throttling of both update rate and geometry detail; polyline simplification (e.g. Douglas-Peucker) and vector tiles for the base map.
Historical playback served from the time-series store (stitched, simplified path + stats).
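
Polyline simplification with Douglas-Peucker can be sketched in a few lines. The version below assumes planar coordinates (lat/lon would need projecting first) and measures distance to the infinite line through the two endpoints; `simplify` and `epsilon` are illustrative names, and a zoom-aware server would pick a larger `epsilon` at lower zoom levels.

```python
import math


def _line_distance(p: tuple, a: tuple, b: tuple) -> float:
    """Distance from p to the line through a and b (planar coordinates)."""
    if a == b:
        return math.hypot(p[0] - a[0], p[1] - a[1])
    dx, dy = b[0] - a[0], b[1] - a[1]
    # Parallelogram area divided by base length gives the height.
    return abs(dy * (p[0] - a[0]) - dx * (p[1] - a[1])) / math.hypot(dx, dy)


def simplify(points: list, epsilon: float) -> list:
    """Douglas-Peucker: keep only points farther than epsilon from the chord."""
    if len(points) < 3:
        return list(points)
    index, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = _line_distance(points[i], points[0], points[-1])
        if d > dmax:
            index, dmax = i, d
    if dmax <= epsilon:
        return [points[0], points[-1]]
    left = simplify(points[: index + 1], epsilon)
    right = simplify(points[index:], epsilon)
    return left[:-1] + right  # the split point appears in both halves
```

The same function can serve both live pushes and historical playback, with the tolerance chosen per zoom level so a fleet-wide view sends far fewer vertices than a street-level one.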

Part 7 — Data storage & APIs

Describe the storage tiers: hot last-known state, time-series traces, and cold archival for historical playback — matched to their access patterns and retention. Then specify the public API surface, including the order/assignment endpoints, location ingest, live and historical reads, and streaming subscriptions (e.g. POST /orders, POST /orders/{id}/assign, POST /couriers/{id}/locations, GET /couriers/{id}/trace, GET /orders/{id}/eta, GET /orders/{id}/live).

What This Part Should Cover

A hot / warm / cold tiering (e.g. in-memory last-known, wide-column time-series with a TTL, object-storage Parquet for cold) keyed by partition design and retention, with transactional order data kept separate.
A coherent REST + streaming API surface covering ingest, strongly-consistent order/assignment mutations (with idempotency), live reads, historical reads (paginated, time-range capped), and subscriptions.

Part 8 — Scale, capacity & consistency trade-offs

Produce credible back-of-the-envelope estimates for throughput, storage, and connection counts at the target scale. Then explain the latency-vs-consistency trade-offs — specifically the deliberate split between strong consistency for order/assignment and eventual consistency for the live map.

What This Part Should Cover

Numeric estimates with stated assumptions: updates/s, ingress MB/s, daily event volume, hot/warm storage/day, and peak concurrent WebSocket connections (with a node-count sanity check).
An explicit two-domain consistency model: transactional strong consistency for order/assignment (with a latency budget) vs. bounded-staleness eventual consistency for the live map, plus per-courier ordering and idempotency guarantees.
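
A back-of-the-envelope estimate might look like the sketch below. The courier and delivery figures come from the stated constraints; every other constant (fix interval, bytes per fix, peak-to-average ratio, viewers per delivery, connections per node) is a hypothetical planning assumption that a candidate should state and defend.

```python
import math

# Figures taken from the stated constraints.
PEAK_COURIERS = 150_000
DELIVERIES_PER_DAY = 2_000_000
ACTIVE_MIN_PER_DELIVERY = 45

# Hypothetical planning assumptions, not from the problem statement.
FIX_INTERVAL_S = 4            # one GPS fix per courier every 4 s on average
BYTES_PER_FIX = 100           # payload size after batching and compression
PEAK_TO_AVG = 3               # peak concurrent deliveries vs. daily average
VIEWERS_PER_DELIVERY = 1      # customers watching each active delivery
CONNS_PER_WS_NODE = 50_000    # connections one WebSocket node can hold


def estimate() -> dict:
    updates_per_s = PEAK_COURIERS / FIX_INTERVAL_S
    ingress_mb_per_s = updates_per_s * BYTES_PER_FIX / 1_000_000
    # Average concurrency = total active delivery-minutes / minutes per day.
    avg_live = DELIVERIES_PER_DAY * ACTIVE_MIN_PER_DELIVERY / (24 * 60)
    peak_viewers = avg_live * PEAK_TO_AVG * VIEWERS_PER_DELIVERY
    return {
        "updates_per_s": updates_per_s,
        "ingress_mb_per_s": ingress_mb_per_s,
        "avg_live_deliveries": avg_live,
        "peak_viewers": peak_viewers,
        "ws_nodes": math.ceil(peak_viewers / CONNS_PER_WS_NODE),
    }
```

Under these assumptions the firehose is about 37.5k updates/s and 3.75 MB/s, with roughly 62.5k deliveries live on average and 4 WebSocket nodes before headroom. Ingest bandwidth is therefore modest; fan-out and connection count are the larger concerns.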

Part 9 — Privacy & access control

Explain the privacy and access-control model: data minimization, RBAC/ABAC, ephemeral tracking links, location blurring/redaction, and retention.

What This Part Should Cover

Data minimization (customers see blurred/coarse position + ETA for one order; off-shift location hidden; sensitive-POI redaction).
RBAC/ABAC scoping by role + region + relationship-to-order, and ephemeral signed tracking links that expire.
Retention limits, deletion/erasure handling, and encryption/audit for sensitive access.
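
Ephemeral signed tracking links can be approximated with an HMAC over the order ID and an expiry timestamp. The sketch below is illustrative: the token layout, `make_token`, `verify_token`, and the hard-coded `SECRET` are assumptions for the example, and it does not cover revocation or key rotation.

```python
import base64
import hashlib
import hmac
import time
from typing import Optional

SECRET = b"demo-only-secret"  # hypothetical; load from a secret manager in practice


def _sign(payload: str) -> str:
    digest = hmac.new(SECRET, payload.encode(), hashlib.sha256).digest()
    return base64.urlsafe_b64encode(digest).decode().rstrip("=")


def make_token(order_id: str, ttl_s: int, now: Optional[float] = None) -> str:
    issued = time.time() if now is None else now
    payload = f"{order_id}.{int(issued + ttl_s)}"
    return f"{payload}.{_sign(payload)}"


def verify_token(token: str, now: Optional[float] = None) -> Optional[str]:
    """Return the order_id if the token is authentic and unexpired, else None."""
    try:
        order_id, expires_s, sig = token.rsplit(".", 2)
        expires = int(expires_s)
    except ValueError:
        return None  # malformed token
    expected = _sign(f"{order_id}.{expires_s}")
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(sig.encode(), expected.encode()):
        return None
    current = time.time() if now is None else now
    return order_id if current < expires else None
```

The token grants access to exactly one order, so the tracking endpoint can return the blurred position and ETA for that order without any customer login.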

Part 10 — Failure handling, disaster recovery & monitoring

Explain how the system handles failures and stays observable: client-side resilience, server-side fault tolerance, graceful degradation of dependencies, disaster recovery (RPO/RTO), and monitoring/alerting (SLOs, golden signals, synthetic probes).

What This Part Should Cover

Client resilience (retry/backoff, durable queue, circuit breaker) and server fault tolerance (multi-AZ, DLQ for poison events, log replay to rebuild projections, outbox for reliable emission).
Per-dependency graceful degradation with named fallbacks.
DR with stated RPO/RTO and a rebuild story; SLO-driven monitoring on golden signals + write-to-map latency, plus synthetic end-to-end probes.
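
Client retry with backoff is commonly implemented as capped exponential backoff with full jitter, so that thousands of couriers reconnecting after an outage do not retry in lockstep. The helper below is a minimal sketch; `backoff_delay` and its default values are hypothetical.

```python
import random
from typing import Callable


def backoff_delay(
    attempt: int,
    base_s: float = 0.5,
    cap_s: float = 60.0,
    rng: Callable[[], float] = random.random,
) -> float:
    """Full-jitter backoff: a random delay up to min(cap_s, base_s * 2**attempt)."""
    ceiling = min(cap_s, base_s * (2 ** attempt))
    return rng() * ceiling
```

The `rng` parameter exists only to make the function deterministic under test; callers would normally leave it at the default.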

What a Strong Answer Covers

Across all parts, a strong answer keeps two worlds cleanly separated and reasons about the trade-offs that span them:

The firehose vs. the money path. It threads one consistent story through ingestion, storage, fan-out, and failure recovery: append-only, partition-by-courier, idempotent by (courier_id, seq_no), with CQRS read projections that are rebuildable from the log. Orders and assignments stay strongly consistent end-to-end.
Numbers that constrain decisions. Capacity estimates and SLOs are used to justify choices (why a wide-column time-series, why N WebSocket nodes, why eventual consistency on the map), not stated in isolation.
Commercial-fleet framing. Given the fleet/telematics context, it reasons about a commercial-vehicle routing profile and dispatcher-operations workflows rather than a generic consumer app.
Privacy and resilience as first-class, not bolted on: data minimization and ABAC scoping, store-and-forward + DLQ/replay + graceful degradation, and observability tied to the user-visible latency promise.

Follow-up Questions

A new high-priority order arrives mid-shift for a courier already en route on a planned multi-stop route. How do you insert the stop without re-shuffling — and frustrating — the entire remaining route?
A bad app release starts sending spoofed/teleporting GPS for thousands of couriers. How do your plausibility checks, idempotency, and replay-from-log let you detect, contain, and recover the projections?
The ETA model's error degrades sharply in one city during a holiday. How do you detect this from monitoring, and how do you safely validate a new map-matching/ETA model before cutover?
Walk through what happens to the live map and the order path when an entire availability zone is lost. Which guarantees hold, which degrade, and what does the customer see?
