Compute nearest courier for each customer
Company: DoorDash
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: Medium
Interview Round: Onsite
##### Question
You are given two sets of 2D points: customers `C = {c1..cn}` and active couriers (dashers) `D = {d1..dm}`. Each customer `i` has coordinates `(xi, yi)`. Each courier `j` has coordinates `(aj, bj)` and an integer `id`. For every customer, return the `id` of the nearest courier by distance; on ties (equal distance), return the smallest `id`.
Work through the following parts:
1. Implement a correct brute-force solution and state its time and space complexity.
2. Design and implement (or outline) an approach that beats the `O(n * m)` brute force when `n` and `m` are large (up to `1e5`). Discuss appropriate spatial data structures such as a k-d tree, R-tree, or uniform grid (spatial hashing), and analyze the time and space complexity of your chosen approach.
3. Specify which distance metric you assume — Euclidean on a plane vs. great-circle distance on latitude/longitude — and explain how the choice affects correctness and your data structures.
4. Handle the edge cases: tie-breaking by smallest `id`, duplicate courier locations, and floating-point precision issues when comparing distances.
5. Follow-ups:
- How does your approach change if courier locations update frequently (insert, delete, move)?
- How would you support large batches of customer queries efficiently?
- How would you enforce a constraint such as a maximum pickup radius (a customer with no courier within the radius returns "no courier")?
Quick Answer: A DoorDash software engineer onsite coding question on nearest-neighbor search: for every customer, find the nearest courier by distance, breaking ties by smallest id. It tests brute force vs. spatial indexing (uniform grid, k-d tree, R-tree), time/space complexity, Euclidean vs. great-circle distance, and follow-ups on dynamic courier updates, batch queries, and a maximum pickup radius.
Solution
## Setup and key decisions
This is a **nearest-neighbor (NN) search** problem: for each of `n` customers, find the closest of `m` couriers, breaking ties by smallest courier `id`.
**Distance metric.** Compare **squared** distances, never the raw distance, to avoid `sqrt` cost and the precision loss it introduces:
```
dist2(c, d) = (c.x - d.x)^2 + (c.y - d.y)^2
```
For real DoorDash geography the points are latitude/longitude and the correct metric is the **great-circle (haversine)** distance, not planar Euclidean. State your assumption explicitly. If the city is small you can project to a local planar frame (e.g. equirectangular with a `cos(latitude)` scale on longitude) and use Euclidean as a fast approximation; for correctness at scale use haversine. Most grid/k-d-tree structures still work because haversine is monotone in the chord length over a city-sized region.
## Part 1 - Brute force
For every customer, scan all couriers and keep the minimum, breaking ties by smallest `id`.
```python
def nearest_courier_bruteforce(customers, couriers):
# customers: list of (x, y)
# couriers: list of (a, b, id)
result = []
for (cx, cy) in customers:
best_d2 = None
best_id = None
for (ax, ay, cid) in couriers:
d2 = (cx - ax) ** 2 + (cy - ay) ** 2
if best_d2 is None or d2 < best_d2 or (d2 == best_d2 and cid < best_id):
best_d2 = d2
best_id = cid
result.append(best_id)
return result
```
- **Time:** `O(n * m)`.
- **Space:** `O(1)` extra (besides the `O(n)` output).
At `n = m = 1e5` this is `1e10` operations - far too slow. We need spatial indexing.
## Part 2 - Scalable approach
Build a spatial index over the **couriers** once, then query it for each customer.
**Option A - Uniform grid / spatial hashing (simplest, usually fastest in practice).**
1. Pick a cell size `s` near the typical NN distance, e.g. `s ≈ sqrt(area / m)` so each cell holds ~1 courier on average.
2. Bucket every courier into cell `(floor(a/s), floor(b/s))` using a hash map keyed by cell.
3. For a customer, search its own cell, then expand to surrounding rings of cells. Once you have a candidate at distance `r`, you only need to keep expanding until the ring's minimum possible distance exceeds `r`, then stop.
- **Time:** `O(m)` to build, expected `O(1)` per query for roughly uniform data, so `O(n + m)` overall. Degrades toward `O(m)` per query under heavy clustering.
- **Space:** `O(m)`.
**Option B - k-d tree.**
1. Build a 2-d k-d tree over the couriers: `O(m log m)`.
2. Each NN query is `O(log m)` average (`O(m)` worst case for adversarial/degenerate data).
- **Time:** `O(m log m)` build + `O(n log m)` queries.
- **Space:** `O(m)`.
**Option C - R-tree.** Better than a k-d tree when couriers are inserted/deleted dynamically and for range/radius queries; comparable query asymptotics, higher constants and implementation complexity.
**Recommendation:** a uniform grid for roughly uniform density (common within a city) because of its simplicity and `O(1)`-expected queries; a k-d tree when density varies widely; an R-tree when you need frequent dynamic updates and radius queries.
## Part 4 - Edge cases
- **Tie-breaking:** when two couriers are at equal (squared) distance, return the smaller `id`. With a grid or k-d tree, carry `(d2, id)` and compare lexicographically on `(d2, id)`; this guarantees deterministic output regardless of traversal order.
- **Duplicate locations:** multiple couriers at the same coordinates are fine - they fall in the same grid cell / tree leaf; tie-breaking by `id` selects deterministically.
- **Precision:** compare squared integer/Decimal distances when inputs are integers (exact). With floats, equal-distance ties are fragile; use an epsilon band (`abs(d2a - d2b) <= eps`) before applying the `id` tie-break, or scale to fixed-point. Avoid `sqrt` entirely except for the final reported value.
- **No couriers:** return a sentinel (e.g. `-1`/`None`) per customer.
## Part 5 - Follow-ups
**Frequent updates (insert / delete / move).** A static k-d tree is poor for updates (it gets unbalanced and needs periodic rebuilds). Prefer:
- A **uniform grid / spatial hash**: insert and delete are `O(1)`; a move is delete-from-old-cell + insert-into-new-cell. This is the standard real-time choice for moving fleets.
- An **R-tree** (or R*-tree) if you also need range queries with logarithmic updates.
- For very high update rates, consider geohash buckets in a key-value store (e.g. Redis `GEO` commands) so updates and radius queries are distributed.
**Batch queries.** When many customers query at once, bulk-process them: sort/bucket customers by grid cell so couriers loaded for one cell are reused across nearby customers (cache locality), or build a k-d tree over couriers and run queries in parallel - the index is read-only so queries shard cleanly across threads/workers.
**Maximum pickup radius `R`.** This becomes a bounded NN / radius query: only consider couriers within `R`. With a grid, restrict ring expansion to cells overlapping the disk of radius `R` and stop early; with an R-tree, run a range query on the `R`-disk's bounding box then filter. If no courier lies within `R`, return "no courier" for that customer. The radius cap also speeds queries because it caps how far you must search.
Explanation
Rubric: (1) a correct brute force with the right tie-break (smallest id) and O(n*m) analysis; (2) a spatial index over couriers - uniform grid (O(n+m) expected), k-d tree (O((n+m) log m)), or R-tree - with honest worst-case caveats; (3) an explicit distance-metric choice (Euclidean projection vs. great-circle/haversine for real lat/lng) and use of squared distance to avoid sqrt; (4) deterministic handling of ties, duplicate points, and float precision; (5) follow-ups that pick a grid/R-tree for dynamic fleet updates, batch queries by spatial locality, and reduce the max-radius case to a bounded range query. Strong candidates compare distances squared and reason about clustering degrading per-query cost.