Design a Superhero Dispatch System
Company: Stripe
Role: Software Engineer
Category: System Design
Difficulty: medium
Interview Round: Onsite
Design the backend for a **superhero rescue marketplace** — a dispatch platform that connects civilians in distress with nearby superheroes.
Civilians report emergencies (accidents, fires, crimes). For each report the platform must create a **rescue request**, identify appropriate nearby superheroes, notify them, let **exactly one** hero accept the request and travel to the incident, and let dispatchers monitor request status end to end. The problem is structurally similar to ride-hailing dispatch (Uber/Lyft), but with two important differences: request volume is far **lower** than a major ride-sharing platform, and **reliability matters more than maximizing throughput** — a dropped or double-assigned emergency is a much worse failure than a dropped ride request.
Focus on **system architecture and technical tradeoffs**, not on REST/RPC endpoint design. The interviewer's primary signal is the strength of your *justification* for each technology and consistency choice. Your design should address: functional and non-functional requirements; the main services and the storage chosen for each; the data model for civilians, heroes, incidents, offers, and assignments; how hero location and availability are tracked and queried by radius; matching and dispatch logic; the incident state machine from creation to completion; concurrency control for the core acceptance race; notifications, offer timeouts, retries, and failure handling; and observability and operational concerns.
```hint Where to start
Different parts of this system have very different correctness needs. Before picking any storage engine, sort your data into two buckets: the one piece you absolutely cannot get wrong (the final assignment) versus the pieces you can afford to have slightly stale (live location, dashboards, analytics). Most later choices fall out of that split.
```
```hint The core race
Two heroes tapping "accept" at the same instant is the crux. A two-step "check if still open, then write" has a window between the check and the write where both can pass. Think about whether acceptance can instead be expressed as a *single* operation that can only succeed for one hero — and how a hero would learn whether they won or lost. Then be ready to argue *which component* should be the authority that enforces this (an app-level lock? the datastore itself?) and why.
```
```hint Location vs. truth
Hero GPS is high-churn and only the freshest value matters. Consider whether the structure that efficiently answers "who is near this incident *right now*?" has to be the same store you trust at commit time, or whether you can separate the read path used for *matching* from the transactional store used for *assignment* — and what each kind of store is good at.
```
```hint Don't lose the request
For an emergency platform the nastiest failure is the quiet one: the request persists but the downstream dispatch work never kicks off. Notice that "write the incident to the DB" and "tell the rest of the system to start dispatching" are *two separate side effects* — if they aren't tied together, one can succeed while the other silently fails. Think about how to make those two outcomes share a single fate rather than relying on a best-effort write-then-publish.
```
### Constraints & Assumptions
State your own numbers, but a reasonable working set is:
- ~1,000,000 registered civilians; ~10,000 heroes active in a large metro area.
- Peak load ~500 new incidents per minute metro-wide (far below ride-hailing's millions/hour) — the system is **write-light but correctness-critical**; do not over-engineer for millions of QPS.
- Hero location pings every 5–15 seconds while on duty.
- Target time-to-first-offer: low single-digit **seconds**; offer-acceptance window ~10–20 seconds.
- A single-hero assignment is the default; multi-hero incidents are an extension, not the base case.
- **Hard correctness invariant:** a single-hero incident must end up assigned to **at most one** hero, even under concurrent acceptances and partial failures.
- Eventual consistency is acceptable for live-location display, dashboards, and analytics; **strong consistency is required for the final assignment**.
### Clarifying Questions to Ask
- Can a single incident require **multiple heroes** (e.g. a large fire needing a medic plus a fire-rescue hero), or is it always one hero per incident in scope?
- What does "nearby" mean — straight-line radius, or estimated travel time accounting for the hero's mode (ground vs. flight)?
- How aggressively may we contact a hero — push-only, or escalate to SMS / voice for severe incidents — and is there an SLA for reaching one?
- Is there a **human dispatcher** in the loop as a fallback, and can they manually override, cancel, or reassign?
- What are the relative consequences of a *false* dispatch (sending a hero who isn't really needed) versus a *missed* dispatch?
- What is the data-retention / auditability requirement for incidents (regulatory or post-incident review)?
### What a Strong Answer Covers
- A clearly stated **consistency boundary** — strong for assignment, eventual for location/analytics — with the reliability-over-throughput rationale made explicit.
- **Service decomposition** with sensible boundaries and matching storage choices: relational for transactional entities, an in-memory geospatial index for live locations, a durable log/queue for event-driven workflow — not one database for everything, nor a microservice per noun.
- A correct, **race-free assignment mechanism** that survives concurrency *and* partial failure, with an explanation of *why* it is correct (atomicity, idempotency).
- A complete **incident state machine** with validated transitions and terminal states, owned centrally, including the unhappy paths (timeout, decline, cancel, hero drop-off, no hero found).
- **Reliable event publication** (idempotency keys / dead-letter queues / a write that can't lose its follow-on event) rather than fire-and-forget messaging.
- A **dispatch strategy** (sequential vs. small-batch, radius expansion, ranking signals) with the latency-vs-double-assignment tradeoff named.
- **Operational maturity**: stuck-incident detection, dispatcher override, metrics, alerts, location-freshness monitoring.
- Sensible **scoping** — explicitly *not* over-engineering for ride-share-scale QPS, given the stated lower volume — and an explicit **justification** for each major technology choice, since that is the stated weak point the interviewer probes.
### Follow-up Questions
- A city-wide disaster causes a 50× burst of incidents in one neighborhood while the rest of the metro is quiet. What degrades first, and how do you keep the system from melting down?
- Extend the design to **multi-hero incidents** that need several heroes with complementary skills arriving together — how do matching, offers, and the state machine change, while still guaranteeing no hero is double-booked?
- An assigned hero goes **unreachable mid-rescue** (missed heartbeats) and never marks the incident resolved. How do you detect this and safely reassign without double-dispatching or stranding the civilian?
- You must add a machine-learned model that predicts each hero's acceptance probability and ETA to improve ranking. Where does it sit in the architecture, and how do you keep matching latency in seconds?
Quick Answer: This question evaluates a candidate's ability to design reliable, distributed backend systems with emphasis on transactional correctness, concurrency control for single-winner acceptance races, geo-spatial querying for proximity matching, eventing/notification pipelines, and operational observability for mission-critical dispatch.