How do I approach System Design interview questions?

System Design questions require understanding of core concepts and practice. PracHub provides solutions with explanations to help you master system design interviews.

What difficulty level is this interview question?

This is a medium difficulty System Design question, commonly asked during Onsite rounds at Stripe.

What role is this question designed for?

This question is commonly asked for Software Engineer candidates at Stripe during technical interviews.

Design a Superhero Dispatch System | Stripe Interview Question

Q: Design a Superhero Dispatch System

This question evaluates a candidate's ability to design reliable, distributed backend systems with emphasis on transactional correctness, concurrency control for single-winner acceptance races, geo-spatial querying for proximity matching, eventing/notification pipelines, and operational observability for mission-critical dispatch.

Design the backend for a superhero rescue marketplace — a dispatch platform that connects civilians in distress with nearby superheroes.

Civilians report emergencies (accidents, fires, crimes). For each report the platform must create a rescue request, identify appropriate nearby superheroes, notify them, let exactly one hero accept the request and travel to the incident, and let dispatchers monitor request status end to end. The problem is structurally similar to ride-hailing dispatch (Uber/Lyft), but with two important differences: request volume is far lower than a major ride-sharing platform, and reliability matters more than maximizing throughput — a dropped or double-assigned emergency is a much worse failure than a dropped ride request.

Focus on system architecture and technical tradeoffs, not on REST/RPC endpoint design. The interviewer's primary signal is the strength of your justification for each technology and consistency choice. Your design should address: functional and non-functional requirements; the main services and the storage chosen for each; the data model for civilians, heroes, incidents, offers, and assignments; how hero location and availability are tracked and queried by radius; matching and dispatch logic; the incident state machine from creation to completion; concurrency control for the core acceptance race; notifications, offer timeouts, retries, and failure handling; and observability and operational concerns.

Constraints & Assumptions

State your own numbers, but a reasonable working set is:

~1,000,000 registered civilians; ~10,000 heroes active in a large metro area.
Peak load ~500 new incidents per minute metro-wide (far below ride-hailing's millions/hour) — the system is write-light but correctness-critical ; do not over-engineer for millions of QPS.
Hero location pings every 5–15 seconds while on duty.
Target time-to-first-offer: low single-digit seconds ; offer-acceptance window ~10–20 seconds.
A single-hero assignment is the default; multi-hero incidents are an extension, not the base case.
Hard correctness invariant: a single-hero incident must end up assigned to at most one hero, even under concurrent acceptances and partial failures.
Eventual consistency is acceptable for live-location display, dashboards, and analytics; strong consistency is required for the final assignment .

Clarifying Questions to Ask

Can a single incident require multiple heroes (e.g. a large fire needing a medic plus a fire-rescue hero), or is it always one hero per incident in scope?
What does "nearby" mean — straight-line radius, or estimated travel time accounting for the hero's mode (ground vs. flight)?
How aggressively may we contact a hero — push-only, or escalate to SMS / voice for severe incidents — and is there an SLA for reaching one?
Is there a human dispatcher in the loop as a fallback, and can they manually override, cancel, or reassign?
What are the relative consequences of a false dispatch (sending a hero who isn't really needed) versus a missed dispatch?
What is the data-retention / auditability requirement for incidents (regulatory or post-incident review)?

What a Strong Answer Covers

A clearly stated consistency boundary — strong for assignment, eventual for location/analytics — with the reliability-over-throughput rationale made explicit.
Service decomposition with sensible boundaries and matching storage choices: relational for transactional entities, an in-memory geospatial index for live locations, a durable log/queue for event-driven workflow — not one database for everything, nor a microservice per noun.
A correct, race-free assignment mechanism that survives concurrency and partial failure, with an explanation of why it is correct (atomicity, idempotency).
A complete incident state machine with validated transitions and terminal states, owned centrally, including the unhappy paths (timeout, decline, cancel, hero drop-off, no hero found).
Reliable event publication (idempotency keys / dead-letter queues / a write that can't lose its follow-on event) rather than fire-and-forget messaging.
A dispatch strategy (sequential vs. small-batch, radius expansion, ranking signals) with the latency-vs-double-assignment tradeoff named.
Operational maturity : stuck-incident detection, dispatcher override, metrics, alerts, location-freshness monitoring.
Sensible scoping — explicitly not over-engineering for ride-share-scale QPS, given the stated lower volume — and an explicit justification for each major technology choice, since that is the stated weak point the interviewer probes.

Follow-up Questions

A city-wide disaster causes a 50× burst of incidents in one neighborhood while the rest of the metro is quiet. What degrades first, and how do you keep the system from melting down?
Extend the design to multi-hero incidents that need several heroes with complementary skills arriving together — how do matching, offers, and the state machine change, while still guaranteeing no hero is double-booked?
An assigned hero goes unreachable mid-rescue (missed heartbeats) and never marks the incident resolved. How do you detect this and safely reassign without double-dispatching or stranding the civilian?
You must add a machine-learned model that predicts each hero's acceptance probability and ETA to improve ranking. Where does it sit in the architecture, and how do you keep matching latency in seconds?

Design the backend for a superhero rescue marketplace — a dispatch platform that connects civilians in distress with nearby superheroes.

Constraints & Assumptions

State your own numbers, but a reasonable working set is:

~1,000,000 registered civilians; ~10,000 heroes active in a large metro area.
Peak load ~500 new incidents per minute metro-wide (far below ride-hailing's millions/hour) — the system is write-light but correctness-critical ; do not over-engineer for millions of QPS.
Hero location pings every 5–15 seconds while on duty.
Target time-to-first-offer: low single-digit seconds ; offer-acceptance window ~10–20 seconds.
A single-hero assignment is the default; multi-hero incidents are an extension, not the base case.
Hard correctness invariant: a single-hero incident must end up assigned to at most one hero, even under concurrent acceptances and partial failures.
Eventual consistency is acceptable for live-location display, dashboards, and analytics; strong consistency is required for the final assignment .

Clarifying Questions to Ask

Can a single incident require multiple heroes (e.g. a large fire needing a medic plus a fire-rescue hero), or is it always one hero per incident in scope?
What does "nearby" mean — straight-line radius, or estimated travel time accounting for the hero's mode (ground vs. flight)?
How aggressively may we contact a hero — push-only, or escalate to SMS / voice for severe incidents — and is there an SLA for reaching one?
Is there a human dispatcher in the loop as a fallback, and can they manually override, cancel, or reassign?
What are the relative consequences of a false dispatch (sending a hero who isn't really needed) versus a missed dispatch?
What is the data-retention / auditability requirement for incidents (regulatory or post-incident review)?

What a Strong Answer Covers

A clearly stated consistency boundary — strong for assignment, eventual for location/analytics — with the reliability-over-throughput rationale made explicit.
Service decomposition with sensible boundaries and matching storage choices: relational for transactional entities, an in-memory geospatial index for live locations, a durable log/queue for event-driven workflow — not one database for everything, nor a microservice per noun.
A correct, race-free assignment mechanism that survives concurrency and partial failure, with an explanation of why it is correct (atomicity, idempotency).
A complete incident state machine with validated transitions and terminal states, owned centrally, including the unhappy paths (timeout, decline, cancel, hero drop-off, no hero found).
Reliable event publication (idempotency keys / dead-letter queues / a write that can't lose its follow-on event) rather than fire-and-forget messaging.
A dispatch strategy (sequential vs. small-batch, radius expansion, ranking signals) with the latency-vs-double-assignment tradeoff named.
Operational maturity : stuck-incident detection, dispatcher override, metrics, alerts, location-freshness monitoring.
Sensible scoping — explicitly not over-engineering for ride-share-scale QPS, given the stated lower volume — and an explicit justification for each major technology choice, since that is the stated weak point the interviewer probes.

Follow-up Questions

A city-wide disaster causes a 50× burst of incidents in one neighborhood while the rest of the metro is quiet. What degrades first, and how do you keep the system from melting down?
Extend the design to multi-hero incidents that need several heroes with complementary skills arriving together — how do matching, offers, and the state machine change, while still guaranteeing no hero is double-booked?
An assigned hero goes unreachable mid-rescue (missed heartbeats) and never marks the incident resolved. How do you detect this and safely reassign without double-dispatching or stranding the civilian?
You must add a machine-learned model that predicts each hero's acceptance probability and ETA to improve ranking. Where does it sit in the architecture, and how do you keep matching latency in seconds?

Design a Superhero Dispatch System

Quick Overview

Constraints & Assumptions

Clarifying Questions to Ask

What a Strong Answer Covers

Follow-up Questions

Solution

Submit Your Answer to Earn 20XP

Design a Superhero Dispatch System

Quick Overview

Constraints & Assumptions

Clarifying Questions to Ask

What a Strong Answer Covers

Follow-up Questions

Solution

Submit Your Answer to Earn 20XP