Design Crypto Order Routing
Company: Coinbase
Role: Software Engineer
Category: System Design
Difficulty: hard
Interview Round: Onsite
Design the **order-placement system** for a cryptocurrency trading product.
The product lets users place, cancel, and track crypto buy and sell orders. Your company does **not** run the matching engine itself. Instead, it routes orders to one or more third-party matching venues (exchanges). Different venues may expose different APIs or protocols (REST, WebSocket, FIX, or proprietary), and any single venue may be slow, rate-limited, or temporarily down.
Design a system that supports:
- **Market and limit orders.**
- **Order submission, cancellation, and status tracking.**
- **Multiple external matching venues** behind a uniform internal interface.
- **Balance checks and fund reservation** before an order is submitted externally.
- **Execution reports, partial fills, failed orders, and reconciliation** against venue state.
- **High availability, correctness, auditability, and low-latency** order handling.
Cover the full picture: public APIs, data model, core services, the order state machine, routing logic, failure handling, scaling, observability, and the key trade-offs.
```hint Where to start
This is a financial system: decide up front what you'll prioritize — raw latency, or correctness and auditability — for anything that touches money or order state. Also ask what guarantees you actually get from each external venue, and how much you can trust them.
```
```hint Decompose the services
Notice that different concerns have different consistency needs: placing/validating an order, holding and reserving funds, deciding where to send the order, and talking to each heterogeneous venue protocol. Would one service own all of that, or would you separate them? Let the consistency requirements drive the boundaries.
```
```hint The hard part is failure
Push on the worst case: your request to a venue *times out after you sent it.* Did the venue get the order or not? Think about what order status you'd record in that moment, and what would have to be true for a safe automatic retry. How would you later find out the real outcome without guessing?
```
```hint Money handling
A single mutable balance integer can't tell you *why* it changed or whether an open order has already claimed some of it. What richer representation of a user's funds would let you reserve money for a pending order, and reconstruct the full history of every movement for an audit?
```
### Constraints & Assumptions
State your own numbers; a reasonable working set is:
- Order submit / cancel p99 latency target on the synchronous internal path: **tens of milliseconds**, excluding the external venue round-trip, which each venue determines.
- Order throughput: design for thousands of orders/sec at peak with headroom; execution-report volume can be several multiples of that, since a single order can produce multiple acks and fills.
- 3–10 external venues, each with its own protocol, rate limits, fee schedule, and reliability profile.
- A single internal order may be **split across multiple venues** and produce **multiple partial fills**.
- Funds are custodial (the platform holds user balances); fund reservation happens before any external send.
- Strong consistency is required for balances and order-state transitions; eventual consistency is acceptable for read-only views (history, dashboards).
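
To make "fund reservation happens before any external send" concrete, here is a minimal sketch of an append-only, double-entry ledger in which a reservation moves funds from an `available` sub-account to a `reserved` one. The names (`Ledger`, `Entry`, the `user:asset:bucket` account format) are illustrative assumptions, not part of the question, and the in-memory list stands in for what would need to be a transactional store with the balance check and the append done atomically.

```python
from dataclasses import dataclass
from decimal import Decimal


class InsufficientFunds(Exception):
    pass


@dataclass(frozen=True)
class Entry:
    account: str     # hypothetical format: "user:asset:bucket"
    amount: Decimal  # signed; every movement posts two entries summing to zero
    reason: str      # why the money moved, e.g. "reserve:order-1"


class Ledger:
    def __init__(self) -> None:
        self.entries: list[Entry] = []  # append-only: history is never rewritten

    def balance(self, account: str) -> Decimal:
        # A balance is derived from history, not stored as a mutable number.
        return sum((e.amount for e in self.entries if e.account == account), Decimal("0"))

    def _move(self, src: str, dst: str, amount: Decimal, reason: str, check: bool = True) -> None:
        if amount <= 0:
            raise ValueError("amount must be positive")
        if check and self.balance(src) < amount:
            raise InsufficientFunds(f"{src} has {self.balance(src)}, needs {amount}")
        self.entries.append(Entry(src, -amount, reason))
        self.entries.append(Entry(dst, amount, reason))

    def deposit(self, user: str, asset: str, amount: Decimal) -> None:
        # The external account may go negative; it mirrors funds held on-platform.
        self._move(f"external:{asset}", f"{user}:{asset}:available", amount, "deposit", check=False)

    def reserve(self, user: str, asset: str, amount: Decimal, order_id: str) -> None:
        self._move(f"{user}:{asset}:available", f"{user}:{asset}:reserved", amount, f"reserve:{order_id}")

    def release(self, user: str, asset: str, amount: Decimal, order_id: str) -> None:
        self._move(f"{user}:{asset}:reserved", f"{user}:{asset}:available", amount, f"release:{order_id}")
```

In this sketch, depositing 1000 USD and reserving 600 for one order leaves 400 available, so a second reservation of 500 raises `InsufficientFunds` instead of overcommitting. Because every entry carries a `reason`, the full history of a balance can be replayed for an audit.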
### Clarifying Questions to Ask
- Is the platform **custodial** (we hold user funds and reserve from an internal balance) or non-custodial (users sign on-chain)? This changes the entire fund-reservation model.
- Do we need **smart order routing / order splitting** across venues from day one, or is single-venue routing an acceptable MVP?
- Which **order types and time-in-force** options must we support (market, limit, IOC, FOK, GTC)? Do we need stop/conditional orders?
- What are the **regulatory/compliance** requirements (sanctions screening, trading limits, audit retention, jurisdictional restrictions)?
- Do venues support a **client-supplied order ID** for idempotent submission and lookup, and do their execution feeds carry **sequence numbers**?
- What is acceptable behavior during a **venue outage** — reject new orders, queue them, or fail over to another venue?
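
The client-supplied order ID question matters because it largely decides whether a retry after a timeout can be safe. As a rough illustration, and assuming venues do accept such an ID, the uniform internal interface could be a small adapter contract that each venue integration implements. Everything named here (`VenueAdapter`, `VenueOrder`, `VenueAck`, `FakeVenue`) is hypothetical and does not reflect any real venue's API.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass(frozen=True)
class VenueOrder:
    client_order_id: str                   # assumption: the venue accepts our ID
    symbol: str                            # e.g. "BTC-USD"
    side: str                              # "buy" or "sell"
    quantity: Decimal
    limit_price: Optional[Decimal] = None  # None means a market order


@dataclass(frozen=True)
class VenueAck:
    client_order_id: str
    venue_order_id: str
    accepted: bool


class VenueAdapter(ABC):
    """Hypothetical contract hiding REST / WebSocket / FIX differences."""

    @abstractmethod
    def submit(self, order: VenueOrder) -> VenueAck: ...

    @abstractmethod
    def cancel(self, client_order_id: str) -> bool: ...

    @abstractmethod
    def lookup(self, client_order_id: str) -> Optional[VenueAck]: ...


class FakeVenue(VenueAdapter):
    """In-memory stand-in used only to exercise the interface."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._orders: dict[str, VenueAck] = {}

    def submit(self, order: VenueOrder) -> VenueAck:
        # Resubmitting a known client_order_id returns the original ack
        # rather than creating a second order.
        if order.client_order_id not in self._orders:
            venue_id = f"{self.name}-{len(self._orders) + 1}"
            self._orders[order.client_order_id] = VenueAck(order.client_order_id, venue_id, True)
        return self._orders[order.client_order_id]

    def cancel(self, client_order_id: str) -> bool:
        return self._orders.pop(client_order_id, None) is not None

    def lookup(self, client_order_id: str) -> Optional[VenueAck]:
        return self._orders.get(client_order_id)
```

If a venue does not deduplicate on the client ID, the `submit` retry shown by `FakeVenue` is no longer safe, and the adapter would have to fall back on `lookup` (or a reconciliation pass) before resending.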
### What a Strong Answer Covers
- **Requirements split** into functional and non-functional, with correctness and auditability called out as first-class, not afterthoughts.
- A clear **service decomposition** in which each component has a single responsibility, and the boundaries are justified by differing consistency and reliability needs.
- A **public API** whose retry semantics are safe (a re-sent submit can't create a duplicate order) and that distinguishes "accepted internally" from "accepted by the venue."
- A well-defined **order lifecycle / state machine** with validated transitions, terminal states, and an explicit story for the ambiguous "we don't know what the venue did" case.
- A **fund-handling model** that is auditable and reconstructable, with correct reserve/release semantics across buy vs. sell, limit vs. market, and partial fills.
- **Routing logic** that starts simple and can grow toward multi-venue execution, with the trade-offs made explicit.
- Rigorous **failure handling**: post-send timeouts, safe retries, out-of-order or duplicated execution reports, venue disconnects and resync, and a reconciliation strategy.
- A coherent **consistency model**, plus **scaling/partitioning**, **observability** (the metrics and alerts that matter for a money system), and **security/compliance**.
- Explicit **trade-offs** rather than a single "right" answer.
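
The order lifecycle bullet above can be sketched as an explicit transition table. This is one possible set of states, chosen for illustration; the state names, including `UNKNOWN` for the "we don't know what the venue did" case, are assumptions rather than a prescribed answer.

```python
from enum import Enum


class OrderState(Enum):
    NEW = "new"
    RESERVED = "reserved"              # funds held, nothing sent yet
    SENT = "sent"                      # request written to the venue
    UNKNOWN = "unknown"                # send timed out; venue outcome not known
    OPEN = "open"
    PARTIALLY_FILLED = "partially_filled"
    CANCEL_PENDING = "cancel_pending"
    FILLED = "filled"
    CANCELED = "canceled"
    REJECTED = "rejected"


S = OrderState
ALLOWED = {
    S.NEW: {S.RESERVED, S.REJECTED},
    S.RESERVED: {S.SENT, S.CANCELED},
    S.SENT: {S.OPEN, S.PARTIALLY_FILLED, S.FILLED, S.REJECTED, S.UNKNOWN},
    # UNKNOWN is only left once a lookup or reconciliation reports the truth.
    S.UNKNOWN: {S.OPEN, S.PARTIALLY_FILLED, S.FILLED, S.REJECTED, S.CANCELED},
    S.OPEN: {S.PARTIALLY_FILLED, S.FILLED, S.CANCEL_PENDING, S.CANCELED},
    S.PARTIALLY_FILLED: {S.PARTIALLY_FILLED, S.FILLED, S.CANCEL_PENDING, S.CANCELED},
    # A fill can still win the race against a cancel request.
    S.CANCEL_PENDING: {S.CANCELED, S.FILLED},
    S.FILLED: set(),
    S.CANCELED: set(),
    S.REJECTED: set(),
}


class InvalidTransition(Exception):
    pass


def is_terminal(state: OrderState) -> bool:
    return not ALLOWED[state]


def transition(current: OrderState, target: OrderState) -> OrderState:
    """Return the new state, or raise if the move is not in the table."""
    if target not in ALLOWED[current]:
        raise InvalidTransition(f"{current.value} -> {target.value}")
    return target
```

Keeping the table as data means every state change goes through one check, and an attempt such as `transition(OrderState.FILLED, OrderState.OPEN)` raises `InvalidTransition` rather than silently reopening a finished order.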
### Follow-up Questions
- A venue ACK is lost but the order actually filled there; reserved funds are stuck and the user re-submits. Walk through exactly how reconciliation detects and repairs this without double-charging or double-filling.
- How do you cancel an order that has been **split across two venues** when one leg has partially filled and the other is still open? What does the user see during this window?
- How do you guarantee an execution report is applied **exactly once** to the ledger when venues may redeliver messages and your consumer may crash mid-update?
- How would you extend the design to support **stop-loss / conditional orders**, where the trigger is evaluated against live market data rather than submitted immediately?
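
For the exactly-once follow-up, one common approach is to record each execution report's identity and apply its effect in the same database transaction, so a redelivery or a crash mid-update cannot apply it twice. The sketch below uses an in-memory SQLite database purely for illustration. It assumes each venue provides a stable execution ID per fill, and the table and function names (`applied_reports`, `apply_fill`) are hypothetical.

```python
import sqlite3


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE orders (
            order_id TEXT PRIMARY KEY,
            filled_units INTEGER NOT NULL   -- integer minor units, no floats
        );
        CREATE TABLE applied_reports (
            venue TEXT NOT NULL,
            exec_id TEXT NOT NULL,
            PRIMARY KEY (venue, exec_id)    -- the dedupe key
        );
        """
    )
    return conn


def apply_fill(conn: sqlite3.Connection, venue: str, exec_id: str,
               order_id: str, units: int) -> bool:
    """Apply a fill once. Returns False if this report was already applied."""
    try:
        with conn:  # one transaction: commit on success, roll back on error
            conn.execute(
                "INSERT INTO applied_reports (venue, exec_id) VALUES (?, ?)",
                (venue, exec_id),
            )
            cur = conn.execute(
                "UPDATE orders SET filled_units = filled_units + ? WHERE order_id = ?",
                (units, order_id),
            )
            if cur.rowcount != 1:
                # Rolls back the dedupe row too, so the report can be retried.
                raise LookupError(f"unknown order {order_id}")
    except sqlite3.IntegrityError:
        return False  # duplicate (venue, exec_id): already applied
    return True
```

Because the dedupe row and the fill update commit together, a consumer that crashes before commit leaves neither behind, and the redelivered report is applied cleanly on the next attempt. A real ledger would post balance entries inside the same transaction.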
Quick Answer: This question evaluates a candidate's ability to architect distributed, low-latency, and fault-tolerant order-placement and routing systems for cryptocurrency trading, covering consistency guarantees, external integration with heterogeneous venue protocols, fund reservation, execution reconciliation, and auditability.