## System Design: Coffee Ordering System
Design a system for a coffee shop (or chain) that supports ordering drinks.
### Core use cases
- Customer browses menu and places an order (pickup or in-store).
- Customer customizes items (size, milk type, add-ons) and sets quantity.
- System calculates total price (including tax/discounts).
- Payment can be processed (or order marked as pay-at-store).
- Baristas see a **queue** of incoming orders and update status: `PLACED → PAID (optional) → IN_PROGRESS → READY → COMPLETED/CANCELLED`.
- Customer can see order status updates.
### Constraints / scale (assume)
- Start with one store, then extend to many stores.
- Peak: thousands of orders/min across all stores.
- Status updates should feel near-real-time to customers.
### What to cover
- APIs (or events) for placing orders and status updates.
- Data model for menu, orders, payments.
- High-level architecture and major services.
- Consistency, idempotency, and failure handling (e.g., retries, duplicate submits).
- How you would scale to many stores.
Quick Answer: This question evaluates system design competency, focusing on order lifecycle modeling and state-machine enforcement, idempotency and data consistency, event-driven integration, real-time status propagation, and scaling across stores.
Solution
## Coffee Ordering System — Model Answer
This is a time-boxed round, so I'll spend most of it on the order state machine, idempotency/payment correctness, and the scale-out story, and keep the rest tight.
### 1) Requirements & scope
**Functional**
- Customer browses a **per-store** menu, customizes items (size, milk, add-ons), sets quantity, and places an order for **pickup** or **in-store**.
- Pricing is computed **server-side**: line totals, tax, discounts/promos.
- Payment is **online** (card/wallet via a payment service provider, "PSP") or **pay-at-store**.
- An order moves through a state machine: `PLACED → (PAID) → IN_PROGRESS → READY → COMPLETED`, with `CANCELLED` reachable from any non-terminal state **up to and including `IN_PROGRESS`** (cancelling an already-paid order there triggers a refund). Once a drink is `READY` we don't cancel it (see §7).
- Baristas at a store see a live **queue** and advance status.
- Customer sees near-real-time status and is notified when the drink is `READY`.
**Non-functional**
- Place-order and status-read latency sub-second at p99.
- High availability during the morning rush; a store briefly unable to reach the cloud should degrade gracefully, not lose orders.
- **Correctness over availability for money and order submission**: no duplicate orders, no double charges; idempotent ("effectively-once") payment capture over at-least-once webhook delivery.
- Status updates feel real-time (low single-digit seconds).
- Full audit trail for every order/payment state change.
**Explicitly out of scope** (state it, then move on): inventory/supply chain, loyalty accrual beyond a discount hook, delivery/driver dispatch, POS hardware. I'll leave hooks where they'd plug in.
**Clarifications I'd lock down**: tax/menu are **per store** (yes); only "make it now" for now, with a `requested_for` field reserved for order-ahead; single region first with multi-region as a scale-out axis.
---
### 2) Back-of-envelope sizing
Anchor component choices with numbers.
- Peak: **thousands of orders/min across all stores**. Take $\approx 6{,}000$ orders/min $= 100$ orders/sec.
- Reads dominate. Menu browsing + status polling/streaming is the bulk; assume $\sim 50{:}1$ read:write → $\approx 5{,}000$ read QPS at peak. Menu reads are near-100% cacheable at the edge.
- Order size $\sim 1\text{KB}$ (line items + modifiers JSON); payment record $\sim 0.5\text{KB}$.
- Daily volume is bursty (morning peak), so a flat $100/\text{s} \times 86{,}400 \approx 8.6$M/day over-counts; call it **$1\text{–}5$M orders/day**.
- Order storage: $5\text{M/day} \times 1.5\text{KB} \approx 7.5\text{GB/day} \approx 2.7\text{TB/yr}$ — trivially handled by a partitioned relational store; archive cold orders to object storage after ~90 days.
- Realtime fan-out $\approx$ orders in flight. With $\sim 5$ min from `PLACED` to `READY` and $100/\text{s}$, that's $\sim 30\text{K}$ in-flight orders → $\sim 30\text{K}$ concurrent SSE/WebSocket connections — comfortable for a small fleet of stateless gateways.
**Takeaway**: raw write throughput (100/s) is *not* the hard part. The hard parts are (a) idempotent order create + effectively-once payment capture, and (b) low-latency per-store realtime fan-out. No exotic infra needed for throughput.
---
### 3) High-level architecture
```
Customer app ─┐ ┌─ Menu Service ── Menu DB (doc) + Redis + CDN
(web/mobile) │ │
├─ API Gateway / BFF ─────┼─ Order Service ── Orders DB (Postgres, sharded by store_id)
Barista app ──┘ (auth, rate limit, │ │ └─ Outbox table
routing) │ │
├─ Payment Service ── Payments DB ── PSP (Stripe/etc.)
│ ▲ webhook
├─ Queue/Projection Service ── Redis (per-store sorted sets)
│
└─ Realtime Gateway (WebSocket/SSE) ── Notification Service
▲
Event Bus (Kafka/PubSub): OrderPlaced, PaymentCaptured,
OrderStatusChanged
```
**Services**
- **API Gateway / BFF**: TLS, authN, rate limiting, routing, thin per-client aggregation (mobile vs. barista tablet).
- **Menu Service** (read-heavy): per-store menu + pricing/modifier config. Document store, fronted by Redis + CDN, version-based invalidation.
- **Order Service** (transactional core): order creation, pricing, the state machine, cancellation. Owns the source-of-truth `orders` table and an **outbox** for reliable event emission.
- **Payment Service**: owns payment intents/results, talks to the PSP, and consumes PSP webhooks as the source of truth for online capture. Also records **pay-at-store capture** when a barista marks an order paid.
- **Queue/Projection Service**: consumes order events into a fast, store-scoped read model for baristas (Redis sorted sets keyed by store).
- **Realtime Gateway**: holds customer/barista WebSocket/SSE connections, pushes status changes from the bus.
- **Notification Service**: push/SMS/email on `READY` and `CANCELLED`.
- **Event Bus**: decouples writes from projections, notifications, analytics.
**Why split Order and Payment?** Different consistency and failure profiles — payment involves a slow external PSP and webhook callbacks, and we never want a PSP timeout to roll back a created order. The **outbox + event bus** lets the order commit locally while the rest happens asynchronously and reliably.
---
### 4) Data model
Relational for orders/payments (transactions, clear invariants); a document store for the flexible, deeply-nested menu.
```sql
-- Catalog (per store; many fields store-overridable)
stores(store_id PK, name, timezone, address, tax_rate, hours_json,
is_online BOOL, capacity_config_json)
menu_items(item_id PK, store_id FK, name, category, base_price_cents,
is_available BOOL, version BIGINT)
modifiers(mod_id PK, item_id FK, name, price_delta_cents,
group_name, -- e.g. "Milk", "Size"
selection_rule) -- one-of | many-of | required
-- Orders (sharded by store_id)
orders(order_id PK, user_id, store_id FK, status,
fulfillment_type, -- PICKUP | IN_STORE
pay_mode, -- ONLINE | AT_STORE
subtotal_cents, tax_cents, discount_cents, total_cents, currency,
idempotency_key UNIQUE, -- dedupe key for create
requested_for TIMESTAMPTZ NULL, -- reserved for order-ahead
created_at, updated_at)
order_lines(line_id PK, order_id FK, item_id, qty,
unit_price_cents, -- snapshot at order time
modifiers_json, -- chosen modifiers + deltas, snapshotted
line_total_cents)
-- One row per capture, ONLINE or AT_STORE: the single "money moved" record.
payments(payment_id PK, order_id FK,
provider, -- 'stripe' (online) | 'in_store' (POS)
intent_id, -- PSP intent; NULL for in_store
provider_ref UNIQUE, -- PSP charge id OR POS receipt id; dedupe on this
status, -- REQUIRES_PAYMENT | CAPTURED | FAILED | REFUNDED
captured_by, -- barista actor_id, for in_store
amount_cents, currency, created_at, updated_at)
-- Audit / event log of status (append-only)
order_events(event_id PK, order_id FK, event_type, from_status, to_status,
actor_id, actor_role, payload_json, created_at)
-- Reliable event emission (written in the same txn as the order)
outbox(id PK, aggregate_id, topic, payload_json, published BOOL, created_at)
```
**Design notes**
- **Prices are snapshotted** onto `order_lines`/`orders` at creation. A later menu price change can't alter a placed order — a correctness requirement, not an optimization.
- `idempotency_key UNIQUE` — the database, not application logic, is the final arbiter of "duplicate."
- `provider_ref UNIQUE` — a redelivered PSP webhook (or a re-tapped POS "mark paid") can't capture twice.
- **Pay-at-store capture is a real `payments` row**: when the barista collects cash/card and marks paid, the Order Service writes `provider='in_store'`, `status='CAPTURED'`, and a POS-receipt `provider_ref`. This gives both payment paths the same "money moved" backing record, so spend queries (§13) key off `payments` uniformly.
- Modifiers stored as a JSON snapshot on the line — the catalog `modifiers` table can change freely without rewriting history.
---
### 5) Key APIs
```
# Menu (cacheable, ETag/version on response)
GET /v1/stores/{storeId}/menu -> menu + modifiers + prices
# Order
POST /v1/orders (Idempotency-Key header required)
body: { store_id, fulfillment_type,
items:[{item_id, qty, modifier_ids:[...]}], payment_method }
resp: { order_id, status, subtotal, tax, discount, total, client_secret? }
GET /v1/orders/{orderId} -> full order + status
POST /v1/orders/{orderId}/cancel -- CUSTOMER self-service cancel, only while
PLACED or PAID (before the barista starts it).
Cancelling IN_PROGRESS is a STAFF /transition.
# Payment
POST /v1/payments/webhook -- PSP -> us; signature-verified
PaymentCaptured / PaymentFailed
# Barista / status (RBAC: barista role, scoped to their store)
GET /v1/stores/{storeId}/queue?status=PLACED,IN_PROGRESS,READY
POST /v1/orders/{orderId}/transition body: { to_status, expected_status }
-- the ONE authoritative state-change endpoint. Staff advance status here,
-- and cancel an IN_PROGRESS order ({to_status: CANCELLED}) -> kicks off refund.
# Realtime
WS /v1/orders/{orderId}/stream -- customer streams their order's events
WS /v1/stores/{storeId}/queue/stream -- barista tablet streams the live queue
```
Server **re-prices** on `POST /orders` from authoritative store config — client totals are advisory only. Both `/cancel` and `/transition` carry/derive `expected_status` for optimistic concurrency (compare-and-set). There is exactly **one** place transitions are validated and applied — the Order Service's `transition()` (§7) — and both endpoints funnel into it, so there's no looser cancel path that could bypass the state machine.
---
### 6) Core flows
**A) Place order with online payment**
```
1. Client POST /orders with Idempotency-Key.
2. Order Service:
a. INSERT ... ON CONFLICT (idempotency_key) DO NOTHING.
If conflict: load and return the existing order (idempotent retry).
b. Validate items belong to the store & are available; re-price server-side.
c. In ONE txn: orders(status=PLACED, pay_mode=ONLINE) + order_lines
+ order_events + outbox('OrderPlaced'). COMMIT.
3. Payment Service creates a PSP payment intent; returns client_secret.
4. Client confirms with the PSP directly (SCA/3DS happens client-side).
5. PSP -> POST /payments/webhook (PaymentCaptured). Verify signature,
dedupe on provider_ref, write payments(provider='stripe', CAPTURED),
transition PLACED->PAID, emit OrderStatusChanged.
6. Queue projection adds the order to the store's barista queue (after PAID,
for online-pay orders).
```
**B) Pay-at-store**: skip steps 3–5; create the order `pay_mode=AT_STORE`, queued at `PLACED`. The state machine allows `PLACED → IN_PROGRESS` without `PAID` for this path. At handoff the barista marks it paid: the Order Service writes a `payments` row (`provider='in_store'`, `CAPTURED`, `captured_by=barista`, POS-receipt `provider_ref`) **in the same transaction** as the `READY → COMPLETED` transition. So a pay-at-store order never carries a `PAID` order-status, yet "money moved" is recorded when cash/card is collected — a completed order has a CAPTURED payment, one abandoned in the queue does not.
**C) Status update**
```
Barista POST /orders/{id}/transition {to_status: IN_PROGRESS, expected_status: PAID}
-> validate transition against the state machine,
compare-and-set on current status, append order_events, write outbox.
-> OrderStatusChanged flows to Realtime Gateway (customer sees it)
and Notification Service (push when READY).
```
---
### 7) The order state machine (the hard part)
Transitions are **explicit and enforced server-side**; clients never set status directly. A transition table plus a compare-and-set prevents two baristas double-advancing the same order.
```python
ALLOWED = {
"PLACED": {"PAID", "IN_PROGRESS", "CANCELLED"}, # IN_PROGRESS only if pay_at_store
"PAID": {"IN_PROGRESS", "CANCELLED"},
"IN_PROGRESS": {"READY", "CANCELLED"}, # staff cancel here triggers refund
"READY": {"COMPLETED"}, # a made drink isn't cancellable
"COMPLETED": set(), # terminal
"CANCELLED": set(), # terminal
}
def transition(order_id, to_status, expected_status, actor):
# 1. Validate BEFORE touching the DB. .get(..., set()) so a terminal/unknown
# expected_status is a clean rejection, not a KeyError. Nothing written yet.
if to_status not in ALLOWED.get(expected_status, set()):
raise InvalidTransition(expected_status, to_status)
# 2. Atomic compare-and-set: advance only if DB status still matches expected.
# The WHERE clause is the concurrency guard.
rows = db.execute("""
UPDATE orders SET status = :to, updated_at = now()
WHERE order_id = :id AND status = :expected
RETURNING order_id
""", to=to_status, id=order_id, expected=expected_status)
if not rows: # someone else moved it, or stale read
raise ConflictError("status changed concurrently")
# 3. Audit + outbox in the SAME transaction as the UPDATE (§9).
append_event(order_id, expected_status, to_status, actor)
write_outbox(order_id, "OrderStatusChanged", {"to": to_status})
```
Validation comes **first**, so an illegal target (e.g. `READY` from `PAID`) is rejected before any row is written; the DB never holds an illegal status. The `WHERE status = :expected` clause is the concurrency guard — if a teammate already advanced the order, the `UPDATE` touches 0 rows and we reject the stale transition. Validity check + UPDATE + audit + outbox run in one transaction, so a failure at any step leaves the order untouched.
**Cancellation scope**: `CANCELLED` is reachable from `PLACED`, `PAID`, and `IN_PROGRESS` — every non-terminal state up to and including `IN_PROGRESS`, but **not** from `READY` (a freshly-made drink is completed or written off out-of-band, not cancelled). This is a deliberate narrowing of the prompt's literal `READY → COMPLETED/CANCELLED`, which I'd call out to the interviewer. The customer `/cancel` endpoint only permits the `PLACED`/`PAID` window; cancelling `IN_PROGRESS` is a staff `/transition`, and that `IN_PROGRESS → CANCELLED` on a paid order is what kicks off a PSP refund.
---
### 8) Real-time status delivery
Updates must land within seconds over long-lived, flaky mobile connections.
- **Transport**: WebSocket for the barista tablet (bidirectional, persistent); **SSE** is simpler and sufficient for the customer's one-way status stream. Fall back to **short polling** of `GET /orders/{id}` if the socket drops.
- **Fan-out path**: `OrderStatusChanged` → event bus → Realtime Gateway nodes. Customer connections keyed by `order_id`; barista connections subscribe to a `store_id` channel.
- **Stateless gateways + Redis pub/sub**: any gateway node can hold any connection, so publish to a Redis pub/sub channel per `store_id` (and per `order_id`); the publisher needn't know which node holds the socket. $\sim 30\text{K}$ concurrent connections is comfortable.
- **Catch-up on reconnect**: the stream returns the **current** order status on connect (read from `orders`), so a client that missed an event while offline is immediately correct. Events carry a monotonically increasing per-order sequence; a client can request a replay from `order_events`.
---
### 9) Consistency, idempotency & failure handling
This is where the design earns its keep.
- **Idempotent order create**: client supplies an `Idempotency-Key`; the `UNIQUE` constraint makes "create the same order twice" physically impossible. Double-click / network-retry returns the same `order_id`.
- **Effectively-once payment capture**: PSP webhook delivery is at-least-once, so make the *consumer* idempotent rather than claiming exactly-once delivery. Treat **PSP webhooks as the source of truth** for online payments, never the client's "success" callback (it can lie or drop). Dedupe on `UNIQUE provider_ref` — a redelivered `PaymentCaptured` is a no-op, net effect one capture. Pay-at-store "mark paid" dedupes the same way (POS receipt id = `provider_ref`). **Verify the webhook signature** before trusting it.
- **Atomic write + reliable events (outbox)**: the order row and its `outbox` event are written in the **same DB transaction**, so an order can't commit without its event (or vice versa). A relay polls `outbox` (or via CDC/logical decoding), publishes to the bus, and marks rows `published`. This avoids the dual-write problem (DB commit succeeds, publish fails).
- **State machine + compare-and-set**: invalid transitions rejected in-process before any write; concurrent transitions rejected by the DB compare-and-set (§7).
- **PSP/order divergence**: an order can be `PLACED` but never confirm (customer abandons 3DS). A sweeper expires unpaid online orders after $N$ minutes (`PLACED → CANCELLED`), releasing the slot. Conversely, if payment captured but the order row didn't update, the webhook handler is the reconciliation point and retries the transition.
- **Refunds**: cancelling a `PAID`/`IN_PROGRESS` online order emits a refund command; the order reaches `CANCELLED` only after the refund is acknowledged (or is marked `REFUND_PENDING`). For pay-at-store, a cancel before capture moves no money (no `payments` row yet) — nothing to refund.
- **Retries**: all write endpoints are safe to retry (idempotent); reads are trivially retryable.
---
### 10) Scaling to many stores
The natural partition key is the **store** — orders, queues, and barista views are almost always store-scoped.
- **Shard `orders` by `store_id`** (or a hash). All of a store's reads/writes hit one shard; no cross-shard query on the hot path. Adding stores adds shards horizontally.
- **Per-store queue projections**: a **Redis sorted set per store**, score = priority/`created_at`, gives baristas $O(\log N)$ insert and $O(\log N + k)$ range reads — fast and isolated. Rebuildable from `order_events` if Redis is lost.
- **Menu at the edge**: per-store, rarely-changing menus served from CDN with version/ETag; invalidate by bumping the store's menu `version`. Near-100% hit ratio removes menu reads from the origin.
- **Realtime gateways scale horizontally** (stateless + Redis pub/sub), partitioned by store channel.
- **Async everything non-critical**: notifications, analytics, projections are bus consumers, so a slow notification provider never slows order creation.
- **Multi-region (later)**: pin a store's data to the nearest region; cross-region orders are rare, keeping latency low without global consensus.
**Bottleneck honesty**: the real hot spots are (1) the Payment Service ↔ PSP path (external, slow — must be async + idempotent) and (2) realtime connection count at peak (mitigated by stateless gateways). Order writes at $\sim 100/\text{s}$ are not a bottleneck for a sharded relational store.
---
### 11) Security, privacy, observability
- **AuthZ (RBAC)**: customers see only their orders; baristas are scoped to their store's queue; admins manage menu/stores. Enforced at the gateway and re-checked in services.
- **Payments**: never store raw card data — tokenize via the PSP (PCI scope minimization). Verify every webhook signature.
- **PII**: minimize and encrypt customer contact fields at rest; redact in logs.
- **Observability**: order-create p99, payment success rate, **time-in-status per stage** (SLA on `PLACED→READY`), per-store queue length, webhook failure rate, outbox publish lag. Distributed tracing across Order → Payment → Queue. Alert on queue-length spikes (a store falling behind) and outbox lag (events not flowing).
---
### 12) Edge cases & follow-up scenarios
- **Item sells out after order**: barista marks unavailable → order offered cancel/refund or substitution.
- **Lost PSP webhook — payment captured, order stuck (follow-up #3)**: the PSP captured the money but the `PaymentCaptured` webhook never arrived, so the order is stranded at `PLACED`. Two mechanisms reconcile it, both already in the design (§9): (1) the **webhook handler is the reconciliation point** — when a later/redelivered webhook does land it re-runs the `PLACED → PAID` transition, and because capture is idempotent on `provider_ref` and the transition is a compare-and-set, replay is safe; (2) for the case where *no* webhook ever arrives, a **reconciliation sweeper** periodically pulls payment status from the PSP (or its events API) for orders that have been `PLACED` past a threshold, and on finding a CAPTURED charge writes the `payments` row and drives the same `PLACED → PAID` transition. The unpaid-order sweeper (§9) handles the inverse — genuinely abandoned orders are expired `PLACED → CANCELLED`. Net: a lost webhook never permanently strands a paid order.
- **Store loses connectivity during the rush (follow-up #4, part 1)**: set `is_online=false` and reject new *online* orders for that store with a clear message (don't silently queue payments we can't process). The **in-store POS keeps taking orders locally** in a local buffer with locally-generated order ids that already carry an `idempotency_key`; on reconnect it replays them to the cloud. Because create is idempotent on `idempotency_key` and capture is idempotent on `provider_ref`, replaying a batch can't duplicate orders or double-charge — orders the cloud already saw are no-ops. No order or payment is lost; at worst it's delayed until reconnect.
- **Scheduled order-ahead (follow-up #4, part 2)**: the `requested_for` field already reserves this. The change is mostly in *scheduling*, not the core model: don't inject the order into the live barista queue until `requested_for - lead_time`; the Queue/Projection Service uses `requested_for` (not `created_at`) as the sorted-set score so the queue is ordered by promised time; a scheduler fires the "make it now" event near pickup; capacity config per store caps how many ahead-orders a slot accepts. Payment can be captured up front or at pickup. The state machine is unchanged.
- **Duplicate / out-of-order events**: handled by `provider_ref` uniqueness and per-order event sequence numbers.
- **Timezone/tax per store**: ready-time estimates and tax come from the store's `timezone`/`tax_rate`, never the client's locale.
---
### 13) Follow-up: the two SQL questions
Because the design records **every capture — online and pay-at-store — as a `CAPTURED` row in `payments`** (§4), "money that actually moved" has one uniform definition across both payment paths. I key the spend queries off `payments`. (If asked to answer over `orders` alone, I'd flag that order-status is *not* a sound proxy for spend: a pay-at-store order reaches `IN_PROGRESS`/`READY`/`COMPLETED` whether or not cash was collected, so no `orders.status` set cleanly equals "paid." That's exactly why real spend must look at captures.)
**(a) The user who spent the most last month**
```sql
SELECT o.user_id, SUM(p.amount_cents) AS spend_cents
FROM payments p
JOIN orders o ON o.order_id = p.order_id
WHERE p.status = 'CAPTURED' -- money moved (online OR in_store)
AND p.created_at >= date_trunc('month', current_date) - interval '1 month'
AND p.created_at < date_trunc('month', current_date) -- half-open [first-of-last-month, first-of-this-month)
GROUP BY o.user_id
ORDER BY spend_cents DESC
LIMIT 1;
```
Summing `payments.amount_cents` (not `orders.total_cents`) also makes partial refunds fall out naturally if we later record `REFUNDED`/partial-capture rows — spend follows the money, not the order header. (Use `LIMIT 1 WITH TIES` or a window function if ties matter.)
**(b) Users who have never spent anything** (registered but no captured payment, ever)
The question is a **per-user** existence test ("does this user have *any* captured payment?"), so the anti-join must be expressed per user — `NOT EXISTS` is the clean, correct shape:
```sql
SELECT u.user_id
FROM users u
WHERE NOT EXISTS (
SELECT 1
FROM orders o
JOIN payments p ON p.order_id = o.order_id
WHERE o.user_id = u.user_id
AND p.status = 'CAPTURED' -- same "captured = spend" rule as (a)
);
```
**Why not the chained `users → orders → payments` LEFT JOIN + `IS NULL`?** It's the obvious-looking form, and it's *wrong* here. Joining out to per-order rows means a user with two orders — one captured, one not — produces a surviving `payment_id IS NULL` row for the un-captured order, so a user who *did* spend is returned as "never spent." The LEFT-JOIN anti-join only works if you first collapse to one row per user (e.g. anti-join against a `DISTINCT user_id` set of spenders), which `NOT EXISTS` does implicitly. If you do write it as a LEFT JOIN, anti-join against the **per-user spender set**, not the per-order payment rows:
```sql
SELECT u.user_id
FROM users u
LEFT JOIN (
SELECT DISTINCT o.user_id
FROM orders o
JOIN payments p ON p.order_id = o.order_id
WHERE p.status = 'CAPTURED'
) spenders ON spenders.user_id = u.user_id
WHERE spenders.user_id IS NULL;
```
Two further classic traps worth naming: the `status = 'CAPTURED'` predicate must live in the spender subquery (an inner `JOIN` here), **not** in an outer `WHERE` on the right side of a `LEFT JOIN` — that would silently collapse the outer join into an inner one and drop the users we want. And prefer `NOT EXISTS` over `NOT IN (SELECT ...)`, which returns *no rows at all* if the subquery yields any `NULL`. (Assumes a `users` table — a standard interview assumption.)
---
### Summary
A small set of services around a transactional **Order Service** (sharded by `store_id`), a decoupled **Payment Service** that trusts PSP webhooks for online pay and records a POS capture row for pay-at-store, and a fast **per-store Redis queue projection** for baristas. Correctness is bought cheaply with three primitives: a `UNIQUE` idempotency key on order create, a `UNIQUE provider_ref` for effectively-once capture on **both** payment paths, and an **outbox** so events never diverge from the DB. State changes — including the only cancellation that triggers a refund, `IN_PROGRESS → CANCELLED` — funnel through one server-side `transition()` guarded by a compare-and-set. Real-time status rides the event bus to stateless WebSocket/SSE gateways. Throughput ($\sim 100$ writes/s) is easy; the real engineering is idempotency, the payment/PSP failure surface, store-offline replay, and per-store realtime fan-out — and that's where the design concentrates its effort.