PracHub
QuestionsPremiumCoachesLearningGuidesInterview Prep
|Home/System Design/Salesforce

Design a coffee ordering system

Last updated: Jun 17, 2026

Quick Overview

This question evaluates system design competency, focusing on order lifecycle modeling and state-machine enforcement, idempotency and data consistency, event-driven integration, real-time status propagation, and scaling across stores.

  • medium
  • Salesforce
  • System Design
  • Software Engineer

Design a coffee ordering system

Company: Salesforce

Role: Software Engineer

Category: System Design

Difficulty: medium

Interview Round: Technical Screen

## System Design: Coffee Ordering System Design a system for a coffee shop (or chain) that supports ordering drinks. ### Core use cases - Customer browses menu and places an order (pickup or in-store). - Customer customizes items (size, milk type, add-ons) and sets quantity. - System calculates total price (including tax/discounts). - Payment can be processed (or order marked as pay-at-store). - Baristas see a **queue** of incoming orders and update status: `PLACED → PAID (optional) → IN_PROGRESS → READY → COMPLETED/CANCELLED`. - Customer can see order status updates. ### Constraints / scale (assume) - Start with one store, then extend to many stores. - Peak: thousands of orders/min across all stores. - Status updates should feel near-real-time to customers. ### What to cover - APIs (or events) for placing orders and status updates. - Data model for menu, orders, payments. - High-level architecture and major services. - Consistency, idempotency, and failure handling (e.g., retries, duplicate submits). - How you would scale to many stores.

Quick Answer: This question evaluates system design competency, focusing on order lifecycle modeling and state-machine enforcement, idempotency and data consistency, event-driven integration, real-time status propagation, and scaling across stores.

Solution

## Coffee Ordering System — Model Answer This is a time-boxed round, so I'll spend most of it on the order state machine, idempotency/payment correctness, and the scale-out story, and keep the rest tight. ### 1) Requirements & scope **Functional** - Customer browses a **per-store** menu, customizes items (size, milk, add-ons), sets quantity, and places an order for **pickup** or **in-store**. - Pricing is computed **server-side**: line totals, tax, discounts/promos. - Payment is **online** (card/wallet via a payment service provider, "PSP") or **pay-at-store**. - An order moves through a state machine: `PLACED → (PAID) → IN_PROGRESS → READY → COMPLETED`, with `CANCELLED` reachable from any non-terminal state **up to and including `IN_PROGRESS`** (cancelling an already-paid order there triggers a refund). Once a drink is `READY` we don't cancel it (see §7). - Baristas at a store see a live **queue** and advance status. - Customer sees near-real-time status and is notified when the drink is `READY`. **Non-functional** - Place-order and status-read latency sub-second at p99. - High availability during the morning rush; a store briefly unable to reach the cloud should degrade gracefully, not lose orders. - **Correctness over availability for money and order submission**: no duplicate orders, no double charges; idempotent ("effectively-once") payment capture over at-least-once webhook delivery. - Status updates feel real-time (low single-digit seconds). - Full audit trail for every order/payment state change. **Explicitly out of scope** (state it, then move on): inventory/supply chain, loyalty accrual beyond a discount hook, delivery/driver dispatch, POS hardware. I'll leave hooks where they'd plug in. **Clarifications I'd lock down**: tax/menu are **per store** (yes); only "make it now" for now, with a `requested_for` field reserved for order-ahead; single region first with multi-region as a scale-out axis. --- ### 2) Back-of-envelope sizing Anchor component choices with numbers. - Peak: **thousands of orders/min across all stores**. Take $\approx 6{,}000$ orders/min $= 100$ orders/sec. - Reads dominate. Menu browsing + status polling/streaming is the bulk; assume $\sim 50{:}1$ read:write → $\approx 5{,}000$ read QPS at peak. Menu reads are near-100% cacheable at the edge. - Order size $\sim 1\text{KB}$ (line items + modifiers JSON); payment record $\sim 0.5\text{KB}$. - Daily volume is bursty (morning peak), so a flat $100/\text{s} \times 86{,}400 \approx 8.6$M/day over-counts; call it **$1\text{–}5$M orders/day**. - Order storage: $5\text{M/day} \times 1.5\text{KB} \approx 7.5\text{GB/day} \approx 2.7\text{TB/yr}$ — trivially handled by a partitioned relational store; archive cold orders to object storage after ~90 days. - Realtime fan-out $\approx$ orders in flight. With $\sim 5$ min from `PLACED` to `READY` and $100/\text{s}$, that's $\sim 30\text{K}$ in-flight orders → $\sim 30\text{K}$ concurrent SSE/WebSocket connections — comfortable for a small fleet of stateless gateways. **Takeaway**: raw write throughput (100/s) is *not* the hard part. The hard parts are (a) idempotent order create + effectively-once payment capture, and (b) low-latency per-store realtime fan-out. No exotic infra needed for throughput. --- ### 3) High-level architecture ``` Customer app ─┐ ┌─ Menu Service ── Menu DB (doc) + Redis + CDN (web/mobile) │ │ ├─ API Gateway / BFF ─────┼─ Order Service ── Orders DB (Postgres, sharded by store_id) Barista app ──┘ (auth, rate limit, │ │ └─ Outbox table routing) │ │ ├─ Payment Service ── Payments DB ── PSP (Stripe/etc.) │ ▲ webhook ├─ Queue/Projection Service ── Redis (per-store sorted sets) │ └─ Realtime Gateway (WebSocket/SSE) ── Notification Service ▲ Event Bus (Kafka/PubSub): OrderPlaced, PaymentCaptured, OrderStatusChanged ``` **Services** - **API Gateway / BFF**: TLS, authN, rate limiting, routing, thin per-client aggregation (mobile vs. barista tablet). - **Menu Service** (read-heavy): per-store menu + pricing/modifier config. Document store, fronted by Redis + CDN, version-based invalidation. - **Order Service** (transactional core): order creation, pricing, the state machine, cancellation. Owns the source-of-truth `orders` table and an **outbox** for reliable event emission. - **Payment Service**: owns payment intents/results, talks to the PSP, and consumes PSP webhooks as the source of truth for online capture. Also records **pay-at-store capture** when a barista marks an order paid. - **Queue/Projection Service**: consumes order events into a fast, store-scoped read model for baristas (Redis sorted sets keyed by store). - **Realtime Gateway**: holds customer/barista WebSocket/SSE connections, pushes status changes from the bus. - **Notification Service**: push/SMS/email on `READY` and `CANCELLED`. - **Event Bus**: decouples writes from projections, notifications, analytics. **Why split Order and Payment?** Different consistency and failure profiles — payment involves a slow external PSP and webhook callbacks, and we never want a PSP timeout to roll back a created order. The **outbox + event bus** lets the order commit locally while the rest happens asynchronously and reliably. --- ### 4) Data model Relational for orders/payments (transactions, clear invariants); a document store for the flexible, deeply-nested menu. ```sql -- Catalog (per store; many fields store-overridable) stores(store_id PK, name, timezone, address, tax_rate, hours_json, is_online BOOL, capacity_config_json) menu_items(item_id PK, store_id FK, name, category, base_price_cents, is_available BOOL, version BIGINT) modifiers(mod_id PK, item_id FK, name, price_delta_cents, group_name, -- e.g. "Milk", "Size" selection_rule) -- one-of | many-of | required -- Orders (sharded by store_id) orders(order_id PK, user_id, store_id FK, status, fulfillment_type, -- PICKUP | IN_STORE pay_mode, -- ONLINE | AT_STORE subtotal_cents, tax_cents, discount_cents, total_cents, currency, idempotency_key UNIQUE, -- dedupe key for create requested_for TIMESTAMPTZ NULL, -- reserved for order-ahead created_at, updated_at) order_lines(line_id PK, order_id FK, item_id, qty, unit_price_cents, -- snapshot at order time modifiers_json, -- chosen modifiers + deltas, snapshotted line_total_cents) -- One row per capture, ONLINE or AT_STORE: the single "money moved" record. payments(payment_id PK, order_id FK, provider, -- 'stripe' (online) | 'in_store' (POS) intent_id, -- PSP intent; NULL for in_store provider_ref UNIQUE, -- PSP charge id OR POS receipt id; dedupe on this status, -- REQUIRES_PAYMENT | CAPTURED | FAILED | REFUNDED captured_by, -- barista actor_id, for in_store amount_cents, currency, created_at, updated_at) -- Audit / event log of status (append-only) order_events(event_id PK, order_id FK, event_type, from_status, to_status, actor_id, actor_role, payload_json, created_at) -- Reliable event emission (written in the same txn as the order) outbox(id PK, aggregate_id, topic, payload_json, published BOOL, created_at) ``` **Design notes** - **Prices are snapshotted** onto `order_lines`/`orders` at creation. A later menu price change can't alter a placed order — a correctness requirement, not an optimization. - `idempotency_key UNIQUE` — the database, not application logic, is the final arbiter of "duplicate." - `provider_ref UNIQUE` — a redelivered PSP webhook (or a re-tapped POS "mark paid") can't capture twice. - **Pay-at-store capture is a real `payments` row**: when the barista collects cash/card and marks paid, the Order Service writes `provider='in_store'`, `status='CAPTURED'`, and a POS-receipt `provider_ref`. This gives both payment paths the same "money moved" backing record, so spend queries (§13) key off `payments` uniformly. - Modifiers stored as a JSON snapshot on the line — the catalog `modifiers` table can change freely without rewriting history. --- ### 5) Key APIs ``` # Menu (cacheable, ETag/version on response) GET /v1/stores/{storeId}/menu -> menu + modifiers + prices # Order POST /v1/orders (Idempotency-Key header required) body: { store_id, fulfillment_type, items:[{item_id, qty, modifier_ids:[...]}], payment_method } resp: { order_id, status, subtotal, tax, discount, total, client_secret? } GET /v1/orders/{orderId} -> full order + status POST /v1/orders/{orderId}/cancel -- CUSTOMER self-service cancel, only while PLACED or PAID (before the barista starts it). Cancelling IN_PROGRESS is a STAFF /transition. # Payment POST /v1/payments/webhook -- PSP -> us; signature-verified PaymentCaptured / PaymentFailed # Barista / status (RBAC: barista role, scoped to their store) GET /v1/stores/{storeId}/queue?status=PLACED,IN_PROGRESS,READY POST /v1/orders/{orderId}/transition body: { to_status, expected_status } -- the ONE authoritative state-change endpoint. Staff advance status here, -- and cancel an IN_PROGRESS order ({to_status: CANCELLED}) -> kicks off refund. # Realtime WS /v1/orders/{orderId}/stream -- customer streams their order's events WS /v1/stores/{storeId}/queue/stream -- barista tablet streams the live queue ``` Server **re-prices** on `POST /orders` from authoritative store config — client totals are advisory only. Both `/cancel` and `/transition` carry/derive `expected_status` for optimistic concurrency (compare-and-set). There is exactly **one** place transitions are validated and applied — the Order Service's `transition()` (§7) — and both endpoints funnel into it, so there's no looser cancel path that could bypass the state machine. --- ### 6) Core flows **A) Place order with online payment** ``` 1. Client POST /orders with Idempotency-Key. 2. Order Service: a. INSERT ... ON CONFLICT (idempotency_key) DO NOTHING. If conflict: load and return the existing order (idempotent retry). b. Validate items belong to the store & are available; re-price server-side. c. In ONE txn: orders(status=PLACED, pay_mode=ONLINE) + order_lines + order_events + outbox('OrderPlaced'). COMMIT. 3. Payment Service creates a PSP payment intent; returns client_secret. 4. Client confirms with the PSP directly (SCA/3DS happens client-side). 5. PSP -> POST /payments/webhook (PaymentCaptured). Verify signature, dedupe on provider_ref, write payments(provider='stripe', CAPTURED), transition PLACED->PAID, emit OrderStatusChanged. 6. Queue projection adds the order to the store's barista queue (after PAID, for online-pay orders). ``` **B) Pay-at-store**: skip steps 3–5; create the order `pay_mode=AT_STORE`, queued at `PLACED`. The state machine allows `PLACED → IN_PROGRESS` without `PAID` for this path. At handoff the barista marks it paid: the Order Service writes a `payments` row (`provider='in_store'`, `CAPTURED`, `captured_by=barista`, POS-receipt `provider_ref`) **in the same transaction** as the `READY → COMPLETED` transition. So a pay-at-store order never carries a `PAID` order-status, yet "money moved" is recorded when cash/card is collected — a completed order has a CAPTURED payment, one abandoned in the queue does not. **C) Status update** ``` Barista POST /orders/{id}/transition {to_status: IN_PROGRESS, expected_status: PAID} -> validate transition against the state machine, compare-and-set on current status, append order_events, write outbox. -> OrderStatusChanged flows to Realtime Gateway (customer sees it) and Notification Service (push when READY). ``` --- ### 7) The order state machine (the hard part) Transitions are **explicit and enforced server-side**; clients never set status directly. A transition table plus a compare-and-set prevents two baristas double-advancing the same order. ```python ALLOWED = { "PLACED": {"PAID", "IN_PROGRESS", "CANCELLED"}, # IN_PROGRESS only if pay_at_store "PAID": {"IN_PROGRESS", "CANCELLED"}, "IN_PROGRESS": {"READY", "CANCELLED"}, # staff cancel here triggers refund "READY": {"COMPLETED"}, # a made drink isn't cancellable "COMPLETED": set(), # terminal "CANCELLED": set(), # terminal } def transition(order_id, to_status, expected_status, actor): # 1. Validate BEFORE touching the DB. .get(..., set()) so a terminal/unknown # expected_status is a clean rejection, not a KeyError. Nothing written yet. if to_status not in ALLOWED.get(expected_status, set()): raise InvalidTransition(expected_status, to_status) # 2. Atomic compare-and-set: advance only if DB status still matches expected. # The WHERE clause is the concurrency guard. rows = db.execute(""" UPDATE orders SET status = :to, updated_at = now() WHERE order_id = :id AND status = :expected RETURNING order_id """, to=to_status, id=order_id, expected=expected_status) if not rows: # someone else moved it, or stale read raise ConflictError("status changed concurrently") # 3. Audit + outbox in the SAME transaction as the UPDATE (§9). append_event(order_id, expected_status, to_status, actor) write_outbox(order_id, "OrderStatusChanged", {"to": to_status}) ``` Validation comes **first**, so an illegal target (e.g. `READY` from `PAID`) is rejected before any row is written; the DB never holds an illegal status. The `WHERE status = :expected` clause is the concurrency guard — if a teammate already advanced the order, the `UPDATE` touches 0 rows and we reject the stale transition. Validity check + UPDATE + audit + outbox run in one transaction, so a failure at any step leaves the order untouched. **Cancellation scope**: `CANCELLED` is reachable from `PLACED`, `PAID`, and `IN_PROGRESS` — every non-terminal state up to and including `IN_PROGRESS`, but **not** from `READY` (a freshly-made drink is completed or written off out-of-band, not cancelled). This is a deliberate narrowing of the prompt's literal `READY → COMPLETED/CANCELLED`, which I'd call out to the interviewer. The customer `/cancel` endpoint only permits the `PLACED`/`PAID` window; cancelling `IN_PROGRESS` is a staff `/transition`, and that `IN_PROGRESS → CANCELLED` on a paid order is what kicks off a PSP refund. --- ### 8) Real-time status delivery Updates must land within seconds over long-lived, flaky mobile connections. - **Transport**: WebSocket for the barista tablet (bidirectional, persistent); **SSE** is simpler and sufficient for the customer's one-way status stream. Fall back to **short polling** of `GET /orders/{id}` if the socket drops. - **Fan-out path**: `OrderStatusChanged` → event bus → Realtime Gateway nodes. Customer connections keyed by `order_id`; barista connections subscribe to a `store_id` channel. - **Stateless gateways + Redis pub/sub**: any gateway node can hold any connection, so publish to a Redis pub/sub channel per `store_id` (and per `order_id`); the publisher needn't know which node holds the socket. $\sim 30\text{K}$ concurrent connections is comfortable. - **Catch-up on reconnect**: the stream returns the **current** order status on connect (read from `orders`), so a client that missed an event while offline is immediately correct. Events carry a monotonically increasing per-order sequence; a client can request a replay from `order_events`. --- ### 9) Consistency, idempotency & failure handling This is where the design earns its keep. - **Idempotent order create**: client supplies an `Idempotency-Key`; the `UNIQUE` constraint makes "create the same order twice" physically impossible. Double-click / network-retry returns the same `order_id`. - **Effectively-once payment capture**: PSP webhook delivery is at-least-once, so make the *consumer* idempotent rather than claiming exactly-once delivery. Treat **PSP webhooks as the source of truth** for online payments, never the client's "success" callback (it can lie or drop). Dedupe on `UNIQUE provider_ref` — a redelivered `PaymentCaptured` is a no-op, net effect one capture. Pay-at-store "mark paid" dedupes the same way (POS receipt id = `provider_ref`). **Verify the webhook signature** before trusting it. - **Atomic write + reliable events (outbox)**: the order row and its `outbox` event are written in the **same DB transaction**, so an order can't commit without its event (or vice versa). A relay polls `outbox` (or via CDC/logical decoding), publishes to the bus, and marks rows `published`. This avoids the dual-write problem (DB commit succeeds, publish fails). - **State machine + compare-and-set**: invalid transitions rejected in-process before any write; concurrent transitions rejected by the DB compare-and-set (§7). - **PSP/order divergence**: an order can be `PLACED` but never confirm (customer abandons 3DS). A sweeper expires unpaid online orders after $N$ minutes (`PLACED → CANCELLED`), releasing the slot. Conversely, if payment captured but the order row didn't update, the webhook handler is the reconciliation point and retries the transition. - **Refunds**: cancelling a `PAID`/`IN_PROGRESS` online order emits a refund command; the order reaches `CANCELLED` only after the refund is acknowledged (or is marked `REFUND_PENDING`). For pay-at-store, a cancel before capture moves no money (no `payments` row yet) — nothing to refund. - **Retries**: all write endpoints are safe to retry (idempotent); reads are trivially retryable. --- ### 10) Scaling to many stores The natural partition key is the **store** — orders, queues, and barista views are almost always store-scoped. - **Shard `orders` by `store_id`** (or a hash). All of a store's reads/writes hit one shard; no cross-shard query on the hot path. Adding stores adds shards horizontally. - **Per-store queue projections**: a **Redis sorted set per store**, score = priority/`created_at`, gives baristas $O(\log N)$ insert and $O(\log N + k)$ range reads — fast and isolated. Rebuildable from `order_events` if Redis is lost. - **Menu at the edge**: per-store, rarely-changing menus served from CDN with version/ETag; invalidate by bumping the store's menu `version`. Near-100% hit ratio removes menu reads from the origin. - **Realtime gateways scale horizontally** (stateless + Redis pub/sub), partitioned by store channel. - **Async everything non-critical**: notifications, analytics, projections are bus consumers, so a slow notification provider never slows order creation. - **Multi-region (later)**: pin a store's data to the nearest region; cross-region orders are rare, keeping latency low without global consensus. **Bottleneck honesty**: the real hot spots are (1) the Payment Service ↔ PSP path (external, slow — must be async + idempotent) and (2) realtime connection count at peak (mitigated by stateless gateways). Order writes at $\sim 100/\text{s}$ are not a bottleneck for a sharded relational store. --- ### 11) Security, privacy, observability - **AuthZ (RBAC)**: customers see only their orders; baristas are scoped to their store's queue; admins manage menu/stores. Enforced at the gateway and re-checked in services. - **Payments**: never store raw card data — tokenize via the PSP (PCI scope minimization). Verify every webhook signature. - **PII**: minimize and encrypt customer contact fields at rest; redact in logs. - **Observability**: order-create p99, payment success rate, **time-in-status per stage** (SLA on `PLACED→READY`), per-store queue length, webhook failure rate, outbox publish lag. Distributed tracing across Order → Payment → Queue. Alert on queue-length spikes (a store falling behind) and outbox lag (events not flowing). --- ### 12) Edge cases & follow-up scenarios - **Item sells out after order**: barista marks unavailable → order offered cancel/refund or substitution. - **Lost PSP webhook — payment captured, order stuck (follow-up #3)**: the PSP captured the money but the `PaymentCaptured` webhook never arrived, so the order is stranded at `PLACED`. Two mechanisms reconcile it, both already in the design (§9): (1) the **webhook handler is the reconciliation point** — when a later/redelivered webhook does land it re-runs the `PLACED → PAID` transition, and because capture is idempotent on `provider_ref` and the transition is a compare-and-set, replay is safe; (2) for the case where *no* webhook ever arrives, a **reconciliation sweeper** periodically pulls payment status from the PSP (or its events API) for orders that have been `PLACED` past a threshold, and on finding a CAPTURED charge writes the `payments` row and drives the same `PLACED → PAID` transition. The unpaid-order sweeper (§9) handles the inverse — genuinely abandoned orders are expired `PLACED → CANCELLED`. Net: a lost webhook never permanently strands a paid order. - **Store loses connectivity during the rush (follow-up #4, part 1)**: set `is_online=false` and reject new *online* orders for that store with a clear message (don't silently queue payments we can't process). The **in-store POS keeps taking orders locally** in a local buffer with locally-generated order ids that already carry an `idempotency_key`; on reconnect it replays them to the cloud. Because create is idempotent on `idempotency_key` and capture is idempotent on `provider_ref`, replaying a batch can't duplicate orders or double-charge — orders the cloud already saw are no-ops. No order or payment is lost; at worst it's delayed until reconnect. - **Scheduled order-ahead (follow-up #4, part 2)**: the `requested_for` field already reserves this. The change is mostly in *scheduling*, not the core model: don't inject the order into the live barista queue until `requested_for - lead_time`; the Queue/Projection Service uses `requested_for` (not `created_at`) as the sorted-set score so the queue is ordered by promised time; a scheduler fires the "make it now" event near pickup; capacity config per store caps how many ahead-orders a slot accepts. Payment can be captured up front or at pickup. The state machine is unchanged. - **Duplicate / out-of-order events**: handled by `provider_ref` uniqueness and per-order event sequence numbers. - **Timezone/tax per store**: ready-time estimates and tax come from the store's `timezone`/`tax_rate`, never the client's locale. --- ### 13) Follow-up: the two SQL questions Because the design records **every capture — online and pay-at-store — as a `CAPTURED` row in `payments`** (§4), "money that actually moved" has one uniform definition across both payment paths. I key the spend queries off `payments`. (If asked to answer over `orders` alone, I'd flag that order-status is *not* a sound proxy for spend: a pay-at-store order reaches `IN_PROGRESS`/`READY`/`COMPLETED` whether or not cash was collected, so no `orders.status` set cleanly equals "paid." That's exactly why real spend must look at captures.) **(a) The user who spent the most last month** ```sql SELECT o.user_id, SUM(p.amount_cents) AS spend_cents FROM payments p JOIN orders o ON o.order_id = p.order_id WHERE p.status = 'CAPTURED' -- money moved (online OR in_store) AND p.created_at >= date_trunc('month', current_date) - interval '1 month' AND p.created_at < date_trunc('month', current_date) -- half-open [first-of-last-month, first-of-this-month) GROUP BY o.user_id ORDER BY spend_cents DESC LIMIT 1; ``` Summing `payments.amount_cents` (not `orders.total_cents`) also makes partial refunds fall out naturally if we later record `REFUNDED`/partial-capture rows — spend follows the money, not the order header. (Use `LIMIT 1 WITH TIES` or a window function if ties matter.) **(b) Users who have never spent anything** (registered but no captured payment, ever) The question is a **per-user** existence test ("does this user have *any* captured payment?"), so the anti-join must be expressed per user — `NOT EXISTS` is the clean, correct shape: ```sql SELECT u.user_id FROM users u WHERE NOT EXISTS ( SELECT 1 FROM orders o JOIN payments p ON p.order_id = o.order_id WHERE o.user_id = u.user_id AND p.status = 'CAPTURED' -- same "captured = spend" rule as (a) ); ``` **Why not the chained `users → orders → payments` LEFT JOIN + `IS NULL`?** It's the obvious-looking form, and it's *wrong* here. Joining out to per-order rows means a user with two orders — one captured, one not — produces a surviving `payment_id IS NULL` row for the un-captured order, so a user who *did* spend is returned as "never spent." The LEFT-JOIN anti-join only works if you first collapse to one row per user (e.g. anti-join against a `DISTINCT user_id` set of spenders), which `NOT EXISTS` does implicitly. If you do write it as a LEFT JOIN, anti-join against the **per-user spender set**, not the per-order payment rows: ```sql SELECT u.user_id FROM users u LEFT JOIN ( SELECT DISTINCT o.user_id FROM orders o JOIN payments p ON p.order_id = o.order_id WHERE p.status = 'CAPTURED' ) spenders ON spenders.user_id = u.user_id WHERE spenders.user_id IS NULL; ``` Two further classic traps worth naming: the `status = 'CAPTURED'` predicate must live in the spender subquery (an inner `JOIN` here), **not** in an outer `WHERE` on the right side of a `LEFT JOIN` — that would silently collapse the outer join into an inner one and drop the users we want. And prefer `NOT EXISTS` over `NOT IN (SELECT ...)`, which returns *no rows at all* if the subquery yields any `NULL`. (Assumes a `users` table — a standard interview assumption.) --- ### Summary A small set of services around a transactional **Order Service** (sharded by `store_id`), a decoupled **Payment Service** that trusts PSP webhooks for online pay and records a POS capture row for pay-at-store, and a fast **per-store Redis queue projection** for baristas. Correctness is bought cheaply with three primitives: a `UNIQUE` idempotency key on order create, a `UNIQUE provider_ref` for effectively-once capture on **both** payment paths, and an **outbox** so events never diverge from the DB. State changes — including the only cancellation that triggers a refund, `IN_PROGRESS → CANCELLED` — funnel through one server-side `transition()` guarded by a compare-and-set. Real-time status rides the event bus to stateless WebSocket/SSE gateways. Throughput ($\sim 100$ writes/s) is easy; the real engineering is idempotency, the payment/PSP failure surface, store-offline replay, and per-store realtime fan-out — and that's where the design concentrates its effort.

Related Interview Questions

  • Design a TikTok-like short video platform - Salesforce (medium)
  • Design a scalable coffee ordering notification system - Salesforce (medium)
  • Design a configurable monthly API rate limiter - Salesforce (medium)
  • Design a pipeline orchestration system on Kubernetes - Salesforce (medium)
  • Design an async job system and cache layer - Salesforce (hard)
Salesforce logo
Salesforce
Jan 22, 2026, 12:00 AM
Software Engineer
Technical Screen
System Design
755
0

System Design: Coffee Ordering System

Design a system for a coffee shop (and later a chain of stores) that lets customers order drinks for pickup or in-store consumption, and lets baristas work the order queue.

This was a time-boxed (~30 minute) system design round, so prioritize the core ordering flow, the order state machine, and how it scales from one store to many — depth over breadth.

The system must support these core use cases:

  • A customer browses a store's menu and places an order for pickup or in-store .
  • The customer customizes each item (size, milk type, add-ons) and sets a quantity per item.
  • The system computes the total price (line totals, tax, any discounts) server-side .
  • Payment is either processed online or the order is marked pay-at-store .
  • Baristas see a live queue of incoming orders and advance each order's status through a state machine: PLACED → PAID (optional) → IN_PROGRESS → READY → COMPLETED / CANCELLED .
  • The customer sees near-real-time status updates and is notified when the drink is ready.

Walk through the APIs/events, the data model, the high-level architecture and major services, how you keep the system correct under retries and failures, and how you scale from one store to many.

Constraints & Assumptions

  • Start with one store , then extend to many stores (and eventually multiple regions).
  • Peak load: thousands of orders/minute across all stores — treat this as roughly ≈100\approx 100≈100 orders/sec at peak. Assume reads (menu browsing, status checks) dominate writes, on the order of a 50:150{:}150:1 ratio.
  • Status updates should feel near-real-time (low single-digit seconds) to customers.
  • Money must be correct: no duplicate orders, no double charges , even under client retries and at-least-once webhook delivery. Correctness for money and order submission takes priority over availability.
  • Pricing and tax are computed per store (sales tax varies by jurisdiction).
  • Menus change rarely and are highly cacheable; prices on a placed order must not change retroactively.
  • Out of scope unless you choose to mention hooks: inventory/supply chain, loyalty accrual, delivery/driver dispatch, POS hardware specifics.

Clarifying Questions to Ask

  • Is pricing and tax store-specific (sales tax varies by jurisdiction)? Does each store have its own menu and availability?
  • Do we need order-ahead with a scheduled pickup time , or only "make it now"?
  • For pay-at-store orders, when is the order allowed to start being made — before or only after payment is collected, and who records that money was taken?
  • Should an order be cancellable after the barista has started it, and who is allowed to cancel (customer vs. staff)?
  • Single region first, or multi-region from day one?
  • Roughly how many stores are we targeting, and what read:write ratio should we design around?

What a Strong Answer Covers

  • Requirements split into functional vs. non-functional, with explicit, sensible scoping of what's in and out.
  • Back-of-envelope sizing that justifies component choices (write QPS, read QPS, storage growth, concurrent realtime connections).
  • A clear, server-enforced order state machine with explicit allowed transitions and concurrency control (two baristas can't double-advance the same order).
  • Idempotent order creation and effectively-once payment capture , with the PSP webhook (not the client) treated as the source of truth for online payments.
  • A data model that snapshots prices/modifiers onto the order so later menu edits don't change historical orders, with idempotency encoded as DB constraints.
  • Reliable event propagation (e.g. outbox + event bus ) and a justified read/write split, with menus served from cache/CDN.
  • A realtime delivery strategy (WebSocket/SSE, reconnect/catch-up behavior, polling fallback) sized against concurrent in-flight orders.
  • A coherent sharding/partitioning story (by store) and honest identification of the real bottlenecks (payment path, realtime connection count) versus the modest raw write throughput.
  • Failure handling : abandoned payments, refunds on cancellation, store-offline behavior, duplicate/out-of-order events.
  • Security, privacy, and observability : RBAC, PCI scope minimization, the key SLOs and alerts.

Follow-up Questions

After the design, the interviewer for this round asked two SQL questions against the resulting schema, then two scenario follow-ups on the design:

SQL against the schema:

  1. Write a query to find the user who spent the most last month . What does "spend" mean precisely given your data model?
  2. Write a query to find users who have never spent anything . How do you avoid the classic NULL and LEFT JOIN -becomes-inner-join traps?

Scenario follow-ups on the design:

  1. A payment is captured by the PSP but the webhook that would advance the order to PAID is lost. How does the order eventually reconcile?
  2. How would you handle a store losing connectivity to the cloud during the morning rush without losing or duplicating orders, and how does your design change to support scheduled order-ahead (e.g. "have it ready at 8:15am")?

Solution

Show

Submit Your Answer to Earn 20XP

Sign in to leave a comment

Loading comments...

Browse More Questions

More System Design•More Salesforce•More Software Engineer•Salesforce Software Engineer•Salesforce System Design•Software Engineer System Design
PracHub

Master your tech interviews with 8,000+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.