Design a Payment Processing Service (Merchant to Payment Provider)
Company: OpenAI
Role: Software Engineer
Category: System Design
Difficulty: medium
Interview Round: Technical Screen
You are designing the backend **payment processing service** that sits between online **merchants** and external **payment providers** (card networks and issuing/acquiring banks). A merchant's server calls your API to charge a cardholder; your service orchestrates the **authorization → capture → settlement** lifecycle with the upstream provider, records an immutable history of every money movement, and reliably reports the final outcome back to the merchant.
**Scope.** Model only the **merchant-to-provider** path. You do *not* need to design the consumer checkout UI, the card-entry form, or consumer-side fraud/behavior modeling. Assume the cardholder's card details arrive (as a token or PAN) with the merchant's request, and that one or more upstream providers exist that can actually move the money.
### Constraints & Assumptions
- Peak load ~**5,000 charge requests/second**; each charge is a small, durable write.
- Upstream providers are third parties: p99 latency ranges from hundreds of milliseconds to a few seconds, with occasional timeouts, partial outages, and *unknown* outcomes.
- **Money correctness is the top priority**: never double-charge, never lose a successful charge, and always reconcile to the cent.
- Multiple acquirers/providers exist; a charge is routed to one of them based on card BIN, currency, or provider health.
- Merchant notifications (webhooks) must be **at-least-once** with eventual delivery.
- PCI scope must be minimized: prefer tokenization; never log raw card numbers.
### Clarifying Questions to Ask
- Which rails are in scope — cards only, or also ACH/wallets? (Assume cards.)
- Must authorization be **synchronous** in the merchant's API call, or is an async "pending" acknowledgment acceptable?
- One acquirer, or **multiple** with routing and failover?
- Do we own the card vault (PCI-DSS Level 1) or tokenize through the provider?
- Capture model: combined **sale** (auth + capture), or **separate** auth then later capture?
- Are refunds, voids, and partial captures in scope?
- Single currency or **multi-currency** settlement?
### Part 1 — High-level architecture and the synchronous charge (authorization) path
Sketch the end-to-end components and walk through a single **successful** charge from the merchant's API call to the response. Show where the request is authenticated and validated, how it reaches the provider, what state transitions occur, and exactly what the merchant sees **synchronously** versus what happens afterward.
```hint Where to start
Treat your service as an orchestrator in front of a slow, flaky third party: separate the fast "authenticate, validate, and record intent" step from the slow "talk to the provider" step.
```
```hint Latency boundary
Decide up front which steps must complete inside the merchant's HTTP call (the approve/decline decision) versus which can be deferred (capture, settlement, webhooks).
```
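One way to make the state transitions concrete is an explicit transition table. The sketch below is illustrative only: the state names (`CREATED`, `AUTHORIZING`, `UNKNOWN`, and so on) and the `transition` helper are assumptions, not part of the question.

```python
from enum import Enum


class ChargeState(Enum):
    # Hypothetical state names chosen for illustration.
    CREATED = "created"
    AUTHORIZING = "authorizing"
    AUTHORIZED = "authorized"
    DECLINED = "declined"
    UNKNOWN = "unknown"      # provider call timed out; outcome not yet known
    CAPTURED = "captured"
    SETTLED = "settled"
    VOIDED = "voided"


# Allowed transitions; anything not listed here is rejected.
ALLOWED = {
    ChargeState.CREATED: {ChargeState.AUTHORIZING},
    ChargeState.AUTHORIZING: {
        ChargeState.AUTHORIZED,
        ChargeState.DECLINED,
        ChargeState.UNKNOWN,
    },
    ChargeState.UNKNOWN: {ChargeState.AUTHORIZED, ChargeState.DECLINED},
    ChargeState.AUTHORIZED: {ChargeState.CAPTURED, ChargeState.VOIDED},
    ChargeState.CAPTURED: {ChargeState.SETTLED},
    ChargeState.DECLINED: set(),
    ChargeState.SETTLED: set(),
    ChargeState.VOIDED: set(),
}

# States the merchant can receive inside the synchronous HTTP response.
SYNCHRONOUS_OUTCOMES = {ChargeState.AUTHORIZED, ChargeState.DECLINED}


def transition(current: ChargeState, target: ChargeState) -> ChargeState:
    """Return the new state, or raise if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```

In this sketch, capture and settlement sit outside `SYNCHRONOUS_OUTCOMES`, which reflects the latency boundary: they would be reported to the merchant afterward rather than in the original response.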
#### Clarifying Questions for this Part
- Does the merchant require a real-time approve/decline in the response, or is a durable "accepted/pending" acknowledgment acceptable for some flows?
### Part 2 — Data model, API surface, and idempotency
Define the core API (create charge, capture, refund, get status) and the data model (charges, attempts, ledger). Then specify **exactly** how a retried or duplicated request never results in a double charge.
```hint Exactly-once
Clients will retry on timeouts. Lean on a client-supplied idempotency key plus a uniqueness constraint, and persist the response so a retry replays the original result instead of re-charging.
```
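A minimal sketch of the idempotency-key pattern from the hint above, using an in-memory SQLite table as a stand-in for the real database. The table layout, the `create_charge` signature, and the `authorize` callback are all assumptions made for illustration.

```python
import hashlib
import json
import sqlite3


class IdempotencyConflict(Exception):
    """Same key reused with different request parameters."""


def open_store() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    # The primary key is the uniqueness constraint that blocks duplicates.
    db.execute(
        """CREATE TABLE idempotency (
               merchant_id TEXT NOT NULL,
               idem_key    TEXT NOT NULL,
               fingerprint TEXT NOT NULL,
               response    TEXT,
               PRIMARY KEY (merchant_id, idem_key))"""
    )
    return db


def fingerprint(params: dict) -> str:
    """Stable hash of the request body, used to detect key reuse."""
    return hashlib.sha256(json.dumps(params, sort_keys=True).encode()).hexdigest()


def create_charge(db, merchant_id, idem_key, params, authorize):
    """`authorize` is a hypothetical callable that talks to the provider."""
    fp = fingerprint(params)
    try:
        with db:  # commits on success, rolls back on error
            db.execute(
                "INSERT INTO idempotency (merchant_id, idem_key, fingerprint) "
                "VALUES (?, ?, ?)",
                (merchant_id, idem_key, fp),
            )
    except sqlite3.IntegrityError:
        stored_fp, stored_response = db.execute(
            "SELECT fingerprint, response FROM idempotency "
            "WHERE merchant_id = ? AND idem_key = ?",
            (merchant_id, idem_key),
        ).fetchone()
        if stored_fp != fp:
            raise IdempotencyConflict("key reused with different parameters")
        if stored_response is None:
            return {"status": "processing"}  # first request still in flight
        return json.loads(stored_response)  # replay the original result

    response = authorize(params)
    with db:
        db.execute(
            "UPDATE idempotency SET response = ? "
            "WHERE merchant_id = ? AND idem_key = ?",
            (json.dumps(response), merchant_id, idem_key),
        )
    return response
```

A retry with the same key and body replays the stored response without calling `authorize` again, while the same key with a different body raises `IdempotencyConflict`. The sketch does not handle the case where `authorize` itself fails midway; a fuller design would route that into the unknown-outcome path.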
```hint Model the entities
A *charge* and an *attempt* are different things — one charge may make several attempts against a provider. Model them separately.
```
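The charge/attempt split could be modeled roughly as follows. The field names and the status strings are assumptions for illustration; the point is only that a charge's status is derived from its attempts.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Attempt:
    """One call to one provider. Field names are hypothetical."""
    attempt_id: str
    provider: str
    status: str  # "approved", "declined", or "unknown"
    provider_reference: Optional[str] = None


@dataclass
class Charge:
    """The merchant-visible intent to collect money."""
    charge_id: str
    merchant_id: str
    amount_minor: int  # integer minor units (cents), never floats
    currency: str
    attempts: List[Attempt] = field(default_factory=list)

    def status(self) -> str:
        """Derive the charge status from its attempts."""
        statuses = [a.status for a in self.attempts]
        if "approved" in statuses:
            return "authorized"
        if "unknown" in statuses:
            return "pending"  # must be resolved before any new attempt
        if statuses:
            return "declined"
        return "created"
```

Keeping attempts as separate records preserves the history of every provider call, which is what reconciliation later depends on.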
### Part 3 — Consistency, the ledger, and provider failures (the "unknown outcome")
The provider call can **time out with an unknown result** — you don't know whether the card was charged. Design how the system stays consistent: the source of truth, how unknown outcomes are resolved, and how asynchronous capture, settlement, and reconciliation work.
```hint A timeout is not a decline
Don't blindly re-authorize after a timeout — that risks a double charge. Design a status-inquiry / reconciliation path that first determines whether the charge actually happened.
```
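The inquiry-before-retry idea might look like the sketch below. The provider interface (`authorize` and `lookup`, both keyed by a caller-chosen `reference`) is hypothetical; real providers expose status inquiry in different ways.

```python
def authorize_with_inquiry(provider, reference, amount_minor, max_inquiries=3):
    """Authorize once; on timeout, ask what happened instead of re-charging.

    Assumes a hypothetical provider object with:
      authorize(reference, amount_minor) -> "approved" | "declined"
      lookup(reference) -> "approved" | "declined" | None (no record yet)
    Both may raise TimeoutError.
    """
    try:
        return provider.authorize(reference, amount_minor)
    except TimeoutError:
        pass  # outcome unknown: the card may or may not have been charged

    # Never call authorize again here. Only ask about the original reference.
    for _ in range(max_inquiries):
        try:
            result = provider.lookup(reference)
        except TimeoutError:
            continue
        if result is not None:
            return result

    # Still unresolved: leave it for asynchronous reconciliation.
    return "unknown"
```

A production version would space the inquiries out with backoff and hand unresolved references to a background reconciliation job; both are omitted here to keep the sketch short.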
```hint Source of truth
Make money movements an **append-only, double-entry ledger**; derive balances from it rather than mutating a balance field in place.
```
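As a rough illustration of the append-only, double-entry idea, the sketch below posts balanced sets of lines and derives balances by summing entries. The account names and the sign convention (positive for debit, negative for credit) are assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Entry:
    txn_id: str
    account: str
    amount_minor: int  # assumed convention: positive = debit, negative = credit


class Ledger:
    """Append-only ledger: entries are added, never updated or deleted."""

    def __init__(self) -> None:
        self._entries: List[Entry] = []

    def post(self, txn_id: str, lines: List[Tuple[str, int]]) -> None:
        """Append one transaction; its lines must sum to zero."""
        if sum(amount for _, amount in lines) != 0:
            raise ValueError("unbalanced transaction")
        self._entries.extend(Entry(txn_id, acct, amt) for acct, amt in lines)

    def balance(self, account: str) -> int:
        """Derive a balance by summing entries instead of storing it."""
        return sum(e.amount_minor for e in self._entries if e.account == account)

    def total(self) -> int:
        """Always zero if every posted transaction was balanced."""
        return sum(e.amount_minor for e in self._entries)
```

For example, capturing a 10.00 charge with a 0.30 fee could be posted as `[("provider_receivable", 1000), ("merchant_payable", -970), ("fee_revenue", -30)]`. A refund would be a new transaction with the opposite signs, not an edit to the old entries.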
```hint Crossing two systems
Your DB and an external provider can't share one transaction, so two-phase commit doesn't apply — reach for the transactional **outbox**, idempotent retries, and reconciliation (eventual consistency).
```
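A transactional outbox can be sketched with two tables written in one local transaction, plus a relay that publishes pending rows. The schema and the `publish` callback below are assumptions; SQLite stands in for the service's database.

```python
import json
import sqlite3


def open_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        CREATE TABLE charges (charge_id TEXT PRIMARY KEY, status TEXT NOT NULL);
        CREATE TABLE outbox (
            event_id  INTEGER PRIMARY KEY AUTOINCREMENT,
            payload   TEXT NOT NULL,
            published INTEGER NOT NULL DEFAULT 0
        );
        """
    )
    return db


def record_authorized(db: sqlite3.Connection, charge_id: str) -> None:
    """Write the state change and its event in the same transaction."""
    event = {"type": "charge.authorized", "charge_id": charge_id}
    with db:  # both rows commit together or not at all
        db.execute(
            "INSERT INTO charges (charge_id, status) VALUES (?, 'authorized')",
            (charge_id,),
        )
        db.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(event),))


def relay_once(db: sqlite3.Connection, publish) -> int:
    """Publish pending events in order; returns how many were sent.

    `publish(event_id, event)` is a hypothetical callback. If it raises,
    the row stays unpublished and is retried on the next pass, so
    delivery is at-least-once and consumers must dedupe by event_id.
    """
    rows = db.execute(
        "SELECT event_id, payload FROM outbox WHERE published = 0 ORDER BY event_id"
    ).fetchall()
    sent = 0
    for event_id, payload in rows:
        publish(event_id, json.loads(payload))
        with db:
            db.execute("UPDATE outbox SET published = 1 WHERE event_id = ?", (event_id,))
        sent += 1
    return sent
```

Because the event row is committed with the charge row, a crash between the database write and the publish cannot lose the event; the worst case is a duplicate publish.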
### Part 4 — Scaling, reliability, and observability
Take the design to production scale. How do you handle thousands of charges/second, isolate a slow or failing provider, deliver webhooks reliably, and *know* the system is healthy and the money balances?
```hint Isolation
One slow upstream provider must not exhaust all your capacity — think bulkheads, bounded timeouts, retry budgets, circuit breakers, and per-provider queues.
```
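One of the isolation tools named above, a per-provider circuit breaker, can be sketched in a few lines. The thresholds and the injected `clock` parameter are assumptions, and the half-open handling is deliberately simplified.

```python
import time


class CircuitBreaker:
    """Minimal per-provider breaker (illustrative, not production-grade)."""

    def __init__(self, failure_threshold=5, reset_after_s=30.0, clock=time.monotonic):
        self._threshold = failure_threshold
        self._reset_after_s = reset_after_s
        self._clock = clock  # injectable so tests can control time
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def allow(self) -> bool:
        """Should a request be sent to this provider right now?"""
        if self._opened_at is None:
            return True
        # After the cool-down, let probe traffic through (half-open).
        # Simplification: this admits every caller, not just one probe.
        return self._clock() - self._opened_at >= self._reset_after_s

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self._threshold:
            self._opened_at = self._clock()  # open, or re-open after a failed probe
```

A router could keep one breaker per provider and skip any provider whose `allow()` returns `False`, which keeps a failing upstream from tying up request capacity.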
```hint Webhooks
At-least-once delivery means a durable queue, retries with backoff + jitter, signed payloads, and consumer-side idempotency by event id.
```
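The webhook ingredients listed in the hint can be sketched with the standard library. The signing scheme (HMAC-SHA256 over `timestamp.body`), the tolerance window, and the function names are assumptions rather than any specific provider's format.

```python
import hashlib
import hmac
import random


def sign(secret: bytes, timestamp: int, body: bytes) -> str:
    """HMAC-SHA256 over "<timestamp>.<body>" (an assumed scheme)."""
    message = str(timestamp).encode() + b"." + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(secret, timestamp, body, signature, now, tolerance_s=300) -> bool:
    """Reject stale timestamps (replay protection) and bad signatures."""
    if abs(now - timestamp) > tolerance_s:
        return False
    return hmac.compare_digest(sign(secret, timestamp, body), signature)


def backoff_delay(attempt: int, base_s=1.0, cap_s=3600.0, rng=random.random) -> float:
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2^attempt))."""
    return rng() * min(cap_s, base_s * 2 ** attempt)


def handle_once(seen_event_ids: set, event_id: str) -> bool:
    """Consumer-side dedupe: True only the first time an event id is seen."""
    if event_id in seen_event_ids:
        return False
    seen_event_ids.add(event_id)
    return True
```

In a real consumer the seen-id set would live in durable storage, and the sender would read the retry schedule from a durable queue rather than computing it in memory.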
### Follow-up Questions
- How would you support **separate authorization and delayed capture** (hold funds at order time, capture at shipment), including authorization expiry?
- A merchant resubmits with the **same idempotency key but a different amount** — what do you return, and why?
- How do you safely **retry a charge that timed out** without risking a double charge?
- Walk through end-of-day reconciliation: the provider's settlement file lists a charge your system has **no record of**. What happens?
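For the reconciliation follow-up, one possible starting point is a pure comparison between internal records and the provider's settlement file. The input shape (a mapping from provider reference to amount in minor units) and the bucket names are assumptions.

```python
def reconcile(internal: dict, settlement: dict) -> dict:
    """Compare internal records with a provider settlement file.

    Both inputs map provider_reference -> amount_minor (assumed shape).
    Returns references grouped by discrepancy type, sorted for stable output.
    """
    report = {
        "matched": [],
        "amount_mismatch": [],
        "missing_internally": [],   # provider settled something we never recorded
        "missing_at_provider": [],  # we recorded it, provider did not settle it
    }
    for ref, amount in settlement.items():
        if ref not in internal:
            report["missing_internally"].append(ref)
        elif internal[ref] != amount:
            report["amount_mismatch"].append(ref)
        else:
            report["matched"].append(ref)
    for ref in internal:
        if ref not in settlement:
            report["missing_at_provider"].append(ref)
    return {bucket: sorted(refs) for bucket, refs in report.items()}
```

Entries in `missing_internally` correspond to the scenario in the last question: they would typically be matched against attempts still marked unknown, and anything left over would be escalated rather than silently written into the ledger.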
Quick Answer: This system design question evaluates a candidate's ability to model a distributed payment processing service that coordinates authorization, capture, and settlement between merchants and external payment providers. It tests reasoning about idempotency, reconciliation, and correctness guarantees under network failure, all common concerns in high-throughput financial systems. It probes both conceptual understanding of distributed transaction integrity and practical architectural trade-offs.