Payment Systems: Ledgers, Idempotency, and Reconciliation

What's being tested

Interviewers are probing whether you can design money-moving systems that remain correct under retries, partial failures, duplicate messages, delayed callbacks, and inconsistent external providers. For a Software Engineer, the focus is not finance strategy; it is distributed-systems correctness, data modeling, API semantics, and operational recovery. Google cares because payment-like systems appear anywhere durable state must be exact: billing, ads spend, credits, quotas, refunds, subscriptions, and marketplace payouts. Strong answers separate the source of truth from derived views, define clear state transitions, and explain how to recover when systems disagree.

Core knowledge

Double-entry ledger design records every movement as balanced debit and credit entries, not as mutable account balances. A core invariant is: $\sum \text{debits} = \sum \text{credits}$ for each transaction batch. Balances should be materialized views over immutable journal entries, not the primary source of truth.
Ledger immutability is the foundation of auditability. Instead of updating “payment amount from $10 to $8,” append a correcting entry or reversal. This makes reconciliation, dispute handling, and historical debugging possible even months later.
Idempotency keys prevent duplicate side effects when clients retry after timeouts. A common pattern, used by systems like Stripe, is storing (merchant_id, idempotency_key) -> request_hash, response, status. Same key plus same payload returns the original response; same key plus different payload should fail with 409 Conflict.
Exactly-once effects are usually implemented with at-least-once delivery plus idempotent writes, not magical exactly-once networking. For example, POST /payments may be retried, webhooks may be delivered multiple times, and queue consumers may crash after committing to Postgres but before acknowledging a Kafka message.
Payment state machines should be explicit and monotonic where possible: CREATED -> AUTHORIZED -> CAPTURED -> SETTLED, or CREATED -> FAILED, with separate refund states. Avoid ambiguous booleans like paid=true; model terminal states, retryable states, and external provider references.
Outbox pattern helps coordinate database writes with asynchronous publishing. Write the ledger entry and an outbox_events row in one Postgres transaction, then a relay publishes to Pub/Sub or Kafka. This avoids the classic failure where the DB commit succeeds but the event publish fails.
Reconciliation compares internal records against external truth sources such as processor settlement files, bank reports, or provider APIs. Match on stable identifiers like provider_charge_id, amount, currency, merchant, and settlement date. Expect timing gaps, partial captures, fees, chargebacks, and currency rounding differences.
Balance computation should separate correctness from performance. The canonical balance is SUM(entries.amount) over immutable entries; for scale, maintain cached balances updated transactionally. For accounts with millions of entries, snapshot periodically and compute snapshot_balance + SUM(entries after snapshot).
Concurrency control matters when multiple requests affect the same account. Use database transactions, row-level locks, optimistic version checks, or compare-and-swap semantics. For example, decrementing prepaid credits should be atomic: UPDATE accounts SET balance = balance - x WHERE id = ? AND balance >= x. The caller must then check the affected row count, because an update that matches zero rows means insufficient funds rather than success.
Currency handling must avoid floating point. Store amounts as integers in the smallest currency unit, such as cents, or as fixed-precision decimals for currencies with nonstandard minor units. Include currency on every amount; never add USD and EUR balances without an explicit FX event.
Failure modes should be first-class design cases: client timeout after success, provider accepts payment but callback is delayed, duplicate webhook arrives, internal DB commits but worker crashes, settlement file has an unknown transaction, or refund succeeds externally but internal update fails.
Observability and audit trails should expose transaction IDs, idempotency keys, provider IDs, ledger batch IDs, and state transitions. Useful metrics include duplicate request rate, reconciliation mismatch count, unreconciled amount, pending settlement age, and payment state transition latency p99.
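
The double-entry and immutability points above can be sketched in a few lines. This is a minimal in-memory illustration, not a production design: the Entry and Ledger names, the batch IDs, and the convention that a balance is debits minus credits are all assumptions made for the example. The account names are borrowed from the worked example later in this guide.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """One immutable journal line; amounts are integers in minor units (cents)."""
    batch_id: str
    account: str
    currency: str
    debit: int = 0
    credit: int = 0


class Ledger:
    """Hypothetical append-only journal; balances are derived, never stored."""

    def __init__(self) -> None:
        self._journal: list[Entry] = []

    def post(self, entries: list[Entry]) -> None:
        """Append a batch only if debits equal credits within each currency."""
        net: dict[str, int] = defaultdict(int)
        for e in entries:
            if e.debit < 0 or e.credit < 0:
                raise ValueError("amounts must be non-negative")
            net[e.currency] += e.debit - e.credit
        if any(total != 0 for total in net.values()):
            raise ValueError("unbalanced batch")
        self._journal.extend(entries)

    def balance(self, account: str, currency: str) -> int:
        """Derived view: debits minus credits (sign convention assumed here)."""
        return sum(e.debit - e.credit for e in self._journal
                   if e.account == account and e.currency == currency)


ledger = Ledger()
# Original capture of 10.00 USD.
ledger.post([
    Entry("b1", "buyer_cash_pending", "USD", debit=1000),
    Entry("b1", "merchant_receivable", "USD", credit=1000),
])
# Correct the amount to 8.00 USD by appending a reversal, not by editing b1.
ledger.post([
    Entry("b2", "merchant_receivable", "USD", debit=200),
    Entry("b2", "buyer_cash_pending", "USD", credit=200),
])
```

After both batches, the derived balance for buyer_cash_pending is 800 minor units, and the journal still shows both the original charge and the correction.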

Worked example

Design a payment ledger for a checkout system

A strong candidate would first clarify scope: “Are we processing cards directly or integrating with a provider? Do we need auth/capture/refund, multiple currencies, and merchant accounts? What is the required source of truth for balances?” Then they would declare assumptions: use an external processor for card movement, keep an internal immutable ledger in Postgres, and treat provider callbacks as asynchronous and potentially duplicated.

The answer should be organized around four pillars: API flow, ledger model, idempotency, and reconciliation. For the API flow, POST /payments accepts an idempotency key, creates a payment record, calls the provider, and records state transitions. For the ledger model, every captured payment creates balanced entries, for example debit buyer_cash_pending and credit merchant_receivable, and a later balanced batch moves the amount from pending to settled once the provider confirms settlement. For idempotency, the system stores the original request hash and response, so a retry after timeout cannot create a second charge.
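
The "records state transitions" step can be made concrete with an explicit transition table. The sketch below uses only the states and arrows named under Core knowledge; the State enum, the ALLOWED table, and the transition helper are hypothetical names, and treating a repeated target state as a no-op is one assumed way to absorb duplicate callbacks.

```python
from enum import Enum


class State(Enum):
    CREATED = "CREATED"
    AUTHORIZED = "AUTHORIZED"
    CAPTURED = "CAPTURED"
    SETTLED = "SETTLED"
    FAILED = "FAILED"


# Forward-only transitions; SETTLED and FAILED are terminal.
ALLOWED = {
    State.CREATED: {State.AUTHORIZED, State.FAILED},
    State.AUTHORIZED: {State.CAPTURED},
    State.CAPTURED: {State.SETTLED},
    State.SETTLED: set(),
    State.FAILED: set(),
}


def transition(current: State, target: State) -> State:
    """Return the new state, or raise if the move is not allowed."""
    if target == current:
        # Duplicate webhook or retry reporting the same state: no-op.
        return current
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

Rejecting backward moves such as SETTLED to CAPTURED means a delayed, out-of-order callback surfaces as an error to investigate instead of silently rewriting history.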

A key tradeoff to flag is synchronous versus asynchronous confirmation. Returning only after provider capture gives a simpler client experience but higher latency and timeout risk; returning PENDING with later webhook completion improves resilience but requires clients to handle eventual consistency. A strong close would add: “If I had more time, I would detail refund and chargeback reversal entries, reconciliation jobs against settlement files, and operational dashboards for stuck payments.”

A second angle

Make a payment API safe under retries

Here the framing shifts from full system design to API semantics and write-path correctness. The core idea is still preserving one financial effect per logical request, but the candidate should go deeper on the idempotency table, unique constraints, and transaction boundaries. A good design uses (caller_id, idempotency_key) as a unique key, stores request fingerprint and final response, and makes the payment creation and idempotency record update atomic. The edge case to discuss is concurrent duplicate requests: two identical retries may arrive simultaneously, so the system needs INSERT ... ON CONFLICT, row locks, or a PROCESSING state with safe polling behavior. The same ledger principle applies: even if the API response is retried, the ledger entries must be created exactly once.
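
A minimal sketch of that write path, using Python's built-in sqlite3 as a stand-in for Postgres. The table layout, the create_payment function, and the returned status codes are assumptions for illustration. The provider call is omitted, so the PROCESSING state is not modeled, and the sketch relies on the primary-key constraint raising an error rather than on INSERT ... ON CONFLICT.

```python
import hashlib
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE idempotency (
        caller_id    TEXT NOT NULL,
        idem_key     TEXT NOT NULL,
        request_hash TEXT NOT NULL,
        response     TEXT NOT NULL,
        PRIMARY KEY (caller_id, idem_key)
    )
""")
conn.execute("""
    CREATE TABLE payments (
        id INTEGER PRIMARY KEY, caller_id TEXT, amount INTEGER, currency TEXT
    )
""")


def fingerprint(payload: dict) -> str:
    """Stable hash of the request body (key order must not matter)."""
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()


def create_payment(caller_id: str, idem_key: str, payload: dict) -> tuple[int, dict]:
    req_hash = fingerprint(payload)
    try:
        with conn:  # one transaction: commit on success, roll back on error
            cur = conn.execute(
                "INSERT INTO payments (caller_id, amount, currency) VALUES (?, ?, ?)",
                (caller_id, payload["amount"], payload["currency"]),
            )
            response = {"payment_id": cur.lastrowid, "status": "PENDING"}
            # The unique key makes a second insert for the same key fail,
            # which rolls back the duplicate payment row as well.
            conn.execute(
                "INSERT INTO idempotency VALUES (?, ?, ?, ?)",
                (caller_id, idem_key, req_hash, json.dumps(response)),
            )
        return 201, response
    except sqlite3.IntegrityError:
        stored_hash, stored_response = conn.execute(
            "SELECT request_hash, response FROM idempotency "
            "WHERE caller_id = ? AND idem_key = ?",
            (caller_id, idem_key),
        ).fetchone()
        if stored_hash != req_hash:
            return 409, {"error": "idempotency key reused with a different payload"}
        return 201, json.loads(stored_response)  # replay the original response
```

Because the payment row and the idempotency row commit together, a retry either sees both or neither; there is no window in which a payment exists without its key.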

Common pitfalls

Pitfall: Treating account balance as the source of truth.

A tempting answer is “store balance on the user row and update it after each payment.” That can work as a cache, but it is insufficient for payments because you cannot audit why the balance changed. A stronger answer says balances are derived from immutable ledger entries, with cached balances updated transactionally for performance.
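
One way to picture "cached balances updated transactionally" is the sketch below, again with sqlite3 standing in for Postgres. The accounts and entries tables, the apply_entry helper, and the acct_1 identifier are hypothetical; the conditional UPDATE is the same guard described under Core knowledge, generalized to signed amounts.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
db.execute("""
    CREATE TABLE entries (
        id INTEGER PRIMARY KEY, account_id TEXT NOT NULL, amount INTEGER NOT NULL
    )
""")
db.execute("INSERT INTO accounts VALUES ('acct_1', 0)")
db.commit()


def apply_entry(account_id: str, amount: int) -> bool:
    """Append a signed entry and update the cached balance in one transaction.

    Returns False, writing nothing, if the change would overdraw the account.
    """
    with db:
        cur = db.execute(
            "UPDATE accounts SET balance = balance + ? "
            "WHERE id = ? AND balance + ? >= 0",
            (amount, account_id, amount),
        )
        if cur.rowcount != 1:
            return False  # guard failed: no row updated, no entry appended
        db.execute(
            "INSERT INTO entries (account_id, amount) VALUES (?, ?)",
            (account_id, amount),
        )
    return True


def cached_balance(account_id: str) -> int:
    return db.execute(
        "SELECT balance FROM accounts WHERE id = ?", (account_id,)
    ).fetchone()[0]


def derived_balance(account_id: str) -> int:
    """Canonical balance recomputed from the immutable entries."""
    return db.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM entries WHERE account_id = ?",
        (account_id,),
    ).fetchone()[0]
```

Comparing cached_balance with derived_balance on a schedule is a cheap internal consistency check: any difference means the cache was updated outside the ledger path.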

Pitfall: Saying “use exactly-once messaging” without explaining the write path.

Interviewers will push on what happens when a worker writes to the database and crashes before acknowledging the message. A better answer acknowledges at-least-once delivery, then uses idempotent consumers, unique constraints, outbox events, and replay-safe handlers.
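
The replay-safe handler and outbox ideas can be sketched together. This assumes every message carries a stable message_id; the table names, the handle and relay functions, and the publish callback are hypothetical, and sqlite3 stands in for the real database and broker.

```python
import json
import sqlite3
from typing import Callable

store = sqlite3.connect(":memory:")
store.execute("CREATE TABLE processed_messages (message_id TEXT PRIMARY KEY)")
store.execute("""
    CREATE TABLE ledger_entries (
        id INTEGER PRIMARY KEY, payment_id TEXT NOT NULL, amount INTEGER NOT NULL
    )
""")
store.execute("""
    CREATE TABLE outbox_events (
        id INTEGER PRIMARY KEY, payload TEXT NOT NULL,
        published INTEGER NOT NULL DEFAULT 0
    )
""")


def handle(message_id: str, payment_id: str, amount: int) -> bool:
    """Apply a message at most once; return False for a redelivery."""
    try:
        with store:  # dedupe marker, ledger write, and outbox row commit together
            store.execute("INSERT INTO processed_messages VALUES (?)", (message_id,))
            store.execute(
                "INSERT INTO ledger_entries (payment_id, amount) VALUES (?, ?)",
                (payment_id, amount),
            )
            store.execute(
                "INSERT INTO outbox_events (payload) VALUES (?)",
                (json.dumps({"payment_id": payment_id, "amount": amount}),),
            )
        return True
    except sqlite3.IntegrityError:
        # Already processed: safe to acknowledge the message again.
        return False


def relay(publish: Callable[[str], None]) -> int:
    """Publish pending outbox rows in order, marking each one after it is sent."""
    rows = store.execute(
        "SELECT id, payload FROM outbox_events WHERE published = 0 ORDER BY id"
    ).fetchall()
    for event_id, payload in rows:
        publish(payload)  # a crash right after this line causes a re-publish later
        with store:
            store.execute(
                "UPDATE outbox_events SET published = 1 WHERE id = ?", (event_id,)
            )
    return len(rows)
```

Note that the relay itself is only at-least-once: if it crashes between publishing and marking the row, the event goes out again, which is why downstream consumers need the same deduplication that handle applies here.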

Pitfall: Ignoring external reconciliation.

Many candidates stop after “payment succeeded in our DB.” Real systems must compare internal state against processor settlement records because callbacks can be delayed, dropped, duplicated, or semantically different from final settlement. A strong candidate explicitly designs mismatch detection and manual or automated repair paths.
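
Mismatch detection can start as a keyed comparison between the two sides. The sketch below assumes both sides have already been normalized into records keyed by provider_charge_id; the Record type, the reconcile function, and the three result buckets are illustrative names. Real settlement files also need handling for fees, partial captures, and timing windows, which this leaves out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    provider_charge_id: str
    amount: int      # minor units
    currency: str


def reconcile(internal: list[Record], settlement: list[Record]) -> dict[str, list[str]]:
    """Bucket charge IDs by how the internal and external views disagree."""
    ours = {r.provider_charge_id: r for r in internal}
    theirs = {r.provider_charge_id: r for r in settlement}
    shared = ours.keys() & theirs.keys()
    return {
        # We recorded it, the processor has not settled it (maybe only delayed).
        "missing_in_settlement": sorted(ours.keys() - theirs.keys()),
        # The processor settled something we have no record of.
        "unknown_to_us": sorted(theirs.keys() - ours.keys()),
        # Both sides know the charge but disagree on amount or currency.
        "mismatched": sorted(cid for cid in shared if ours[cid] != theirs[cid]),
    }


report = reconcile(
    internal=[
        Record("ch_1", 1000, "USD"),
        Record("ch_2", 2500, "USD"),
        Record("ch_3", 400, "EUR"),
    ],
    settlement=[
        Record("ch_1", 1000, "USD"),
        Record("ch_2", 2400, "USD"),
        Record("ch_9", 700, "USD"),
    ],
)
```

Each bucket maps to a different repair path: missing items usually wait for the next file before alerting, unknown items need investigation, and mismatches lead to a correcting ledger entry rather than an in-place edit.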

Connections

Interviewers may pivot from here into distributed transactions, saga orchestration, database isolation levels, event-driven architecture, or API design for retries. They may also ask about scaling hot accounts, multi-region consistency, refund workflows, or how to debug a production incident where customers were charged twice.