Distributed Systems Correctness And Idempotency

What's being tested

Interviewers are probing whether you can design distributed systems that remain correct under retries, timeouts, partial failures, duplicate messages, and concurrent updates. For OpenAI-scale systems, this matters anywhere usage, credits, payments, webhooks, job scheduling, or model-serving resources must not be double-counted or silently lost. A strong Software Engineer answer separates user-facing availability from internal correctness guarantees, defines invariants, and explains how APIs, storage, queues, and reconciliation work together. The bar is not “say exactly-once”; it is to know where exactly-once is impossible, where idempotency is sufficient, and where transactional boundaries or ledgers are required.

Core knowledge

Idempotency means applying the same operation multiple times has the same externally visible effect as applying it once. Common API pattern: require an Idempotency-Key, store request hash + response, and return the original result on retry rather than executing again.
Exactly-once semantics are usually an end-to-end illusion built from at-least-once delivery, deduplication, and atomic state transitions. Networks duplicate and drop messages; correctness comes from durable identifiers, transactional writes, and consumers that tolerate replay.
At-least-once delivery is the practical default for queues like Kafka, SQS, or Pub/Sub: producers retry until acked, consumers may reprocess after crash. Every message handler that mutates state should be idempotent or guarded by a unique operation record.
At-most-once delivery avoids duplicates by not retrying after uncertainty, but risks data loss. It is acceptable for telemetry or best-effort notifications, not for payments, GPU credit accounting, quota enforcement, or ledger mutations.
Deduplication keys must be scoped carefully: merchant_id + idempotency_key for payments, tenant_id + job_id + allocation_epoch for credits, or source_node + seq_no for messaging. Include a TTL only if the business can tolerate replay after expiry.
Transactional outbox avoids the dual-write problem where a service updates Postgres but crashes before publishing to Kafka. Write the domain row and outbox_events row in one database transaction; a relay publishes and marks sent, while consumers dedupe by event id.
Inbox tables are the consumer-side companion to outbox. Insert message_id into processed_messages in the same transaction as the business mutation; if the insert conflicts on a unique constraint, skip the duplicate safely.
Compare-and-swap and optimistic concurrency control protect shared resources. Use version columns, conditional updates, or etcd leases: UPDATE credits SET balance = balance - x, version = version + 1 WHERE id = ? AND version = ? AND balance >= x.
Ledger models are better than mutable balances for money and usage accounting. Append immutable debit/credit entries with a unique operation_id; derived balance is $\sum credits - \sum debits$ . For high read volume, maintain snapshots but treat the ledger as the source of truth.
Consistency level should match the invariant. Strong consistency is needed for “never overspend balance”; eventual consistency is fine for dashboards or webhook delivery status. A common design uses strongly consistent writes on a small critical path and async fanout for side effects.
Reconciliation is part of correctness, not an afterthought. Build jobs that compare internal ledgers with payment provider settlements, GPU allocator reservations with actual usage, or webhook delivery attempts with terminal states. Track metrics like duplicate rate, retry rate, reconciliation deltas, and stuck operations.
Failure-state machines make distributed operations explainable. Payment states might be CREATED → AUTHORIZED → CAPTURED → SETTLED or FAILED; transitions are monotonic, persisted, and tied to external provider ids. Avoid ambiguous “pending” states without timeout and recovery behavior.

Worked example

For Design a scalable payment processor, a strong candidate should first ask what “payment processor” owns: payment intent APIs, provider integrations, ledgering, refunds, webhooks, and reconciliation, but not card network internals. They should clarify scale, currencies, latency targets, whether multiple payment providers are needed, and the key invariant: a user request must not charge the customer twice, and the internal ledger must match external settlement.

A clean answer can be organized around four pillars. First, define the API: POST /payments requires Idempotency-Key, amount, currency, merchant/user id, and returns a stable payment_id; retries with the same key return the same result if the request body hash matches. Second, model storage: payments, payment_attempts, ledger_entries, and outbox_events in a strongly consistent store like Postgres or a sharded SQL system, with unique constraints on idempotency_key and operation_id.

Third, describe the execution path: create the payment intent transactionally, call the external provider with a provider idempotency key, persist provider response, append ledger entries only once, and publish events through an outbox. Fourth, cover async reliability: webhook ingestion from providers is also idempotent, consumers use inbox dedupe, and reconciliation jobs compare provider settlement files against internal ledger state.

One tradeoff to flag explicitly is synchronous versus asynchronous capture. Synchronous capture gives a simpler user experience but couples availability to provider latency; asynchronous capture improves resilience but requires a state machine and polling/webhook recovery. A strong close would be: “If I had more time, I’d go deeper on shard keys for merchant-scale isolation, PCI tokenization boundaries, refund/chargeback state machines, and operational runbooks for provider outages.”

A second angle

For Implement node messaging and path discovery, the same correctness principles show up without money or ledgers. Each node should attach a message id, source id, and sequence number so adjacent nodes can retry sends without causing duplicate processing. Path discovery needs a visited set or distance table keyed by node/message to avoid loops, and the protocol should define what happens when links fail, messages arrive out of order, or acknowledgments are lost.

The framing differs because the core invariant is not “money is conserved” but “eventually discover reachable paths without infinite forwarding or false confirmation.” You would talk more about graph traversal, TTL/hop limits, ack/nack behavior, and message complexity, e.g. flooding can be $O(E)$ per discovery while maintaining per-destination routing state reduces repeated work. The transferable idea is the same: stable operation identifiers plus monotonic state transitions beat hoping the network behaves.

Common pitfalls

Pitfall: Claiming “use Kafka for exactly-once” as if it solves end-to-end correctness.

Kafka transactions can help within Kafka producer/consumer workflows, but they do not make an external payment API, database write, email send, or GPU allocation exactly-once. Say what the atomic boundary is, then explain dedupe and recovery outside that boundary.

Pitfall: Treating idempotency as just “retry the same request.”

Retries without a persisted idempotency record can still double-charge, double-debit, or double-allocate if the first attempt succeeded but the response was lost. A better answer names the unique key, the storage table, the request-hash check, the response replay behavior, and the TTL or retention policy.

Pitfall: Designing only the happy path.

A shallow answer says “service writes to DB, publishes event, worker processes it.” A stronger answer injects failures: DB commit succeeds but publish fails, worker crashes after side effect, provider webhook arrives before local request completes, two clients race to spend the same credit balance, or reconciliation finds a mismatch.

Connections

Interviewers often pivot from this topic into distributed transactions, consensus, rate limiting, resource scheduling, or observability. Be ready to discuss 2PC versus sagas, Raft/Paxos-backed coordination, token buckets for quota enforcement, and operational metrics like p99 latency, duplicate rate, retry storm rate, and reconciliation lag.

What's being tested

Core knowledge

Worked example

A second angle

Common pitfalls

Connections

Further reading

Practice questions

Related concepts