Distributed Systems Correctness And Idempotency
Asked of: Software Engineer
Last updated
What's being tested
Interviewers are probing whether you can design distributed systems that remain correct under retries, timeouts, partial failures, duplicate messages, and concurrent updates. For OpenAI-scale systems, this matters anywhere usage, credits, payments, webhooks, job scheduling, or model-serving resources must not be double-counted or silently lost. A strong Software Engineer answer separates user-facing availability from internal correctness guarantees, defines invariants, and explains how APIs, storage, queues, and reconciliation work together. The bar is not “say exactly-once”; it is to know where exactly-once is impossible, where idempotency is sufficient, and where transactional boundaries or ledgers are required.
Core knowledge
-
Idempotency means applying the same operation multiple times has the same externally visible effect as applying it once. Common API pattern: require an
Idempotency-Key, store request hash + response, and return the original result on retry rather than executing again. -
Exactly-once semantics are usually an end-to-end illusion built from at-least-once delivery, deduplication, and atomic state transitions. Networks duplicate and drop messages; correctness comes from durable identifiers, transactional writes, and consumers that tolerate replay.
-
At-least-once delivery is the practical default for queues like
Kafka,SQS, orPub/Sub: producers retry until acked, consumers may reprocess after crash. Every message handler that mutates state should be idempotent or guarded by a unique operation record. -
At-most-once delivery avoids duplicates by not retrying after uncertainty, but risks data loss. It is acceptable for telemetry or best-effort notifications, not for payments, GPU credit accounting, quota enforcement, or ledger mutations.
-
Deduplication keys must be scoped carefully:
merchant_id + idempotency_keyfor payments,tenant_id + job_id + allocation_epochfor credits, orsource_node + seq_nofor messaging. Include a TTL only if the business can tolerate replay after expiry. -
Transactional outbox avoids the dual-write problem where a service updates
Postgresbut crashes before publishing toKafka. Write the domain row andoutbox_eventsrow in one database transaction; a relay publishes and marks sent, while consumers dedupe by event id. -
Inbox tables are the consumer-side companion to outbox. Insert
message_idintoprocessed_messagesin the same transaction as the business mutation; if the insert conflicts on a unique constraint, skip the duplicate safely. -
Compare-and-swap and optimistic concurrency control protect shared resources. Use
versioncolumns, conditional updates, oretcdleases:UPDATE credits SET balance = balance - x, version = version + 1 WHERE id = ? AND version = ? AND balance >= x. -
Ledger models are better than mutable balances for money and usage accounting. Append immutable debit/credit entries with a unique
operation_id; derived balance is . For high read volume, maintain snapshots but treat the ledger as the source of truth. -
Consistency level should match the invariant. Strong consistency is needed for “never overspend balance”; eventual consistency is fine for dashboards or webhook delivery status. A common design uses strongly consistent writes on a small critical path and async fanout for side effects.
-
Reconciliation is part of correctness, not an afterthought. Build jobs that compare internal ledgers with payment provider settlements, GPU allocator reservations with actual usage, or webhook delivery attempts with terminal states. Track metrics like duplicate rate, retry rate, reconciliation deltas, and stuck operations.
-
Failure-state machines make distributed operations explainable. Payment states might be
CREATED → AUTHORIZED → CAPTURED → SETTLEDorFAILED; transitions are monotonic, persisted, and tied to external provider ids. Avoid ambiguous “pending” states without timeout and recovery behavior.
Worked example
For Design a scalable payment processor, a strong candidate should first ask what “payment processor” owns: payment intent APIs, provider integrations, ledgering, refunds, webhooks, and reconciliation, but not card network internals. They should clarify scale, currencies, latency targets, whether multiple payment providers are needed, and the key invariant: a user request must not charge the customer twice, and the internal ledger must match external settlement.
A clean answer can be organized around four pillars. First, define the API: POST /payments requires Idempotency-Key, amount, currency, merchant/user id, and returns a stable payment_id; retries with the same key return the same result if the request body hash matches. Second, model storage: payments, payment_attempts, ledger_entries, and outbox_events in a strongly consistent store like Postgres or a sharded SQL system, with unique constraints on idempotency_key and operation_id.
Third, describe the execution path: create the payment intent transactionally, call the external provider with a provider idempotency key, persist provider response, append ledger entries only once, and publish events through an outbox. Fourth, cover async reliability: webhook ingestion from providers is also idempotent, consumers use inbox dedupe, and reconciliation jobs compare provider settlement files against internal ledger state.
One tradeoff to flag explicitly is synchronous versus asynchronous capture. Synchronous capture gives a simpler user experience but couples availability to provider latency; asynchronous capture improves resilience but requires a state machine and polling/webhook recovery. A strong close would be: “If I had more time, I’d go deeper on shard keys for merchant-scale isolation, PCI tokenization boundaries, refund/chargeback state machines, and operational runbooks for provider outages.”
A second angle
For Implement node messaging and path discovery, the same correctness principles show up without money or ledgers. Each node should attach a message id, source id, and sequence number so adjacent nodes can retry sends without causing duplicate processing. Path discovery needs a visited set or distance table keyed by node/message to avoid loops, and the protocol should define what happens when links fail, messages arrive out of order, or acknowledgments are lost.
The framing differs because the core invariant is not “money is conserved” but “eventually discover reachable paths without infinite forwarding or false confirmation.” You would talk more about graph traversal, TTL/hop limits, ack/nack behavior, and message complexity, e.g. flooding can be per discovery while maintaining per-destination routing state reduces repeated work. The transferable idea is the same: stable operation identifiers plus monotonic state transitions beat hoping the network behaves.
Common pitfalls
Pitfall: Claiming “use
Kafkafor exactly-once” as if it solves end-to-end correctness.
Kafka transactions can help within Kafka producer/consumer workflows, but they do not make an external payment API, database write, email send, or GPU allocation exactly-once. Say what the atomic boundary is, then explain dedupe and recovery outside that boundary.
Pitfall: Treating idempotency as just “retry the same request.”
Retries without a persisted idempotency record can still double-charge, double-debit, or double-allocate if the first attempt succeeded but the response was lost. A better answer names the unique key, the storage table, the request-hash check, the response replay behavior, and the TTL or retention policy.
Pitfall: Designing only the happy path.
A shallow answer says “service writes to DB, publishes event, worker processes it.” A stronger answer injects failures: DB commit succeeds but publish fails, worker crashes after side effect, provider webhook arrives before local request completes, two clients race to spend the same credit balance, or reconciliation finds a mismatch.
Connections
Interviewers often pivot from this topic into distributed transactions, consensus, rate limiting, resource scheduling, or observability. Be ready to discuss 2PC versus sagas, Raft/Paxos-backed coordination, token buckets for quota enforcement, and operational metrics like p99 latency, duplicate rate, retry storm rate, and reconciliation lag.
Further reading
-
Designing Data-Intensive Applications — Martin Kleppmann’s chapters on replication, transactions, and stream processing are the best single source for practical distributed correctness tradeoffs.
-
Life Beyond Distributed Transactions: An Apostate’s Opinion — Pat Helland’s classic paper on why business processes need identifiers, retries, and uncertainty-tolerant workflows.
-
Stripe API Idempotent Requests — concrete production API pattern for idempotency keys, request replay, and retry-safe payment creation.
Practice questions
- Design a scalable payment systemOpenAI · Software Engineer · Technical Screen · medium
- Design credit balance with vector-clock expirationsOpenAI · Software Engineer · Technical Screen · hard
- Design GPU credit allocatorOpenAI · Software Engineer · Technical Screen · hard
- Design an in-memory key-value databaseOpenAI · Software Engineer · Technical Screen · hard
- Design a GPU credit allocation serviceOpenAI · Software Engineer · Technical Screen · hard
- Implement node messaging and path discoveryOpenAI · Software Engineer · Technical Screen · Medium
- Design a scalable payment processorOpenAI · Software Engineer · Technical Screen · hard
- Design webhook, POI, chat, CI/CD, paymentsOpenAI · Software Engineer · Onsite · medium
Related concepts
- Distributed Systems Consistency And Low-Latency DesignSystem Design
- Idempotency And Concurrency ControlSystem Design
- Distributed Storage, Replication, and ConsistencySystem Design
- Distributed Systems Consistency, Reliability, And ObservabilitySystem Design
- Fault Tolerance, Idempotency, And Concurrency ControlSystem Design
- Distributed Systems FundamentalsCoding & Algorithms