Fault Tolerance, Idempotency, And Concurrency Control

What's being tested

Airbnb-style systems need to stay correct when requests are duplicated, services fail mid-operation, users race for scarce inventory, and background jobs retry hours later. Interviewers are probing whether you can design fault-tolerant APIs, idempotent workflows, and concurrency control mechanisms without hand-waving “the database handles it.” For a Software Engineer, the key skill is choosing where correctness must be strict, where eventual consistency is acceptable, and how to make failure modes observable and recoverable. Expect follow-ups around double-booking, duplicate notifications, retry storms, partial writes, message ordering, and hot-key contention.

Core knowledge

Idempotency means the same logical request can be executed multiple times with the same externally visible result. Common API pattern: require an Idempotency-Key, store (caller_id, key, request_hash, response, status), and return the cached response for safe duplicates.
At-least-once delivery is the default assumption for queues, webhooks, and retrying clients. If a worker reads from SQS, Kafka, or a task queue and crashes after side effects but before acking, the job may run again; consumers must deduplicate.
Exactly-once semantics are usually built from weaker primitives: atomic database writes, unique constraints, idempotency records, and deterministic retry behavior. Avoid promising “exactly once” globally; instead say “effectively once for this side effect under this key.”
Optimistic concurrency control uses a version column, compare-and-swap, or conditional update: UPDATE rows SET version = version + 1 WHERE id = ? AND version = ?. If zero rows update, reload and retry. This works well when conflicts are rare.
Pessimistic locking uses row locks like SELECT ... FOR UPDATE in Postgres or distributed leases in Redis/ZooKeeper. It is simpler for scarce inventory but can reduce throughput, create lock waits, and fail under hot-key traffic.
Unique constraints are one of the strongest correctness tools. For a waitlist, a unique index on (listing_id, date, user_id) prevents duplicate joins; for notifications, (event_id, user_id, channel) prevents duplicate sends from concurrent workers.
Retry policy should classify errors: retry 5xx, timeouts, 429, transient network resets; do not blindly retry 400, validation failures, auth errors, or non-idempotent payment-like side effects. Use exponential backoff with jitter: delay = random(0, min(cap, base * 2^attempt)).
Deadlines and cancellation prevent retry wrappers from exceeding user-visible latency budgets. Track both per-attempt timeout and total deadline: if p95 service latency is 200ms and caller budget is 1s, three attempts with backoff may already be too expensive.
Transactional outbox solves the “database commit succeeded but message publish failed” problem. Write business state and an outbox_events row in one Postgres transaction; a relay publishes to Kafka/queue and marks the row sent idempotently.
Ordering guarantees are scoped, not global. A group chat may need per-conversation ordering using a monotonically increasing message_seq; notifications may only need per-user de-dupe, while waitlists need FIFO per (listing_id, date).
Hot partitions happen when many users target the same listing, chat, or notification campaign. Mitigations include sharding by composite key, queueing per entity, admission control, rate limits, caching reads, and moving expensive fan-out to async workers.
Observability must expose correctness and resilience signals: retry rate, duplicate suppression count, lock wait time, conflict retry count, queue lag, DLQ depth, notification delivery status, idempotency cache hit rate, and p99 latency under contention.

Worked example

For Design a hot-listing waitlist API, a strong first 30 seconds would clarify whether users can join multiple dates, whether ordering must be strict FIFO, whether a user can cancel, and what happens when inventory opens. They should declare an assumption like: “Correctness matters more than raw throughput for a single listing-date; I’ll design per (listing_id, date) ordering and prevent duplicate joins.” The answer can be organized around four pillars: API shape, data model, concurrency control, and async notification/retry handling. The API might expose POST /listings/{id}/dates/{date}/waitlist with an Idempotency-Key, returning the existing waitlist position if the same user retries. The data model should include a waitlist_entries table with unique (listing_id, date, user_id), a created_at or sequence for ordering, and status values like WAITING, OFFERED, EXPIRED, BOOKED, CANCELLED. For concurrency, use a database transaction and unique constraints for joins; when inventory opens, select the next eligible entries with FOR UPDATE SKIP LOCKED or a per-listing-date worker to avoid two users receiving the same offer. A tradeoff to flag explicitly: strict FIFO with database locks is simple and correct, but hot listings may need partitioned queues or single-flight processing per listing-date to reduce lock contention. Close by saying that with more time, you would cover expiration timers, notification retries, abuse/rate limits, and operational metrics like duplicate-join attempts and offer acceptance latency.

A second angle

For Design a multi-channel notification system, the same ideas apply, but scarce inventory is replaced by side-effect control across email, push, and SMS vendors. Here the main danger is duplicate or out-of-order sends: a retrying worker might send two SMS messages unless the system records (event_id, user_id, channel) before or atomically with dispatch. You would use an idempotent notification request API, persist a notification intent, fan out to channel-specific jobs, and make each sender idempotent against provider retries and internal retries. The consistency requirement is usually weaker than waitlist FIFO: eventual delivery is acceptable, but duplicate suppression, rate limiting, user preferences, and auditability are critical. A good tradeoff discussion compares synchronous sends for low latency against async queue-based sends for reliability and backpressure.

Common pitfalls

Pitfall: Treating retries as a universal fix.

A weak answer says “if the request fails, retry three times” without defining retryable errors, deadlines, jitter, or idempotency. A stronger answer separates transient failures from permanent failures, applies exponential backoff with jitter, and explains how duplicate attempts are made safe.

Pitfall: Assuming database transactions solve cross-system side effects.

A transaction can atomically update Postgres, but it cannot atomically send an email, push to a vendor, and commit local state unless you add a pattern like transactional outbox or idempotent send records. Say exactly where atomicity ends and how recovery jobs reconcile incomplete work.

Pitfall: Over-optimizing scalability before proving correctness.

For hot waitlists or chat ordering, jumping immediately to sharded distributed queues can obscure the core invariant. Start with the invariant: “one offer per inventory unit,” “one message sequence per conversation,” or “one notification per event-channel-user,” then scale the implementation while preserving it.

Connections

Interviewers may pivot from this topic into API design, database schema design, distributed queues, rate limiting, or real-time fan-out. They may also ask for failure-mode analysis: what happens if the worker crashes after sending, if the database commit succeeds but the queue publish fails, or if two clients submit the same action concurrently.

What's being tested

Core knowledge

Worked example

A second angle

Common pitfalls

Connections

Further reading

Featured in interview prep guides

Practice questions

Related concepts