Fault Tolerance, Idempotency, And Concurrency Control
Asked of: Software Engineer
Last updated

What's being tested
Airbnb-style systems need to stay correct when requests are duplicated, services fail mid-operation, users race for scarce inventory, and background jobs retry hours later. Interviewers are probing whether you can design fault-tolerant APIs, idempotent workflows, and concurrency control mechanisms without hand-waving “the database handles it.” For a Software Engineer, the key skill is choosing where correctness must be strict, where eventual consistency is acceptable, and how to make failure modes observable and recoverable. Expect follow-ups around double-booking, duplicate notifications, retry storms, partial writes, message ordering, and hot-key contention.
Core knowledge
-
Idempotency means the same logical request can be executed multiple times with the same externally visible result. Common API pattern: require an
Idempotency-Key, store(caller_id, key, request_hash, response, status), and return the cached response for safe duplicates. -
At-least-once delivery is the default assumption for queues, webhooks, and retrying clients. If a worker reads from
SQS,Kafka, or a task queue and crashes after side effects but before acking, the job may run again; consumers must deduplicate. -
Exactly-once semantics are usually built from weaker primitives: atomic database writes, unique constraints, idempotency records, and deterministic retry behavior. Avoid promising “exactly once” globally; instead say “effectively once for this side effect under this key.”
-
Optimistic concurrency control uses a version column, compare-and-swap, or conditional update:
UPDATE rows SET version = version + 1 WHERE id = ? AND version = ?. If zero rows update, reload and retry. This works well when conflicts are rare. -
Pessimistic locking uses row locks like
SELECT ... FOR UPDATEinPostgresor distributed leases inRedis/ZooKeeper. It is simpler for scarce inventory but can reduce throughput, create lock waits, and fail under hot-key traffic. -
Unique constraints are one of the strongest correctness tools. For a waitlist, a unique index on
(listing_id, date, user_id)prevents duplicate joins; for notifications,(event_id, user_id, channel)prevents duplicate sends from concurrent workers. -
Retry policy should classify errors: retry
5xx, timeouts,429, transient network resets; do not blindly retry400, validation failures, auth errors, or non-idempotent payment-like side effects. Use exponential backoff with jitter:delay = random(0, min(cap, base * 2^attempt)). -
Deadlines and cancellation prevent retry wrappers from exceeding user-visible latency budgets. Track both per-attempt timeout and total deadline: if
p95service latency is200msand caller budget is1s, three attempts with backoff may already be too expensive. -
Transactional outbox solves the “database commit succeeded but message publish failed” problem. Write business state and an
outbox_eventsrow in onePostgrestransaction; a relay publishes toKafka/queue and marks the row sent idempotently. -
Ordering guarantees are scoped, not global. A group chat may need per-conversation ordering using a monotonically increasing
message_seq; notifications may only need per-user de-dupe, while waitlists need FIFO per(listing_id, date). -
Hot partitions happen when many users target the same listing, chat, or notification campaign. Mitigations include sharding by composite key, queueing per entity, admission control, rate limits, caching reads, and moving expensive fan-out to async workers.
-
Observability must expose correctness and resilience signals: retry rate, duplicate suppression count, lock wait time, conflict retry count, queue lag,
DLQdepth, notification delivery status, idempotency cache hit rate, andp99latency under contention.
Worked example
For Design a hot-listing waitlist API, a strong first 30 seconds would clarify whether users can join multiple dates, whether ordering must be strict FIFO, whether a user can cancel, and what happens when inventory opens. They should declare an assumption like: “Correctness matters more than raw throughput for a single listing-date; I’ll design per (listing_id, date) ordering and prevent duplicate joins.” The answer can be organized around four pillars: API shape, data model, concurrency control, and async notification/retry handling. The API might expose POST /listings/{id}/dates/{date}/waitlist with an Idempotency-Key, returning the existing waitlist position if the same user retries. The data model should include a waitlist_entries table with unique (listing_id, date, user_id), a created_at or sequence for ordering, and status values like WAITING, OFFERED, EXPIRED, BOOKED, CANCELLED. For concurrency, use a database transaction and unique constraints for joins; when inventory opens, select the next eligible entries with FOR UPDATE SKIP LOCKED or a per-listing-date worker to avoid two users receiving the same offer. A tradeoff to flag explicitly: strict FIFO with database locks is simple and correct, but hot listings may need partitioned queues or single-flight processing per listing-date to reduce lock contention. Close by saying that with more time, you would cover expiration timers, notification retries, abuse/rate limits, and operational metrics like duplicate-join attempts and offer acceptance latency.
A second angle
For Design a multi-channel notification system, the same ideas apply, but scarce inventory is replaced by side-effect control across email, push, and SMS vendors. Here the main danger is duplicate or out-of-order sends: a retrying worker might send two SMS messages unless the system records (event_id, user_id, channel) before or atomically with dispatch. You would use an idempotent notification request API, persist a notification intent, fan out to channel-specific jobs, and make each sender idempotent against provider retries and internal retries. The consistency requirement is usually weaker than waitlist FIFO: eventual delivery is acceptable, but duplicate suppression, rate limiting, user preferences, and auditability are critical. A good tradeoff discussion compares synchronous sends for low latency against async queue-based sends for reliability and backpressure.
Common pitfalls
Pitfall: Treating retries as a universal fix.
A weak answer says “if the request fails, retry three times” without defining retryable errors, deadlines, jitter, or idempotency. A stronger answer separates transient failures from permanent failures, applies exponential backoff with jitter, and explains how duplicate attempts are made safe.
Pitfall: Assuming database transactions solve cross-system side effects.
A transaction can atomically update Postgres, but it cannot atomically send an email, push to a vendor, and commit local state unless you add a pattern like transactional outbox or idempotent send records. Say exactly where atomicity ends and how recovery jobs reconcile incomplete work.
Pitfall: Over-optimizing scalability before proving correctness.
For hot waitlists or chat ordering, jumping immediately to sharded distributed queues can obscure the core invariant. Start with the invariant: “one offer per inventory unit,” “one message sequence per conversation,” or “one notification per event-channel-user,” then scale the implementation while preserving it.
Connections
Interviewers may pivot from this topic into API design, database schema design, distributed queues, rate limiting, or real-time fan-out. They may also ask for failure-mode analysis: what happens if the worker crashes after sending, if the database commit succeeds but the queue publish fails, or if two clients submit the same action concurrently.
Further reading
-
Designing Data-Intensive Applications — Martin Kleppmann’s book is the best practical foundation for transactions, replication, ordering, and exactly-once tradeoffs.
-
Exponential Backoff and Jitter — AWS’s canonical explanation of why jitter prevents synchronized retry storms.
-
Stripe Idempotent Requests — clear production example of idempotency keys, cached responses, and duplicate request handling.
Featured in interview prep guides
Practice questions
- Design a Scalable Job SchedulerAirbnb · Software Engineer · Onsite · none
- Design a group chat systemAirbnb · Software Engineer · Onsite · hard
- Design an extensible request RetryerAirbnb · Software Engineer · Technical Screen · medium
- Design a multi-channel notification systemAirbnb · Software Engineer · Technical Screen · hard
- Implement retry wrapper and interdependent validatorsAirbnb · Software Engineer · Onsite · hard
- Design and implement an Airbnb walletAirbnb · Software Engineer · Technical Screen · medium
- Design a hot-listing waitlist APIAirbnb · Software Engineer · Onsite · hard
Related concepts
- Idempotency And Concurrency ControlSystem Design
- Distributed Systems Correctness And IdempotencySystem Design
- Distributed Systems Consistency And Low-Latency DesignSystem Design
- Concurrency Control And Thread SafetySystem Design
- Concurrency ControlSystem Design
- Distributed Systems Consistency, Reliability, And ObservabilitySystem Design