Concurrency Control And Thread Safety

What's being tested

Interviewers are probing whether you can reason about shared mutable state under concurrent access without hand-waving away correctness. For Databricks, this matters because storage engines, metadata services, caching layers, and distributed execution components routinely serve many readers and writers while preserving durability, isolation, and predictable latency. A strong answer separates safety properties, like no lost updates or corrupted invariants, from liveness properties, like no deadlock, starvation, or unbounded blocking. The interviewer is usually looking for clear synchronization boundaries, explicit invariants, failure-mode analysis, and tradeoffs between coarse locking, fine-grained locking, lock-free techniques, and versioned designs.

Core knowledge

Thread safety means every public operation preserves the data structure’s invariants under arbitrary valid interleavings. Define the invariant first: queue size is always 0 <= size <= capacity, a key-value store index always points to a durable record, or a snapshot iterator observes a stable logical version.
Race conditions happen when correctness depends on timing. The classic read-modify-write bug is x = x + 1: two threads can read the same old value and lose one increment. Fixes include mutexes, atomic compare-and-swap, sharding counters, or redesigning the operation to be idempotent.
Linearizability is the usual correctness target for concurrent in-memory APIs: every operation appears to take effect atomically at some instant between call and return. For example, put(k, v) in a key-value store should have a clear linearization point, often the lock-protected index update or successful CAS.
Condition variables solve blocking coordination, not mutual exclusion by themselves. A bounded queue typically uses one mutex plus notEmpty and notFull; calls to wait() must occur in a while loop because of spurious wakeups and because another thread may consume the condition first.
Backpressure is a correctness and stability tool. A bounded queue should define behavior for enqueue: block forever, block until timeout, return false, or throw. Capacity prevents unbounded memory growth, but poor wakeup policy can cause convoying or starvation under high producer/consumer contention.
Lock granularity controls contention. A single global lock is simple and often sufficient up to moderate concurrency; striped locking hashes keys across, say, 64 or 256 locks to improve throughput. Fine-grained locks reduce contention but increase deadlock risk, complexity, and cache-coherence overhead.
Deadlock prevention requires a consistent lock order or avoiding nested locks. If a range cache locks chunk metadata and then an in-flight request map, every code path must acquire them in the same order. Timeouts detect symptoms but do not prove safety.
Read-write locks help only when reads dominate and read sections are nontrivial. They can hurt when writes are frequent, readers are short, or implementation favors readers and starves writers. For highly read-heavy maps, copy-on-write or RCU-style versioned snapshots may be cleaner.
MVCC and snapshot isolation let readers avoid blocking writers by reading immutable versions. A snapshotable set can store insert/delete version numbers per element, so an iterator at version v returns elements where created <= v < deleted. This trades memory and garbage collection complexity for stable iteration.
Write-ahead logging is the foundation of durability in a key-value store. The rule is: append the intent or record to the WAL and fsync as required before exposing the update in the in-memory index. Recovery replays committed log records; torn writes need checksums, lengths, and record boundaries.
Atomicity across memory and disk is subtle. If put(k, v) updates the in-memory index before the log is durable, a crash can expose acknowledged data that cannot be recovered. If the log is durable before the index update, recovery is safe because replay can rebuild the index.
In-flight request deduplication is a concurrency-control pattern for caches. For a range-aware file cache, maintain a map like (fileId, chunkId) -> Future; the first caller fetches the chunk, and later callers await the same future. On failure, remove the future so retries are possible.

Worked example

For Design a thread-safe bounded queue, a strong candidate starts by clarifying the API: enqueue(item, timeout), dequeue(timeout), size(), shutdown semantics, whether fairness is required, and whether null items are allowed. They would state assumptions early: fixed capacity, multiple producers and consumers, blocking operations with timeout, and linearizable behavior for successful operations. The answer can be organized around four pillars: internal state, synchronization strategy, blocking semantics, and edge cases.

The internal state is a circular buffer with head, tail, and count, where the invariant is 0 <= count <= capacity. The synchronization strategy is one mutex protecting all three fields, plus two condition variables: notFull for producers and notEmpty for consumers. enqueue waits in a while count == capacity loop, computes remaining timeout using a monotonic clock, inserts at tail, increments count, and signals notEmpty; dequeue mirrors this and signals notFull. The candidate should explicitly flag the tradeoff between signal() and broadcast(): signal() avoids thundering-herd wakeups, while broadcast() can be useful for shutdown or complex predicates.

A good answer also covers size(): either acquire the same mutex for exact linearizable size, or expose approximate size via an atomic counter if the API allows it. Fairness should be discussed separately from correctness; FIFO item order does not imply FIFO thread scheduling. To close, say that with more time you would add tests using randomized producer/consumer schedules, timeout boundary cases, interruption/shutdown behavior, and stress tests under tools like ThreadSanitizer or jcstress.

A second angle

For Design a durable key-value store, the same concept appears at the boundary between concurrent operations and crash recovery. Instead of only protecting an in-memory invariant like queue size, you must preserve a cross-layer invariant: acknowledged writes must survive process or machine crashes. A simple design uses a global write mutex around append WAL -> fsync policy -> update memtable/index, while concurrent readers use a read lock or immutable memtable snapshot. The key framing difference is that “thread-safe” is not enough: an update that is race-free in memory can still be incorrect if the crash happens after index mutation but before durable log persistence. The interviewer may push you toward higher throughput, where you discuss group commit, lock striping by key, immutable segments, and compaction coordination.

Common pitfalls

Pitfall: Treating volatile or atomics as a universal replacement for locks.

Atomic visibility does not protect compound invariants like head, tail, and count moving together. A better answer says exactly which operations need mutual exclusion, which can use atomics safely, and where the linearization point is.

Pitfall: Describing a lock but not the waiting protocol.

For blocking structures, “use a mutex” is incomplete. You need to specify condition predicates, while-based waits, timeout recomputation, wakeup signaling, and what happens on shutdown, interruption, or cancellation.

Pitfall: Optimizing before proving correctness.

Jumping straight to lock-free queues, fine-grained range locks, or custom MVCC can sound sophisticated but often hides missing invariants. Start with a correct coarse-grained design, name its bottleneck, then evolve to striped locks, immutable snapshots, or in-flight future deduplication only when the workload justifies it.

Connections

Interviewers can pivot from this topic into storage engine design, especially WALs, memtables, compaction, and crash recovery. They may also move toward distributed concurrency control, including leases, optimistic concurrency, idempotency keys, and consensus-backed metadata updates. For coding-heavy follow-ups, expect implementation details around ReentrantLock, synchronized, std::mutex, Condition, Semaphore, or atomic CAS loops.

What's being tested

Core knowledge

Worked example

A second angle

Common pitfalls

Connections

Further reading

Featured in interview prep guides

Practice questions

Related concepts