Durable Key-Value Stores And Caches
Asked of: Software Engineer
Last updated

What's being tested
Databricks is probing whether you can design storage-backed, concurrent key-value systems with clear APIs, predictable failure behavior, and justified performance tradeoffs. A strong answer covers both the data path (get, put, delete) and the failure path: crash recovery, partial writes, fsync semantics, corruption, and concurrent mutations. Interviewers are not looking for a production clone of `RocksDB`; they want to see whether you can reason from first principles about durability, in-memory indexing, eviction, synchronization, and time-windowed counters. This matters for a Software Engineer at Databricks because many platform components depend on local metadata stores, caches, execution-state stores, and high-QPS services where correctness under concurrency is as important as throughput.
Core knowledge
-
API shape should be explicit before internals:
get(key) -> Optional<Value>,put(key, value),delete(key), optionalcompareAndSwap(key, expectedVersion, newValue),ttl,scan(prefix), andflush. Clarify whether operations are single-key atomic, whether values are bytes or generic typed objects, and whether reads after writes must be immediately visible. -
Durability usually starts with a write-ahead log: append mutation records to disk before applying them to in-memory state. On restart, rebuild the in-memory map by replaying the log in order. The key invariant is: if
put(k,v)is acknowledged, the log record must survive a crash, typically requiringfsyncor group commit. -
Crash recovery requires handling torn writes and partial records. A common record format is
[magic][length][sequence_number][op][key][value][crc32]. During recovery, stop at the first invalid checksum or incomplete length-delimited record. Sequence numbers help resolve duplicate replay and preserve last-writer-wins ordering. -
On-disk layout tradeoffs are central. An append-only log gives fast writes, but unbounded disk growth; a hash index in memory maps keys to file offsets for fast reads. Periodic compaction rewrites only the latest value for each key and drops tombstones older than any active reader.
-
LSM-tree designs, like
`LevelDB`and`RocksDB`, use a mutable memtable, immutable sorted runs, and background compaction. They optimize write throughput and range scans but introduce read amplification and compaction stalls. A simple interview design can mention LSM as an extension, not start there unless range scans or large data volumes require it. -
B-tree designs, like many database indexes, update pages in place and are good for read-heavy workloads and range queries. They require careful page-level crash safety through journaling, copy-on-write, or WAL redo/undo. For an embeddable KV store with simple point lookups, append-only plus compaction is often easier to reason about.
-
Concurrency control must identify shared mutable state: the map, log file offset, LRU list, TTL heap, hit counters, and compaction metadata. A simple design uses a
ReadWriteLock: concurrentgets acquire read locks;put/deleteacquire write locks. Higher-throughput designs use lock striping by key hash plus a separate serialized log append path. -
Race conditions often come from compound operations: “check then insert,” LRU update during
get, TTL expiration while a writer updates the same key, or computing`QPS`while requests are being recorded. The fix is to define the linearization point: forput, it might be after WAL append succeeds and before the map pointer is swapped. -
Persistent cache design combines eviction policy with durability. A typical in-memory cache has
HashMap<Key, Node>plus a doubly linked list for LRU, givingO(1)get,put, and eviction. If persistence is required, evictions and updates must also be logged or the cache may resurrect evicted entries after restart. -
TTL semantics need precision. Store
expiresAt = now + ttlusing a monotonic clock where possible for in-process comparisons, but persist wall-clock timestamps if expiration must survive restart. Lazy expiration checks ongetare simple; proactive expiration uses a min-heap or timing wheel but adds concurrency complexity. -
Sliding-window QPS can be implemented with fixed-size time buckets. For a window seconds and bucket size , keep buckets containing request counts and timestamps. Average QPS is after ignoring stale buckets. Smaller buckets improve accuracy but increase update contention and memory.
-
Serialization and versioning matter for a generic store. Use a pluggable
`Serializer<T>`interface that returns bytes and can fail explicitly. Include a schema/version byte in records so future readers can decode old values. Avoid Java/Python object serialization as a default because it is brittle across code changes and unsafe for untrusted data.
Worked example
For Design a durable key-value store, start by clarifying scope in the first 30 seconds: “Is this single-node or distributed? Are keys and values bounded in size? Do we need range scans, TTL, transactions, or only single-key atomicity? What durability guarantee is expected after put returns?” Then declare a reasonable baseline: single-node embedded store, byte-array keys and values, point reads/writes, last-writer-wins, and durability for acknowledged writes.
Organize the answer around four pillars: API and guarantees, write/read path, recovery and compaction, and concurrency/performance. The write path is: serialize record, append to WAL, fsync according to policy, then update the in-memory hash index from key to latest value location. The read path is: check the in-memory index, seek to the offset if values are only on disk, or return directly if the full value is cached in memory.
For recovery, replay valid WAL records from the beginning or from the latest snapshot, rebuild the hash index, and stop at the first partial/corrupt record using a checksum. For compaction, write a new segment containing only the latest live records, atomically install a manifest pointing to the new segment, then delete old segments after no readers depend on them. A key tradeoff to flag is latency versus durability: calling fsync on every write gives strong guarantees but high `p99`; batching fsync every few milliseconds improves throughput but can lose recent acknowledged writes unless acknowledgments wait for the batch flush. Close by saying that with more time you would add checksummed segment files, background compaction scheduling, metrics for log size and recovery time, and optional compareAndSwap for conditional updates.
A second angle
For Design a single-node persistent in-memory cache, the same storage ideas apply, but the priority shifts from durable database semantics to bounded memory and fast access. The core structure is HashMap<Key, Entry> plus an LRU list or segmented LRU policy, with a WAL recording put, delete, and eviction events so restart reconstructs the same logical cache. The tricky part is that get mutates recency state, so a supposedly read-only operation can contend on the LRU lock. A strong design might batch recency updates, use sharded LRUs, or accept approximate LRU to reduce contention. Unlike the durable store, it is acceptable to discuss weaker durability if the cache can be rebuilt, but you must state that assumption explicitly.
Common pitfalls
Pitfall: Treating “durable” as “write to a file.”
Writing bytes to a file is not enough; data can sit in OS page cache, records can be partially written, and directory metadata may not be durable after rename. A better answer distinguishes write, flush, fsync, atomic rename, checksums, and recovery behavior after crashes at different points.
Pitfall: Ignoring the linearization point under concurrency.
A tempting but weak answer says “use a mutex” without explaining what operation becomes atomic. Interviewers may push with two threads doing put(k,1) and get(k) while a WAL append is in progress; you should define whether visibility happens before or after the log is durable and ensure the map and log cannot disagree for acknowledged operations.
Pitfall: Overdesigning into a distributed database.
Do not jump to sharding, consensus, quorum reads, or `Raft` unless the prompt asks for multiple nodes. For these interviews, a precise single-node design with WAL, compaction, locking, and recovery is usually stronger than a vague distributed architecture. If you mention distributed extensions, keep them as optional follow-ups.
Connections
Interviewers may pivot from here into thread-safe data structures, LSM-tree storage engines, cache eviction algorithms, rate counters, or idempotent APIs. They may also ask how this differs from using `Redis`, `RocksDB`, `SQLite`, or an in-process `ConcurrentHashMap`, so be ready to compare guarantees rather than just features.
Further reading
-
Designing Data-Intensive Applications — Martin Kleppmann’s chapters on storage, replication, and consistency give strong mental models for logs, indexes, and crash recovery.
-
The Log-Structured Merge-Tree — the foundational paper behind many write-optimized KV stores such as
`LevelDB`and`RocksDB`. -
RocksDB Wiki — practical details on WALs, memtables, compaction, write amplification, and performance tuning in a real embedded storage engine.
Featured in interview prep guides
Practice questions
- Design a single-node persistent in-memory cacheDatabricks · Software Engineer · Technical Screen · hard
- Design KV store with sliding-window average QPSDatabricks · Software Engineer · Technical Screen · medium
- Identify and handle race conditionsDatabricks · Software Engineer · Technical Screen · hard
- Design a generic key-value storeDatabricks · Software Engineer · Technical Screen · medium
- Design a durable key-value storeDatabricks · Software Engineer · Onsite · hard
Related concepts
- Persistent Key-Value StoresCoding & Algorithms
- Distributed Key-Value Storage And TransactionsSystem Design
- Caching And Stateful Data Structure DesignCoding & Algorithms
- Serialization, Binary Encoding, And Persistent KV StoresCoding & Algorithms
- Stateful Data Structures And OOP API DesignCoding & Algorithms
- Stateful In-Memory Data StructuresCoding & Algorithms