PracHub
QuestionsPremiumCoachesLearningGuidesInterview Prep
|Home/System Design/Databricks

Design a single-node persistent in-memory cache

Last updated: Apr 17, 2026

Quick Overview

This question evaluates a candidate's skills in single-node cache design, including in-memory key/value storage, eviction policy management (e.g., LRU), durability considerations, concurrency control, and techniques for reducing contention such as sharding or striping.

  • hard
  • Databricks
  • System Design
  • Software Engineer

Design a single-node persistent in-memory cache

Company: Databricks

Role: Software Engineer

Category: System Design

Difficulty: hard

Interview Round: Technical Screen

## Scenario Design a **single-machine** cache used by a web service to handle read/write requests. The cache should: - Store key/value pairs in memory (fast reads) - Support typical operations: `Get(key)`, `Put(key, value)`, `Delete(key)` - Enforce an eviction policy such as **LRU** when memory capacity is reached - Optionally provide **durability** so that after a process crash/restart, the cache can recover its contents The interviewer expects: - A concrete design with data structures - Pseudocode-level APIs and key internal logic - Discussion of concurrency (locks), minimizing critical sections, avoiding deadlocks - How to scale on a single machine (e.g., sharding/striping to reduce lock contention)

Quick Answer: This question evaluates a candidate's skills in single-node cache design, including in-memory key/value storage, eviction policy management (e.g., LRU), durability considerations, concurrency control, and techniques for reducing contention such as sharding or striping.

Solution

This is a **low-level / single-machine** design question, not a distributed-systems question. The interviewer wants concrete data structures, real pseudocode, fine-grained reasoning about locks and critical sections, deadlock analysis, and a credible durability story (WAL + recovery). I'll structure the answer accordingly and keep "scale" scoped to one box (threads, cores, memory bandwidth) rather than multiple machines. ### 1) Clarify requirements & scope **Functional** - `Get(key) -> value | miss`, `Put(key, value)`, `Delete(key) -> bool`. - Eviction policy: **LRU** when capacity is reached. - Optional **durability**: survive a process crash / restart and recover contents. **Non-functional** - Single process, **many worker threads** (web service request handlers) hitting the cache concurrently. - Low latency: `Get`/`Put` should be roughly $O(1)$ and avoid blocking on disk I/O. - High concurrent throughput → minimize lock contention, no deadlocks. - Bounded memory: enforce a hard cap (by byte count is more realistic than entry count for variable-sized values). **Questions I'd ask the interviewer (and my default assumptions)** - *Capacity unit?* → bytes (assume entry-count cap is a trivial special case). - *Durability strength?* → make it a knob: **sync** (`fsync` before ack, survives OS crash) vs **async** (background flush, may lose the last few ms). Default async; can tighten to sync. - *Consistency per key?* → linearizable per key: a `Get` ordered after a successful `Put`/`Delete` observes that write. No cross-key transactions in scope. - *Value semantics?* → values are immutable byte blobs; we return a reference-counted copy, not a pointer into the cache (so a concurrent eviction can't free memory under a reader). - *Exact vs approximate LRU?* → exact per-shard LRU; global exact LRU is explicitly given up when we shard (justified below). ### 2) Back-of-envelope sizing (single box) Order-of-magnitude only, to justify the design — not real benchmarks: - Target a cache of, say, $32\,\text{GB}$ RAM, average value $1\,\text{KB}$ → on the order of $3\times10^7$ entries. - Per-entry overhead: key + `Node` (two list pointers + size field) + hash-table slot ≈ tens of bytes; meaningful at $10^7$ entries, so I track bytes including overhead. - Throughput goal: a single global lock serializes everything; a critical section of ~$100\,\text{ns}$ caps you near $10^7$ ops/s and a multi-core box will spend most of its time spinning/parking on that one lock. Sharding into $S$ stripes lets independent keys proceed in parallel, so throughput scales with cores until you hit memory bandwidth — this is the main motivation for striping. - **Never do disk I/O inside the lock.** An `fsync` is on the order of $10^{-3}\,\text{s}$; holding a shard lock across it would collapse throughput by ~$10^4\times$. The WAL `fsync` must always happen outside the critical section (the cheap buffer-append may happen inside it — see §6). ### 3) Core in-memory data structure (the LRU) Classic $O(1)$ LRU = **hash map + intrusive doubly linked list**: - `HashMap<Key, Node*>` for $O(1)$ lookup. - Doubly linked list orders entries by recency: **head = most-recently-used**, **tail = least-recently-used**. - `Node` holds `key, value, prev, next, size`. The `key` is stored on the node too so eviction can erase the map entry without a reverse lookup. ```text struct Node { Key key Value value // immutable, ref-counted buffer Node* prev Node* next size_t size // bytes charged to capacity (key+value+overhead) } ``` Why doubly linked: moving a node to the head on access and unlinking the tail on eviction are both $O(1)$ only if the node knows its neighbors. A singly linked list would make move-to-head $O(n)$. ### 4) Single-thread logic (correctness first, locks later) Pseudocode for one shard (each shard owns its own map + list + byte counters). `cur_bytes`/`max_bytes` bound the shard. Locking is added in §5; here I show only the in-memory mutation each op performs while it holds its shard lock. ```text Get(k): node = map.find(k) if node == null: return MISS move_to_head(node) return node.value // immutable, ref-counted buffer (kept alive past lock release; see §5c) Put(k, v): node = map.find(k) if node != null: # update existing cur_bytes += size(v) - node.size node.value = v; node.size = size(v) move_to_head(node) else: # insert new node = new Node(k, v) map[k] = node push_front(node) # new entry is MRU → at the head cur_bytes += node.size evict_until_under_cap() Delete(k): node = map.find(k) if node == null: return false unlink(node); map.erase(k); cur_bytes -= node.size return true evict_until_under_cap(): while cur_bytes > max_bytes and list not empty: victim = tail unlink(victim); map.erase(victim.key); cur_bytes -= victim.size # if durable: stage an EVICT/DEL record for victim.key (see §6) ``` **Capacity by bytes:** charge `size = bytes(key) + bytes(value) + fixed_overhead`. A `Put` that updates an existing key must adjust `cur_bytes` by the *delta*, and may itself trigger eviction of other keys to make room. Note the just-written key is never the victim: `push_front`/`move_to_head` makes it the **head (MRU)**, and `evict_until_under_cap` always evicts from the **tail (LRU)** — so the new key is structurally safe regardless of when the eviction loop runs. (The genuine edge case — a single value larger than the whole shard cap — is handled in §7, not here.) ### 5) Concurrency design #### 5a. Naive baseline: one global mutex One `mutex` guards map + list + counters. Correct and trivial, but every `Get` is a write (it reorders the list), so even read-heavy traffic fully serializes. Unacceptable on a multi-core box. State it, then improve. #### 5b. Recommended: sharded / striped cache Partition the key space into $S$ independent shards (e.g., $S = 256$, a power of two): ```text shard = shards[ hash(key) & (S - 1) ] // S power of two → mask, no modulo ``` - Each shard has its **own lock, map, list, and byte cap** (`max_bytes / S`). - Independent keys land in different shards → their operations run fully in parallel. - Contention now scales with collisions in the *same* shard, not all traffic. Pick $S \gg$ core count to keep collision probability low. **Tradeoff — LRU becomes per-shard, not global.** A globally-cold key in a hot shard can be evicted while a warmer key in a quiet shard survives. With a decent hash and $S \gg$ cores, the hit-rate loss is small in practice; I'd call it **"sharded LRU"** and flag it as the deliberate cost of removing the global lock. (True global LRU needs a single shared recency structure, which reintroduces the global hot spot — not worth it here.) #### 5c. Minimizing the critical section - **`Get` mutates recency**, so the obvious implementation takes the shard lock exclusively even on reads. A `RWLock` doesn't help much because move-to-head is a write; the read fraction that *also* needs to reorder still needs exclusive access. Two ways to make `Get` lighter if reads dominate: 1. **Sample/approximate LRU:** only move-to-head probabilistically (e.g., 1 in N accesses), or use a CLOCK / second-chance bit instead of strict list moves. Reads become a lock-light "set a referenced bit" op; eviction sweeps the clock hand. This trades exact LRU for cheaper reads. 2. Keep strict LRU but make the locked region *only* the pointer surgery: bump the value's ref-count under the lock, release the lock, then copy/read the buffer **after** the lock is dropped (the ref-count taken under the lock keeps it alive even if a concurrent eviction unlinks the node). - **Do all expensive work outside the lock:** allocation of the new `Node`/value buffer, hashing, and especially WAL `fsync` happen before/after the locked region, never inside it. The locked region is just: map lookup/insert, list splice, counter update, and (when durable) a cheap buffered WAL append (§6). #### 5d. Deadlock avoidance Single-key ops take exactly one shard lock → **no deadlock possible** (can't hold-and-wait across two locks). The only deadlock risk is a *future* multi-key operation (e.g., an atomic multi-`Put`, or a background snapshot that wants several shards). Rules: - **Global lock ordering:** always acquire shard locks in ascending shard-id order. A cycle in the wait-for graph is impossible if every thread acquires in the same total order. - Or **`try_lock` + backoff**: lock shard A, `try_lock` shard B; on failure release A and retry, breaking hold-and-wait. - Snapshots avoid the problem entirely by locking **one shard at a time** (see §6). This is exactly the "eyeball the deadlock" follow-up the interviewer probes: the invariant to recite is *single lock per op, or a strict global acquisition order for multi-lock ops.* ### 6) Durability: Write-Ahead Log + snapshots **Goal:** after crash/restart, rebuild cache contents. Approach: append-only **WAL** for incremental writes + periodic **snapshot** to bound WAL size. #### WAL record format Append-only, self-describing, corruption-detecting: ```text [ len:4 ][ type:1 ][ lsn:8 ][ key_len:4 ][ key ][ val_len:4 ][ value ][ crc32:4 ] type ∈ { PUT, DEL } // EVICT is logged as DEL to keep replay deterministic ``` - **Length prefix + trailing CRC** detect a torn final write (crash mid-append): on recovery, a record whose CRC fails or whose length runs past EOF is discarded — we stop replay there. - **LSN** (monotonic log sequence number) lets us correlate WAL position with snapshots. #### Write path (two durability modes — make it a knob) The subtle correctness point: **the WAL must reflect the order in which operations took effect in memory.** If two threads `Put` the same key, the WAL order must match the in-memory winner, or recovery diverges from the live state. The way to guarantee this is non-negotiable: **assign the LSN and append the record under the same shard lock that applies the mutation**, so for any key, log order == apply order by construction. The append is just a buffered write (cheap); the `fsync` still happens after the lock is released. ```text Put(k, v): # durable, canonical path shard = shard_for(k) rec = encode(PUT, k, v) # encode/alloc OUTSIDE the lock (no I/O here) lock(shard) lsn = wal.append(rec) # assign LSN + buffered append, UNDER the lock, # so log order == in-memory apply order for this key. # Buffered only — NO fsync inside the lock. apply Put(k, v) in memory # §4 logic, same critical section unlock(shard) if mode == SYNC: wal.fsync_through(lsn) # fsync OUTSIDE the lock; ack only after durable return ok ``` - **SYNC durability:** acknowledge only after `fsync` covers this record's LSN. Survives OS/power loss; higher latency. Use **group commit**: a background flusher `fsync`s once and wakes all writers whose LSN ≤ the flushed point — amortizes one `fsync` over many writes. - **ASYNC durability:** a background thread flushes the WAL buffer periodically; `Put` acks immediately after `unlock`. Lower latency, may lose the last few ms on crash. Default. Why append *inside* the shard lock rather than before it: appending before `lock(shard)` would decouple LSN-assignment order from in-memory-apply order — two concurrent `Put`s to the same key could append in one order but apply in the other, so the recovered last-writer would differ from the live last-writer. Doing both under the one lock removes that race. The cost is that WAL appends serialize *per shard* (not globally); since the WAL is the union of all shards' records ordered by LSN, and recovery only needs *per-key* last-writer-wins (independent keys are independent), per-shard append serialization is exactly sufficient. If a use case ever needed a single global total order of appends across keys, that would require a global log-append lock — explicitly out of scope for a cache recovering independent keys. #### Recovery (on startup) ```text recover(): load latest valid snapshot S # gives state up to snapshot_lsn for rec in WAL where rec.lsn > snapshot_lsn: if crc_ok(rec): apply rec (PUT/DEL) into memory # in ascending LSN order else: break # torn/partial tail record → stop # capacity may now be exceeded → run eviction to re-enforce caps re-run evict_until_under_cap() per shard ``` Replaying `PUT`/`DEL` in LSN order reconstructs the map; because evictions were logged as `DEL`, replay is deterministic and won't resurrect evicted keys. #### Snapshots / compaction (bound WAL growth) The WAL grows unbounded; periodically snapshot and truncate: - Background snapshot writes current contents to a new file and records `snapshot_lsn` (the highest LSN included). - After the snapshot is durably written, **truncate the WAL** before `snapshot_lsn`. - **Snapshotting without a long global pause:** iterate **shard by shard** — lock one shard, serialize its map, unlock, move to the next. This never holds more than one lock (deadlock-free) and never freezes the whole cache. The snapshot is then *fuzzy* (different shards captured at slightly different times); correctness is preserved because any write during the snapshot is also in the WAL after `snapshot_lsn`, so recovery (snapshot + WAL tail replay) is consistent. A copy-on-write page snapshot is the more advanced alternative but rarely needed in an interview. #### Eviction-vs-durability semantics (be explicit) Two coherent choices: 1. **WAL reconstructs exact in-memory state** → log evictions as `DEL` (chosen above). Replay reproduces the live cache exactly. 2. **WAL is best-effort warm-start** → log only `PUT`/`DEL` from clients, *don't* log evictions; on recovery the cache may temporarily exceed capacity and re-evicts itself. Simpler, but recovered set ⊇ live set. I'd pick (1) for determinism and mention (2) as the cheaper alternative. ### 7) Bottlenecks & tradeoffs | Concern | Choice | Tradeoff | |---|---|---| | Lock contention | Stripe into $S$ shards | Loses global LRU → "sharded LRU"; hit-rate slightly lower | | Read cost (reads mutate recency) | Exact list-move under lock, **or** CLOCK/sampled LRU | Cheaper reads vs strict recency accuracy | | Durability latency | SYNC + group commit vs ASYNC | Data loss window vs per-op latency | | WAL growth | Snapshot + truncate | Snapshot CPU/IO + fuzzy-snapshot reasoning | | Memory fragmentation at $10^7$ entries | Slab / arena allocator, pool `Node`s | Implementation complexity | | Hot key in one shard | More shards / re-hash | Diminishing returns; a single hot key can't be parallelized | | Use-after-free on eviction during read | Return ref-counted value buffers | Extra atomic ref-count ops | **Key edge cases:** concurrent `Put`+`Get` on the same key (resolved by per-shard lock + ref-counted values); crash mid-WAL-record (CRC/length framing stops replay); a single value larger than a shard's `max_bytes` (reject or special-case "uncacheable" — otherwise `evict_until_under_cap` would loop forever trying to get under cap); `Put` updating an existing key must adjust `cur_bytes` by the delta, not the full size. ### 8) Summary In-memory cache = **hash map + intrusive doubly linked list** for $O(1)$ `Get`/`Put`/`Delete`/LRU. Concurrency = **stripe into $S$ shards**, one lock each, so independent keys run in parallel; keep the critical section to pure pointer surgery plus a cheap buffered WAL append, and never `fsync` under a lock; deadlock-free because each op holds one lock (multi-lock ops use a strict global lock order). Durability = **WAL** (length-prefix + CRC framing, SYNC-with-group-commit or ASYNC modes) plus **per-shard fuzzy snapshots** to bound log size; recovery = load snapshot then replay WAL tail in LSN order, re-enforcing capacity. The non-negotiable invariants: assign each record's LSN under the same shard lock that applies the mutation (so log order == apply order per key), and keep disk I/O strictly outside critical sections. The deliberate cost is sharded (not global) LRU.

Related Interview Questions

  • Design a Slack-Like Messaging System - Databricks (medium)
  • Design a Book Price Aggregator - Databricks (medium)
  • Design a Distributed File System - Databricks (medium)
  • Design a stock order manager - Databricks (medium)
  • Design a Hierarchical File System - Databricks (hard)
Databricks logo
Databricks
Jan 22, 2026, 12:00 AM
Software Engineer
Technical Screen
System Design
172
0

Scenario

Design a single-machine cache used by a web service to handle read/write requests.

The cache should:

  • Store key/value pairs in memory (fast reads)
  • Support typical operations: Get(key) , Put(key, value) , Delete(key)
  • Enforce an eviction policy such as LRU when memory capacity is reached
  • Optionally provide durability so that after a process crash/restart, the cache can recover its contents

The interviewer expects:

  • A concrete design with data structures
  • Pseudocode-level APIs and key internal logic
  • Discussion of concurrency (locks), minimizing critical sections, avoiding deadlocks
  • How to scale on a single machine (e.g., sharding/striping to reduce lock contention)

Solution

Show

Submit Your Answer to Earn 20XP

Sign in to leave a comment

Loading comments...

Browse More Questions

More System Design•More Databricks•More Software Engineer•Databricks Software Engineer•Databricks System Design•Software Engineer System Design
PracHub

Master your tech interviews with 8,000+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.