## Scenario
Design a **single-machine** cache used by a web service to handle read/write requests.
The cache should:
- Store key/value pairs in memory (fast reads)
- Support typical operations: `Get(key)`, `Put(key, value)`, `Delete(key)`
- Enforce an eviction policy such as **LRU** when memory capacity is reached
- Optionally provide **durability** so that after a process crash/restart, the cache can recover its contents
The interviewer expects:
- A concrete design with data structures
- Pseudocode-level APIs and key internal logic
- Discussion of concurrency (locks), minimizing critical sections, avoiding deadlocks
- How to scale on a single machine (e.g., sharding/striping to reduce lock contention)
Quick Answer: This question evaluates a candidate's skills in single-node cache design, including in-memory key/value storage, eviction policy management (e.g., LRU), durability considerations, concurrency control, and techniques for reducing contention such as sharding or striping.
Solution
This is a **low-level / single-machine** design question, not a distributed-systems question. The interviewer wants concrete data structures, real pseudocode, fine-grained reasoning about locks and critical sections, deadlock analysis, and a credible durability story (WAL + recovery). I'll structure the answer accordingly and keep "scale" scoped to one box (threads, cores, memory bandwidth) rather than multiple machines.
### 1) Clarify requirements & scope
**Functional**
- `Get(key) -> value | miss`, `Put(key, value)`, `Delete(key) -> bool`.
- Eviction policy: **LRU** when capacity is reached.
- Optional **durability**: survive a process crash / restart and recover contents.
**Non-functional**
- Single process, **many worker threads** (web service request handlers) hitting the cache concurrently.
- Low latency: `Get`/`Put` should be roughly $O(1)$ and avoid blocking on disk I/O.
- High concurrent throughput → minimize lock contention, no deadlocks.
- Bounded memory: enforce a hard cap (by byte count is more realistic than entry count for variable-sized values).
**Questions I'd ask the interviewer (and my default assumptions)**
- *Capacity unit?* → bytes (assume entry-count cap is a trivial special case).
- *Durability strength?* → make it a knob: **sync** (`fsync` before ack, survives OS crash) vs **async** (background flush, may lose the last few ms). Default async; can tighten to sync.
- *Consistency per key?* → linearizable per key: a `Get` ordered after a successful `Put`/`Delete` observes that write. No cross-key transactions in scope.
- *Value semantics?* → values are immutable byte blobs; we return a reference-counted copy, not a pointer into the cache (so a concurrent eviction can't free memory under a reader).
- *Exact vs approximate LRU?* → exact per-shard LRU; global exact LRU is explicitly given up when we shard (justified below).
### 2) Back-of-envelope sizing (single box)
Order-of-magnitude only, to justify the design — not real benchmarks:
- Target a cache of, say, $32\,\text{GB}$ RAM, average value $1\,\text{KB}$ → on the order of $3\times10^7$ entries.
- Per-entry overhead: key + `Node` (two list pointers + size field) + hash-table slot ≈ tens of bytes; meaningful at $10^7$ entries, so I track bytes including overhead.
- Throughput goal: a single global lock serializes everything; a critical section of ~$100\,\text{ns}$ caps you near $10^7$ ops/s and a multi-core box will spend most of its time spinning/parking on that one lock. Sharding into $S$ stripes lets independent keys proceed in parallel, so throughput scales with cores until you hit memory bandwidth — this is the main motivation for striping.
- **Never do disk I/O inside the lock.** An `fsync` is on the order of $10^{-3}\,\text{s}$; holding a shard lock across it would collapse throughput by ~$10^4\times$. The WAL `fsync` must always happen outside the critical section (the cheap buffer-append may happen inside it — see §6).
### 3) Core in-memory data structure (the LRU)
Classic $O(1)$ LRU = **hash map + intrusive doubly linked list**:
- `HashMap<Key, Node*>` for $O(1)$ lookup.
- Doubly linked list orders entries by recency: **head = most-recently-used**, **tail = least-recently-used**.
- `Node` holds `key, value, prev, next, size`. The `key` is stored on the node too so eviction can erase the map entry without a reverse lookup.
```text
struct Node {
Key key
Value value // immutable, ref-counted buffer
Node* prev
Node* next
size_t size // bytes charged to capacity (key+value+overhead)
}
```
Why doubly linked: moving a node to the head on access and unlinking the tail on eviction are both $O(1)$ only if the node knows its neighbors. A singly linked list would make move-to-head $O(n)$.
### 4) Single-thread logic (correctness first, locks later)
Pseudocode for one shard (each shard owns its own map + list + byte counters). `cur_bytes`/`max_bytes` bound the shard. Locking is added in §5; here I show only the in-memory mutation each op performs while it holds its shard lock.
```text
Get(k):
node = map.find(k)
if node == null: return MISS
move_to_head(node)
return node.value // immutable, ref-counted buffer (kept alive past lock release; see §5c)
Put(k, v):
node = map.find(k)
if node != null: # update existing
cur_bytes += size(v) - node.size
node.value = v; node.size = size(v)
move_to_head(node)
else: # insert new
node = new Node(k, v)
map[k] = node
push_front(node) # new entry is MRU → at the head
cur_bytes += node.size
evict_until_under_cap()
Delete(k):
node = map.find(k)
if node == null: return false
unlink(node); map.erase(k); cur_bytes -= node.size
return true
evict_until_under_cap():
while cur_bytes > max_bytes and list not empty:
victim = tail
unlink(victim); map.erase(victim.key); cur_bytes -= victim.size
# if durable: stage an EVICT/DEL record for victim.key (see §6)
```
**Capacity by bytes:** charge `size = bytes(key) + bytes(value) + fixed_overhead`. A `Put` that updates an existing key must adjust `cur_bytes` by the *delta*, and may itself trigger eviction of other keys to make room. Note the just-written key is never the victim: `push_front`/`move_to_head` makes it the **head (MRU)**, and `evict_until_under_cap` always evicts from the **tail (LRU)** — so the new key is structurally safe regardless of when the eviction loop runs. (The genuine edge case — a single value larger than the whole shard cap — is handled in §7, not here.)
### 5) Concurrency design
#### 5a. Naive baseline: one global mutex
One `mutex` guards map + list + counters. Correct and trivial, but every `Get` is a write (it reorders the list), so even read-heavy traffic fully serializes. Unacceptable on a multi-core box. State it, then improve.
#### 5b. Recommended: sharded / striped cache
Partition the key space into $S$ independent shards (e.g., $S = 256$, a power of two):
```text
shard = shards[ hash(key) & (S - 1) ] // S power of two → mask, no modulo
```
- Each shard has its **own lock, map, list, and byte cap** (`max_bytes / S`).
- Independent keys land in different shards → their operations run fully in parallel.
- Contention now scales with collisions in the *same* shard, not all traffic. Pick $S \gg$ core count to keep collision probability low.
**Tradeoff — LRU becomes per-shard, not global.** A globally-cold key in a hot shard can be evicted while a warmer key in a quiet shard survives. With a decent hash and $S \gg$ cores, the hit-rate loss is small in practice; I'd call it **"sharded LRU"** and flag it as the deliberate cost of removing the global lock. (True global LRU needs a single shared recency structure, which reintroduces the global hot spot — not worth it here.)
#### 5c. Minimizing the critical section
- **`Get` mutates recency**, so the obvious implementation takes the shard lock exclusively even on reads. A `RWLock` doesn't help much because move-to-head is a write; the read fraction that *also* needs to reorder still needs exclusive access. Two ways to make `Get` lighter if reads dominate:
1. **Sample/approximate LRU:** only move-to-head probabilistically (e.g., 1 in N accesses), or use a CLOCK / second-chance bit instead of strict list moves. Reads become a lock-light "set a referenced bit" op; eviction sweeps the clock hand. This trades exact LRU for cheaper reads.
2. Keep strict LRU but make the locked region *only* the pointer surgery: bump the value's ref-count under the lock, release the lock, then copy/read the buffer **after** the lock is dropped (the ref-count taken under the lock keeps it alive even if a concurrent eviction unlinks the node).
- **Do all expensive work outside the lock:** allocation of the new `Node`/value buffer, hashing, and especially WAL `fsync` happen before/after the locked region, never inside it. The locked region is just: map lookup/insert, list splice, counter update, and (when durable) a cheap buffered WAL append (§6).
#### 5d. Deadlock avoidance
Single-key ops take exactly one shard lock → **no deadlock possible** (can't hold-and-wait across two locks).
The only deadlock risk is a *future* multi-key operation (e.g., an atomic multi-`Put`, or a background snapshot that wants several shards). Rules:
- **Global lock ordering:** always acquire shard locks in ascending shard-id order. A cycle in the wait-for graph is impossible if every thread acquires in the same total order.
- Or **`try_lock` + backoff**: lock shard A, `try_lock` shard B; on failure release A and retry, breaking hold-and-wait.
- Snapshots avoid the problem entirely by locking **one shard at a time** (see §6).
This is exactly the "eyeball the deadlock" follow-up the interviewer probes: the invariant to recite is *single lock per op, or a strict global acquisition order for multi-lock ops.*
### 6) Durability: Write-Ahead Log + snapshots
**Goal:** after crash/restart, rebuild cache contents. Approach: append-only **WAL** for incremental writes + periodic **snapshot** to bound WAL size.
#### WAL record format
Append-only, self-describing, corruption-detecting:
```text
[ len:4 ][ type:1 ][ lsn:8 ][ key_len:4 ][ key ][ val_len:4 ][ value ][ crc32:4 ]
type ∈ { PUT, DEL } // EVICT is logged as DEL to keep replay deterministic
```
- **Length prefix + trailing CRC** detect a torn final write (crash mid-append): on recovery, a record whose CRC fails or whose length runs past EOF is discarded — we stop replay there.
- **LSN** (monotonic log sequence number) lets us correlate WAL position with snapshots.
#### Write path (two durability modes — make it a knob)
The subtle correctness point: **the WAL must reflect the order in which operations took effect in memory.** If two threads `Put` the same key, the WAL order must match the in-memory winner, or recovery diverges from the live state. The way to guarantee this is non-negotiable: **assign the LSN and append the record under the same shard lock that applies the mutation**, so for any key, log order == apply order by construction. The append is just a buffered write (cheap); the `fsync` still happens after the lock is released.
```text
Put(k, v): # durable, canonical path
shard = shard_for(k)
rec = encode(PUT, k, v) # encode/alloc OUTSIDE the lock (no I/O here)
lock(shard)
lsn = wal.append(rec) # assign LSN + buffered append, UNDER the lock,
# so log order == in-memory apply order for this key.
# Buffered only — NO fsync inside the lock.
apply Put(k, v) in memory # §4 logic, same critical section
unlock(shard)
if mode == SYNC: wal.fsync_through(lsn) # fsync OUTSIDE the lock; ack only after durable
return ok
```
- **SYNC durability:** acknowledge only after `fsync` covers this record's LSN. Survives OS/power loss; higher latency. Use **group commit**: a background flusher `fsync`s once and wakes all writers whose LSN ≤ the flushed point — amortizes one `fsync` over many writes.
- **ASYNC durability:** a background thread flushes the WAL buffer periodically; `Put` acks immediately after `unlock`. Lower latency, may lose the last few ms on crash. Default.
Why append *inside* the shard lock rather than before it: appending before `lock(shard)` would decouple LSN-assignment order from in-memory-apply order — two concurrent `Put`s to the same key could append in one order but apply in the other, so the recovered last-writer would differ from the live last-writer. Doing both under the one lock removes that race. The cost is that WAL appends serialize *per shard* (not globally); since the WAL is the union of all shards' records ordered by LSN, and recovery only needs *per-key* last-writer-wins (independent keys are independent), per-shard append serialization is exactly sufficient. If a use case ever needed a single global total order of appends across keys, that would require a global log-append lock — explicitly out of scope for a cache recovering independent keys.
#### Recovery (on startup)
```text
recover():
load latest valid snapshot S # gives state up to snapshot_lsn
for rec in WAL where rec.lsn > snapshot_lsn:
if crc_ok(rec):
apply rec (PUT/DEL) into memory # in ascending LSN order
else:
break # torn/partial tail record → stop
# capacity may now be exceeded → run eviction to re-enforce caps
re-run evict_until_under_cap() per shard
```
Replaying `PUT`/`DEL` in LSN order reconstructs the map; because evictions were logged as `DEL`, replay is deterministic and won't resurrect evicted keys.
#### Snapshots / compaction (bound WAL growth)
The WAL grows unbounded; periodically snapshot and truncate:
- Background snapshot writes current contents to a new file and records `snapshot_lsn` (the highest LSN included).
- After the snapshot is durably written, **truncate the WAL** before `snapshot_lsn`.
- **Snapshotting without a long global pause:** iterate **shard by shard** — lock one shard, serialize its map, unlock, move to the next. This never holds more than one lock (deadlock-free) and never freezes the whole cache. The snapshot is then *fuzzy* (different shards captured at slightly different times); correctness is preserved because any write during the snapshot is also in the WAL after `snapshot_lsn`, so recovery (snapshot + WAL tail replay) is consistent. A copy-on-write page snapshot is the more advanced alternative but rarely needed in an interview.
#### Eviction-vs-durability semantics (be explicit)
Two coherent choices:
1. **WAL reconstructs exact in-memory state** → log evictions as `DEL` (chosen above). Replay reproduces the live cache exactly.
2. **WAL is best-effort warm-start** → log only `PUT`/`DEL` from clients, *don't* log evictions; on recovery the cache may temporarily exceed capacity and re-evicts itself. Simpler, but recovered set ⊇ live set.
I'd pick (1) for determinism and mention (2) as the cheaper alternative.
### 7) Bottlenecks & tradeoffs
| Concern | Choice | Tradeoff |
|---|---|---|
| Lock contention | Stripe into $S$ shards | Loses global LRU → "sharded LRU"; hit-rate slightly lower |
| Read cost (reads mutate recency) | Exact list-move under lock, **or** CLOCK/sampled LRU | Cheaper reads vs strict recency accuracy |
| Durability latency | SYNC + group commit vs ASYNC | Data loss window vs per-op latency |
| WAL growth | Snapshot + truncate | Snapshot CPU/IO + fuzzy-snapshot reasoning |
| Memory fragmentation at $10^7$ entries | Slab / arena allocator, pool `Node`s | Implementation complexity |
| Hot key in one shard | More shards / re-hash | Diminishing returns; a single hot key can't be parallelized |
| Use-after-free on eviction during read | Return ref-counted value buffers | Extra atomic ref-count ops |
**Key edge cases:** concurrent `Put`+`Get` on the same key (resolved by per-shard lock + ref-counted values); crash mid-WAL-record (CRC/length framing stops replay); a single value larger than a shard's `max_bytes` (reject or special-case "uncacheable" — otherwise `evict_until_under_cap` would loop forever trying to get under cap); `Put` updating an existing key must adjust `cur_bytes` by the delta, not the full size.
### 8) Summary
In-memory cache = **hash map + intrusive doubly linked list** for $O(1)$ `Get`/`Put`/`Delete`/LRU. Concurrency = **stripe into $S$ shards**, one lock each, so independent keys run in parallel; keep the critical section to pure pointer surgery plus a cheap buffered WAL append, and never `fsync` under a lock; deadlock-free because each op holds one lock (multi-lock ops use a strict global lock order). Durability = **WAL** (length-prefix + CRC framing, SYNC-with-group-commit or ASYNC modes) plus **per-shard fuzzy snapshots** to bound log size; recovery = load snapshot then replay WAL tail in LSN order, re-enforcing capacity. The non-negotiable invariants: assign each record's LSN under the same shard lock that applies the mutation (so log order == apply order per key), and keep disk I/O strictly outside critical sections. The deliberate cost is sharded (not global) LRU.