Design a single-node persistent in-memory cache
Company: Databricks
Role: Software Engineer
Category: System Design
Difficulty: hard
Interview Round: Technical Screen
## Scenario
Design a **single-machine** cache used by a web service to handle read/write requests.
The cache should:
- Store key/value pairs in memory (fast reads)
- Support typical operations: `Get(key)`, `Put(key, value)`, `Delete(key)`
- Enforce an eviction policy such as **LRU** when memory capacity is reached
- Optionally provide **durability** so that after a process crash/restart, the cache can recover its contents
The interviewer expects:
- A concrete design with data structures
- Pseudocode-level APIs and key internal logic
- Discussion of concurrency (locks), minimizing critical sections, avoiding deadlocks
- How to scale on a single machine (e.g., sharding/striping to reduce lock contention)
Quick Answer: This question evaluates a candidate's skills in single-node cache design, including in-memory key/value storage, eviction policy management (e.g., LRU), durability considerations, concurrency control, and techniques for reducing contention such as sharding or striping.
## Solution
## 1) Clarify requirements and make explicit assumptions
Ask/assume:
- **Key/value sizes**: assume variable-sized values; capacity measured either in entry count or bytes.
- **Consistency**: single-process cache; operations should be linearizable per key (a `Get` sees the latest successful `Put` if ordered after it).
- **Durability mode** (optional): if enabled, after restart we recover to the last persisted state. Define whether a `Put` is acknowledged only after its WAL record is flushed (stronger durability) or asynchronously (faster).
- **Eviction**: LRU is acceptable; exact LRU vs approximate can be discussed.
## 2) Core in-memory data structures (O(1) average)
Classic LRU cache:
- `HashMap<Key, *Node>` for O(1) lookup.
- Doubly linked list for recency ordering:
  - Head = most recently used
  - Tail = least recently used
- `Node` contains: `key`, `value`, `prev`, `next`, and optionally `value_size`.
Operations:
- `Get(k)`: lookup in map; if found, move node to head.
- `Put(k,v)`:
  - if exists: update value, move to head
  - else: insert new node at head; if over capacity, evict from tail
- `Delete(k)`: remove from map and list.
### Eviction with byte capacity
Maintain `current_bytes` and `max_bytes`. On `Put`, adjust `current_bytes` by the size delta of the value and evict from the tail until `current_bytes <= max_bytes` (see the sketch below).
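A minimal single-threaded sketch of this structure, assuming capacity is measured in value bytes only (keys are not counted) and using `std::list::splice` for the O(1) move-to-head:

```cpp
#include <list>
#include <optional>
#include <string>
#include <unordered_map>

// Single-threaded LRU with byte-based capacity (illustrative sketch).
class LruCache {
    struct Node { std::string key; std::string value; };

    std::list<Node> lru_;                                              // front = most recently used
    std::unordered_map<std::string, std::list<Node>::iterator> map_;  // key -> list node
    size_t bytes_ = 0;                                                 // value bytes currently held
    size_t max_bytes_;

public:
    explicit LruCache(size_t max_bytes) : max_bytes_(max_bytes) {}

    std::optional<std::string> Get(const std::string& key) {
        auto it = map_.find(key);
        if (it == map_.end()) return std::nullopt;
        lru_.splice(lru_.begin(), lru_, it->second);    // move node to head in O(1)
        return it->second->value;
    }

    void Put(const std::string& key, std::string value) {
        auto it = map_.find(key);
        if (it != map_.end()) {                          // existing key: update and move to head
            bytes_ = bytes_ - it->second->value.size() + value.size();
            it->second->value = std::move(value);
            lru_.splice(lru_.begin(), lru_, it->second);
        } else {                                         // new key: insert at head
            bytes_ += value.size();
            lru_.push_front(Node{key, std::move(value)});
            map_[key] = lru_.begin();
        }
        while (bytes_ > max_bytes_ && !lru_.empty()) {   // evict from tail until within capacity
            bytes_ -= lru_.back().value.size();
            map_.erase(lru_.back().key);
            lru_.pop_back();
        }
    }

    void Delete(const std::string& key) {
        auto it = map_.find(key);
        if (it == map_.end()) return;
        bytes_ -= it->second->value.size();
        lru_.erase(it->second);
        map_.erase(it);
    }
};
```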
## 3) Concurrency design (single machine, many threads)
### Option A: One global mutex (simple)
- Protect both map + list with one `std::mutex`.
- Easy correctness, but high contention.
### Option B: Sharded/striped cache (recommended)
Partition keys into `S` shards by hash:
- Each shard has its own `map`, `LRU list`, `capacity`, and `lock`.
- `shard = hash(key) % S`
- Greatly reduces lock contention.
**Tradeoff**: LRU becomes per-shard (not global). Usually acceptable; call it “sharded LRU” (see the sketch below).
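A sketch of the striping layer on top of the per-shard LRU above, assuming the total byte budget is split evenly across shards and `std::hash` is used for shard selection:

```cpp
#include <memory>
#include <mutex>
#include <optional>
#include <string>
#include <vector>

// Sharded/striped cache: each shard owns its own lock and its own LRU structures.
class ShardedCache {
    struct Shard {
        std::mutex m;
        LruCache cache;                         // per-shard LRU from the previous sketch
        explicit Shard(size_t max_bytes) : cache(max_bytes) {}
    };
    std::vector<std::unique_ptr<Shard>> shards_;

    Shard& ShardFor(const std::string& key) {
        size_t h = std::hash<std::string>{}(key);
        return *shards_[h % shards_.size()];    // shard = hash(key) % S
    }

public:
    ShardedCache(size_t num_shards, size_t total_max_bytes) {
        for (size_t i = 0; i < num_shards; ++i)
            shards_.push_back(std::make_unique<Shard>(total_max_bytes / num_shards));
    }

    std::optional<std::string> Get(const std::string& key) {
        Shard& s = ShardFor(key);
        std::lock_guard<std::mutex> lk(s.m);    // contention limited to one shard
        return s.cache.Get(key);
    }

    void Put(const std::string& key, std::string value) {
        Shard& s = ShardFor(key);
        std::lock_guard<std::mutex> lk(s.m);
        s.cache.Put(key, std::move(value));
    }

    void Delete(const std::string& key) {
        Shard& s = ShardFor(key);
        std::lock_guard<std::mutex> lk(s.m);
        s.cache.Delete(key);
    }
};
```

Splitting the byte budget evenly per shard is one simple policy; a shared budget with a global atomic counter is possible but reintroduces cross-shard coordination.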
### Minimizing critical sections
Keep locked regions small:
- Do not perform slow I/O (WAL fsync, compaction) while holding shard locks.
- For `Get`, you must update recency (a list move), which typically needs write access. Options:
  - Always take an exclusive lock (simplest).
  - Or use a reader-writer lock, though the recency move still needs a write section; the added complexity is often not worth it.
### Deadlock avoidance rules
If an operation ever needs multiple shard locks (rare; e.g., future multi-key ops):
- Acquire locks in a strict global order (by shard id).
- Or use `try_lock` + backoff (the strict-ordering option is sketched below).
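If a multi-key operation is ever added, a small sketch of the ordered-locking rule (the helper and its `body` callback are hypothetical):

```cpp
#include <functional>
#include <mutex>

// Acquire two shard locks in ascending shard-id order so concurrent
// multi-shard operations can never deadlock on each other.
void WithTwoShardsLocked(std::mutex& m1, size_t shard_id1,
                         std::mutex& m2, size_t shard_id2,
                         const std::function<void()>& body) {
    if (shard_id1 == shard_id2) {                         // same shard: a single lock suffices
        std::lock_guard<std::mutex> g(m1);
        body();
        return;
    }
    std::mutex& lo = (shard_id1 < shard_id2) ? m1 : m2;   // strict global order by shard id
    std::mutex& hi = (shard_id1 < shard_id2) ? m2 : m1;
    std::lock_guard<std::mutex> g1(lo);
    std::lock_guard<std::mutex> g2(hi);
    body();
}
```

For the second option, `std::scoped_lock(m1, m2)` applies the standard library's deadlock-avoidance algorithm, which covers the `try_lock` + backoff approach without an explicit ordering rule.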
## 4) Persistence via WAL (Write-Ahead Log)
### WAL goals
- Recover cache contents after crash.
- Ensure operations can be replayed.
### WAL record format
Append-only log records like:
- `PUT key value [timestamp/version]`
- `DEL key`
Use length-prefix framing plus a checksum to detect torn writes (one possible encoding is sketched below).
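One possible encoding, assuming a `[u32 length][u32 checksum][payload]` layout; the checksum here is an illustrative FNV-1a hash standing in for a real CRC32:

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// Illustrative WAL framing: [u32 payload_len][u32 checksum][payload bytes].
// The payload would carry e.g. "PUT <key> <value>" or "DEL <key>" in a real encoding.

static uint32_t Fnv1a(const char* data, size_t n) {   // simple checksum stand-in for CRC32
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < n; ++i) { h ^= static_cast<uint8_t>(data[i]); h *= 16777619u; }
    return h;
}

std::string FrameRecord(const std::string& payload) {
    uint32_t len = static_cast<uint32_t>(payload.size());
    uint32_t sum = Fnv1a(payload.data(), payload.size());
    std::string out(8 + payload.size(), '\0');
    std::memcpy(&out[0], &len, 4);
    std::memcpy(&out[4], &sum, 4);
    std::memcpy(&out[8], payload.data(), payload.size());
    return out;
}

// Returns false on a torn/corrupt record (short read or checksum mismatch).
bool ParseRecord(const std::string& buf, size_t& offset, std::string& payload_out) {
    if (offset + 8 > buf.size()) return false;                   // not even a full header
    uint32_t len, sum;
    std::memcpy(&len, &buf[offset], 4);
    std::memcpy(&sum, &buf[offset + 4], 4);
    if (offset + 8 + len > buf.size()) return false;             // partial (torn) record
    payload_out.assign(buf, offset + 8, len);
    if (Fnv1a(payload_out.data(), len) != sum) return false;     // corrupt record
    offset += 8 + len;
    return true;
}
```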
### Write path options
1) **Sync durability** (strong):
   - On `Put/Delete`, append to WAL and `fsync` before acknowledging.
   - Pros: data survives OS crash.
   - Cons: higher latency.
2) **Async durability** (common):
   - Enqueue WAL records to a background flusher thread.
   - Acknowledge immediately.
   - Pros: low latency.
   - Cons: may lose last N ms of writes on crash.
Make this an explicit knob.
### Recovery
On startup:
- Read WAL sequentially.
- Replay operations into in-memory cache.
- Stop at the first corrupt/partial record (based on checksum/length); see the replay sketch below.
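A recovery loop over that framing, reading the log and stopping at the first record that fails the length/checksum check; `ApplyToCache` is a hypothetical callback that decodes a `PUT`/`DEL` payload and applies it:

```cpp
#include <fstream>
#include <functional>
#include <sstream>
#include <string>

// Replay the WAL sequentially into the in-memory cache on startup.
void RecoverFromWal(const std::string& wal_path,
                    const std::function<void(const std::string&)>& ApplyToCache) {
    std::ifstream in(wal_path, std::ios::binary);
    std::stringstream ss;
    ss << in.rdbuf();                                  // read the whole log (fine for a sketch)
    const std::string buf = ss.str();

    size_t offset = 0;
    std::string payload;
    while (ParseRecord(buf, offset, payload)) {        // ParseRecord from the framing sketch
        ApplyToCache(payload);                         // replays each PUT/DEL in log order
    }
    // Bytes after `offset` are a torn tail; a real system would truncate the log here.
}
```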
### Checkpointing / compaction
WAL grows without bound; periodically:
- Write a snapshot file of current cache (or per-shard snapshot).
- Record snapshot “high-water mark” (LSN).
- Truncate WAL before that LSN.
Common pattern:
- Snapshot in background.
- During snapshot, either:
  - briefly pause writes, or
  - copy-on-write / record concurrent writes after snapshot LSN and replay them.
For an interview, a simple approach is acceptable:
- Acquire the shard lock, serialize that shard's map to the snapshot, release; repeat per shard (sketch below).
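A sketch of that simple per-shard checkpoint. `ShardT` is assumed to look like the shard in the sharded-cache sketch (a mutex `m` plus a `cache`), and `ForEach` is an assumed accessor on the per-shard LRU that visits every key/value pair:

```cpp
#include <fstream>
#include <memory>
#include <mutex>
#include <string>
#include <vector>

// Checkpoint one shard at a time: only the shard being serialized is paused.
template <typename ShardT>
void SnapshotShards(std::vector<std::unique_ptr<ShardT>>& shards,
                    const std::string& snapshot_path) {
    std::ofstream out(snapshot_path, std::ios::binary | std::ios::trunc);
    for (auto& shard : shards) {
        std::lock_guard<std::mutex> lk(shard->m);        // brief pause for this shard only
        shard->cache.ForEach([&](const std::string& k, const std::string& v) {
            out << FrameRecord("PUT " + k + " " + v);    // reuse the WAL framing for snapshot entries
        });
    }   // writes to other shards proceed while one shard is serialized
    out.flush();
    // Once the snapshot is durable, record its high-water mark (LSN) and truncate the WAL before it.
}
```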
## 5) Handling eviction with persistence
Decide semantics:
- If WAL is meant to reconstruct **exact in-memory state**, then evictions should also be logged (e.g., `EVICT key`) or treated as deletes.
- If durability is “best effort cache warm start”, you can persist only `Put/Delete` and let the cache re-evict during recovery based on capacity.
A clear, consistent choice:
- Treat eviction as `Delete` and log it (keeps replay deterministic); see the sketch below.
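A hypothetical write path following that choice: victims are collected while holding the shard lock, and their `DEL` records are emitted after the lock is released. `PutReturningEvicted` is an assumed variant of `Put` that reports which keys it evicted; `log_record` is whatever WAL append is in use (synchronous or enqueued):

```cpp
#include <functional>
#include <string>
#include <vector>

// Treat evictions as logged deletes so replay reproduces the in-memory state.
void PutWithLoggedEviction(ShardedCache& cache,
                           const std::function<void(const std::string&)>& log_record,
                           const std::string& key, const std::string& value) {
    log_record(FrameRecord("PUT " + key + " " + value));                        // write-ahead for the Put itself
    std::vector<std::string> evicted = cache.PutReturningEvicted(key, value);   // assumed method
    for (const auto& victim : evicted)
        log_record(FrameRecord("DEL " + victim));                               // evictions logged as deletes
}
```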
## 6) API + pseudocode (illustrative)
### Shard structure
- `mutex m`
- `unordered_map<Key, Node*> map`
- `Node* head, *tail`
- `size_t bytes, max_bytes`
### `Get(key)`
- lock shard
- if not found: unlock, return miss
- move-to-head(node)
- copy value (or return shared_ptr)
- unlock
### `Put(key,value)`
- (optional) append WAL record (sync or enqueue)
- lock shard
- insert/update
- evict while over capacity:
  - victim = tail
  - remove victim from list+map
  - (optional) log delete/evict
- unlock
Key point: don’t hold the shard lock while doing an expensive WAL `fsync` (see the sketch below).
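A sketch of that write path in code. The minimal `Wal` below illustrates the sync mode (`flush()` standing in for `fsync`); it has its own mutex and is never held together with a shard lock, so the shard's critical section stays purely in-memory:

```cpp
#include <fstream>
#include <mutex>
#include <string>

// Minimal synchronous WAL appender; an async variant would enqueue to a flusher thread instead.
class Wal {
    std::ofstream out_;
    std::mutex m_;
public:
    explicit Wal(const std::string& path) : out_(path, std::ios::binary | std::ios::app) {}

    void Append(const std::string& framed_record) {
        std::lock_guard<std::mutex> lk(m_);          // WAL lock only; no shard lock held here
        out_.write(framed_record.data(), static_cast<std::streamsize>(framed_record.size()));
        out_.flush();                                // stand-in for fsync in this sketch
    }
};

// Write path: log first (write-ahead), then do the in-memory update under the shard lock.
void CachePut(ShardedCache& cache, Wal& wal,
              const std::string& key, const std::string& value) {
    wal.Append(FrameRecord("PUT " + key + " " + value));   // framing sketch from the WAL section
    cache.Put(key, value);    // ShardedCache::Put takes the shard lock internally
}
```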
## 7) Testing and edge cases
- Concurrent `Put` and `Get` on same key.
- Crash mid-record: checksum/length prefix should prevent replaying garbage.
- Large values: copying vs reference-counted buffers.
- Hot keys: shard distribution; choose enough shards (e.g., 64/256).
- Clock/time not required for LRU; list order suffices.
## 8) What “scale” means here (single machine)
- **Threads**: sharding reduces contention.
- **Memory**: support max bytes; possibly use slab allocation.
- **Throughput**: batch WAL writes via group commit (sketch below).
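A sketch of group commit for the async-durability mode: request threads enqueue framed records and return immediately, while a single background flusher writes whatever has accumulated and pays one flush per batch:

```cpp
#include <condition_variable>
#include <fstream>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Group commit: many writers enqueue, one flusher batches writes and flushes once per batch.
class GroupCommitWal {
    std::ofstream out_;
    std::mutex m_;
    std::condition_variable cv_;
    std::vector<std::string> pending_;
    bool stop_ = false;
    std::thread flusher_;

    void FlushLoop() {
        std::unique_lock<std::mutex> lk(m_);
        while (!stop_ || !pending_.empty()) {
            cv_.wait(lk, [&] { return stop_ || !pending_.empty(); });
            std::vector<std::string> batch;
            batch.swap(pending_);                       // take everything queued so far
            lk.unlock();
            for (const auto& rec : batch)
                out_.write(rec.data(), static_cast<std::streamsize>(rec.size()));
            out_.flush();                               // one flush/fsync amortized over the batch
            lk.lock();
        }
    }

public:
    explicit GroupCommitWal(const std::string& path)
        : out_(path, std::ios::binary | std::ios::app),
          flusher_([this] { FlushLoop(); }) {}

    ~GroupCommitWal() {
        { std::lock_guard<std::mutex> lk(m_); stop_ = true; }
        cv_.notify_one();
        flusher_.join();                                // drain remaining records on shutdown
    }

    void Append(std::string framed_record) {            // called by request threads; non-blocking
        { std::lock_guard<std::mutex> lk(m_); pending_.push_back(std::move(framed_record)); }
        cv_.notify_one();
    }
};
```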
This covers the expected low-level focus: data structures, WAL durability, lock contention optimization, and deadlock-safe concurrency.