Design a Crash-Resilient LRU Cache
Company: Anthropic
Role: Software Engineer
Category: System Design
Difficulty: hard
Interview Round: Technical Screen
You have an in-memory LRU cache with fixed capacity $N$ and the standard `get(key)` / `put(key, value)` operations, both $O(1)$ (a hash map plus a doubly linked list ordered most-recently-used to least-recently-used). Today the cache is fully volatile: when the process crashes or restarts, all entries are lost and every subsequent request is a cold miss until the cache warms up again.
You want the cache to **recover useful state after a process crash or restart** — coming back populated with (most of) its prior entries and a usable recency order, instead of starting empty.
Design a persistence strategy for this cache. Concretely, address:
- **What data must be persisted** to reconstruct the cache.
- **How to reconstruct both the key-value entries and the LRU ordering** on restart.
- **Whether to use snapshots, a write-ahead log (WAL), or both**, and why.
- **The trade-off between durability and latency** on the write path.
- **How recovery should work** after a crash, including partial/torn writes.
```hint Reframe the goal first
An LRU cache is normally volatile *by design* — entries are usually re-derivable from a source of truth (DB, upstream, recompute). Decide whether persistence is a **warm-start optimization** or whether the cache is the system of record; that single answer sets how strict durability must be.
```
```hint Two classic primitives
Reach for the database-recovery toolkit: a periodic **snapshot/checkpoint** of full state, and an append-only **write-ahead log** of mutations. Think about what each one bounds — data loss vs. recovery time vs. disk growth — before picking one or combining them.
```
```hint Persist ordering, not pointers
You don't persist the linked-list `prev`/`next` pointers or hash buckets — ordering is fully captured by the **sequence of keys, MRU→LRU**. On recovery you rebuild the list by re-inserting keys in that order.
```
```hint The expensive recency decision
`get` reorders recency, and `get` QPS dominates in a cache. Logging a "touch" on every `get` turns a read-heavy cache into a write-heavy log. Consider whether *exact* eviction order or just *cache warmth* (which keys are present) is what actually matters.
```
```hint Durability knob lives on the flush
The latency you pay is the disk flush. Think about `fsync`-per-write vs. batched/group `fsync` vs. page-cache-only, and what failure each survives — process crash (page cache outlives the process) vs. power loss / kernel panic.
```
### Constraints & Assumptions
- Single-node, single-process cache (the classic interview object); note the distributed extension only at the end.
- Fixed capacity, e.g. $N \approx 10^6$ entries; `get`/`put` must stay $O(1)$ and microsecond-fast on the hot path.
- Illustrative scale for sizing: average value $\approx 1\text{ KB}$, key $\approx 32\text{ B}$, write rate up to $\sim 50\text{k}$ `put`/s, with `get` QPS substantially higher.
- Assume a source of truth exists behind the cache unless you argue otherwise — but state your assumption explicitly, since it changes the durability target.
- Persistence must add minimal steady-state latency to `get`/`put`, and recovery must be bounded (not "replay all history").
### Clarifying Questions to Ask
- **What is the cache backing?** Is there an authoritative source we can repopulate from, or is the cached value the only copy? This decides warm-start-optimization vs. database-grade durability.
- **What is the cost of a cold start / a miss?** A cheap in-memory recompute makes persistence nearly worthless; an expensive cross-region DB call or ML inference makes a warm restart very valuable.
- **How exact must the recovered LRU ordering be?** Byte-identical eviction order, or "roughly warm" — this is the single biggest cost lever.
- **What durability must we survive?** A plain process crash (kernel intact, page cache survives) vs. power loss / kernel panic (requires `fsync`)?
- **What is the scale?** Entry count, value sizes, and write QPS — these size the snapshot and the WAL.
- **Is TTL/expiry supported?** If so, expiries must survive the restart gap.
### What a Strong Answer Covers
- **Reframes durability by use case:** recognizes the cache is usually a performance optimization over a source of truth, so persistence is best-effort warm-start by default, and durability is a configurable knob rather than always database-strict.
- **States what must be persisted:** key→value entries, recency as an MRU→LRU key sequence, and metadata (capacity, format version, a snapshot↔WAL offset, absolute expiries if TTLs exist) — and explicitly *not* the in-memory pointers.
- **Compares snapshot vs. WAL vs. both** on data loss (RPO), recovery time, disk growth, and steady-state cost, and lands on snapshot+WAL with a clear justification.
- **Handles the write path and ordering:** write-ahead ordering (log → maybe flush → apply), and a concrete `fsync` strategy with the durability/latency trade-off articulated.
- **Addresses the `get`-recency cost trap:** chooses approximate recency (don't log touches) by default and explains why warmth beats exact eviction order, with a path to exact order if required.
- **Specifies a correct recovery flow:** load latest valid snapshot, replay only the WAL suffix, tolerate torn-tail records (per-record CRC), enforce capacity, and degrade to a cold start when data is missing.
- **Names failure modes and atomicity:** crash-atomic snapshot replacement (tmp + fsync + rename + dir fsync), bounded WAL via truncation, and keeping the prior snapshot as a fallback.
### Follow-up Questions
- **CPU-bound vs. I/O-bound:** the cache logic is CPU-bound and microsecond-fast while persistence (`fsync`, snapshot writes) is I/O-bound and millisecond-slow. How do you keep them from blocking each other, and where do you draw the async/background boundary?
- How would you take a multi-second, ~1 GB snapshot **without stalling** live `get`/`put` traffic?
- If exact LRU eviction order after recovery were a hard requirement, what would you change, and what would it cost?
- Extend this to a **sharded / replicated** cache: how does per-shard persistence plus WAL replication to a follower change the failover story versus a single-node cold start?
Quick Answer: This question evaluates a candidate's competency in designing a crash-resilient LRU cache, including which data and ordering must be persisted, durability and recovery concerns, and trade-offs between latency and persistence.