How do I approach System Design interview questions?

System Design questions require understanding of core concepts and practice. PracHub provides solutions with explanations to help you master system design interviews.

What difficulty level is this interview question?

This is a hard difficulty System Design question, commonly asked during Technical Screen rounds at Anthropic.

What role is this question designed for?

This question is commonly asked for Software Engineer candidates at Anthropic during technical interviews.

Design a Crash-Resilient LRU Cache | Anthropic Interview Question

Design a Crash-Resilient LRU Cache

Company: Anthropic

Role: Software Engineer

Category: System Design

Difficulty: hard

Interview Round: Technical Screen

You have an in-memory LRU cache with fixed capacity $N$ and the standard `get(key)` / `put(key, value)` operations, both $O(1)$ (a hash map plus a doubly linked list ordered most-recently-used to least-recently-used). Today the cache is fully volatile: when the process crashes or restarts, all entries are lost and every subsequent request is a cold miss until the cache warms up again. You want the cache to **recover useful state after a process crash or restart** — coming back populated with (most of) its prior entries and a usable recency order, instead of starting empty. Design a persistence strategy for this cache. Concretely, address: - **What data must be persisted** to reconstruct the cache. - **How to reconstruct both the key-value entries and the LRU ordering** on restart. - **Whether to use snapshots, a write-ahead log (WAL), or both**, and why. - **The trade-off between durability and latency** on the write path. - **How recovery should work** after a crash, including partial/torn writes. ```hint Reframe the goal first An LRU cache is normally volatile *by design* — entries are usually re-derivable from a source of truth (DB, upstream, recompute). Decide whether persistence is a **warm-start optimization** or whether the cache is the system of record; that single answer sets how strict durability must be. ``` ```hint Two classic primitives Reach for the database-recovery toolkit: a periodic **snapshot/checkpoint** of full state, and an append-only **write-ahead log** of mutations. Think about what each one bounds — data loss vs. recovery time vs. disk growth — before picking one or combining them. ``` ```hint Persist ordering, not pointers You don't persist the linked-list `prev`/`next` pointers or hash buckets — ordering is fully captured by the **sequence of keys, MRU→LRU**. On recovery you rebuild the list by re-inserting keys in that order. ``` ```hint The expensive recency decision `get` reorders recency, and `get` QPS dominates in a cache. Logging a "touch" on every `get` turns a read-heavy cache into a write-heavy log. Consider whether *exact* eviction order or just *cache warmth* (which keys are present) is what actually matters. ``` ```hint Durability knob lives on the flush The latency you pay is the disk flush. Think about `fsync`-per-write vs. batched/group `fsync` vs. page-cache-only, and what failure each survives — process crash (page cache outlives the process) vs. power loss / kernel panic. ``` ### Constraints & Assumptions - Single-node, single-process cache (the classic interview object); note the distributed extension only at the end. - Fixed capacity, e.g. $N \approx 10^6$ entries; `get`/`put` must stay $O(1)$ and microsecond-fast on the hot path. - Illustrative scale for sizing: average value $\approx 1\text{ KB}$, key $\approx 32\text{ B}$, write rate up to $\sim 50\text{k}$ `put`/s, with `get` QPS substantially higher. - Assume a source of truth exists behind the cache unless you argue otherwise — but state your assumption explicitly, since it changes the durability target. - Persistence must add minimal steady-state latency to `get`/`put`, and recovery must be bounded (not "replay all history"). ### Clarifying Questions to Ask - **What is the cache backing?** Is there an authoritative source we can repopulate from, or is the cached value the only copy? This decides warm-start-optimization vs. database-grade durability. - **What is the cost of a cold start / a miss?** A cheap in-memory recompute makes persistence nearly worthless; an expensive cross-region DB call or ML inference makes a warm restart very valuable. - **How exact must the recovered LRU ordering be?** Byte-identical eviction order, or "roughly warm" — this is the single biggest cost lever. - **What durability must we survive?** A plain process crash (kernel intact, page cache survives) vs. power loss / kernel panic (requires `fsync`)? - **What is the scale?** Entry count, value sizes, and write QPS — these size the snapshot and the WAL. - **Is TTL/expiry supported?** If so, expiries must survive the restart gap. ### What a Strong Answer Covers - **Reframes durability by use case:** recognizes the cache is usually a performance optimization over a source of truth, so persistence is best-effort warm-start by default, and durability is a configurable knob rather than always database-strict. - **States what must be persisted:** key→value entries, recency as an MRU→LRU key sequence, and metadata (capacity, format version, a snapshot↔WAL offset, absolute expiries if TTLs exist) — and explicitly *not* the in-memory pointers. - **Compares snapshot vs. WAL vs. both** on data loss (RPO), recovery time, disk growth, and steady-state cost, and lands on snapshot+WAL with a clear justification. - **Handles the write path and ordering:** write-ahead ordering (log → maybe flush → apply), and a concrete `fsync` strategy with the durability/latency trade-off articulated. - **Addresses the `get`-recency cost trap:** chooses approximate recency (don't log touches) by default and explains why warmth beats exact eviction order, with a path to exact order if required. - **Specifies a correct recovery flow:** load latest valid snapshot, replay only the WAL suffix, tolerate torn-tail records (per-record CRC), enforce capacity, and degrade to a cold start when data is missing. - **Names failure modes and atomicity:** crash-atomic snapshot replacement (tmp + fsync + rename + dir fsync), bounded WAL via truncation, and keeping the prior snapshot as a fallback. ### Follow-up Questions - **CPU-bound vs. I/O-bound:** the cache logic is CPU-bound and microsecond-fast while persistence (`fsync`, snapshot writes) is I/O-bound and millisecond-slow. How do you keep them from blocking each other, and where do you draw the async/background boundary? - How would you take a multi-second, ~1 GB snapshot **without stalling** live `get`/`put` traffic? - If exact LRU eviction order after recovery were a hard requirement, what would you change, and what would it cost? - Extend this to a **sharded / replicated** cache: how does per-shard persistence plus WAL replication to a follower change the failover story versus a single-node cold start?

Quick Answer: This question evaluates a candidate's competency in designing a crash-resilient LRU cache, including which data and ordering must be persisted, durability and recovery concerns, and trade-offs between latency and persistence.

You have an in-memory LRU cache with fixed capacity $N$ and the standard get(key) / put(key, value) operations, both $O(1)$ (a hash map plus a doubly linked list ordered most-recently-used to least-recently-used). Today the cache is fully volatile: when the process crashes or restarts, all entries are lost and every subsequent request is a cold miss until the cache warms up again.

You want the cache to recover useful state after a process crash or restart — coming back populated with (most of) its prior entries and a usable recency order, instead of starting empty.

Design a persistence strategy for this cache. Concretely, address:

What data must be persisted to reconstruct the cache.
How to reconstruct both the key-value entries and the LRU ordering on restart.
Whether to use snapshots, a write-ahead log (WAL), or both , and why.
The trade-off between durability and latency on the write path.
How recovery should work after a crash, including partial/torn writes.

Constraints & Assumptions

Single-node, single-process cache (the classic interview object); note the distributed extension only at the end.
Fixed capacity, e.g. $N \approx 10^6$ entries; get / put must stay $O(1)$ and microsecond-fast on the hot path.
Illustrative scale for sizing: average value $\approx 1\text{ KB}$ , key $\approx 32\text{ B}$ , write rate up to $\sim 50\text{k}$ put /s, with get QPS substantially higher.
Assume a source of truth exists behind the cache unless you argue otherwise — but state your assumption explicitly, since it changes the durability target.
Persistence must add minimal steady-state latency to get / put , and recovery must be bounded (not "replay all history").

Clarifying Questions to Ask

What is the cache backing? Is there an authoritative source we can repopulate from, or is the cached value the only copy? This decides warm-start-optimization vs. database-grade durability.
What is the cost of a cold start / a miss? A cheap in-memory recompute makes persistence nearly worthless; an expensive cross-region DB call or ML inference makes a warm restart very valuable.
How exact must the recovered LRU ordering be? Byte-identical eviction order, or "roughly warm" — this is the single biggest cost lever.
What durability must we survive? A plain process crash (kernel intact, page cache survives) vs. power loss / kernel panic (requires fsync )?
What is the scale? Entry count, value sizes, and write QPS — these size the snapshot and the WAL.
Is TTL/expiry supported? If so, expiries must survive the restart gap.

What a Strong Answer Covers

Reframes durability by use case: recognizes the cache is usually a performance optimization over a source of truth, so persistence is best-effort warm-start by default, and durability is a configurable knob rather than always database-strict.
States what must be persisted: key→value entries, recency as an MRU→LRU key sequence, and metadata (capacity, format version, a snapshot↔WAL offset, absolute expiries if TTLs exist) — and explicitly not the in-memory pointers.
Compares snapshot vs. WAL vs. both on data loss (RPO), recovery time, disk growth, and steady-state cost, and lands on snapshot+WAL with a clear justification.
Handles the write path and ordering: write-ahead ordering (log → maybe flush → apply), and a concrete fsync strategy with the durability/latency trade-off articulated.
Addresses the get-recency cost trap: chooses approximate recency (don't log touches) by default and explains why warmth beats exact eviction order, with a path to exact order if required.
Specifies a correct recovery flow: load latest valid snapshot, replay only the WAL suffix, tolerate torn-tail records (per-record CRC), enforce capacity, and degrade to a cold start when data is missing.
Names failure modes and atomicity: crash-atomic snapshot replacement (tmp + fsync + rename + dir fsync), bounded WAL via truncation, and keeping the prior snapshot as a fallback.

Follow-up Questions

CPU-bound vs. I/O-bound: the cache logic is CPU-bound and microsecond-fast while persistence ( fsync , snapshot writes) is I/O-bound and millisecond-slow. How do you keep them from blocking each other, and where do you draw the async/background boundary?
How would you take a multi-second, ~1 GB snapshot without stalling live get / put traffic?
If exact LRU eviction order after recovery were a hard requirement, what would you change, and what would it cost?
Extend this to a sharded / replicated cache: how does per-shard persistence plus WAL replication to a follower change the failover story versus a single-node cold start?

Design a Crash-Resilient LRU Cache

Company: Anthropic

Role: Software Engineer

Category: System Design

Difficulty: hard

Interview Round: Technical Screen

You want the cache to recover useful state after a process crash or restart — coming back populated with (most of) its prior entries and a usable recency order, instead of starting empty.

Design a persistence strategy for this cache. Concretely, address:

What data must be persisted to reconstruct the cache.
How to reconstruct both the key-value entries and the LRU ordering on restart.
Whether to use snapshots, a write-ahead log (WAL), or both , and why.
The trade-off between durability and latency on the write path.
How recovery should work after a crash, including partial/torn writes.

Constraints & Assumptions

Single-node, single-process cache (the classic interview object); note the distributed extension only at the end.
Fixed capacity, e.g. $N \approx 10^6$ entries; get / put must stay $O(1)$ and microsecond-fast on the hot path.
Illustrative scale for sizing: average value $\approx 1\text{ KB}$ , key $\approx 32\text{ B}$ , write rate up to $\sim 50\text{k}$ put /s, with get QPS substantially higher.
Assume a source of truth exists behind the cache unless you argue otherwise — but state your assumption explicitly, since it changes the durability target.
Persistence must add minimal steady-state latency to get / put , and recovery must be bounded (not "replay all history").

Clarifying Questions to Ask

What is the cache backing? Is there an authoritative source we can repopulate from, or is the cached value the only copy? This decides warm-start-optimization vs. database-grade durability.
What is the cost of a cold start / a miss? A cheap in-memory recompute makes persistence nearly worthless; an expensive cross-region DB call or ML inference makes a warm restart very valuable.
How exact must the recovered LRU ordering be? Byte-identical eviction order, or "roughly warm" — this is the single biggest cost lever.
What durability must we survive? A plain process crash (kernel intact, page cache survives) vs. power loss / kernel panic (requires fsync )?
What is the scale? Entry count, value sizes, and write QPS — these size the snapshot and the WAL.
Is TTL/expiry supported? If so, expiries must survive the restart gap.

What a Strong Answer Covers

Reframes durability by use case: recognizes the cache is usually a performance optimization over a source of truth, so persistence is best-effort warm-start by default, and durability is a configurable knob rather than always database-strict.
States what must be persisted: key→value entries, recency as an MRU→LRU key sequence, and metadata (capacity, format version, a snapshot↔WAL offset, absolute expiries if TTLs exist) — and explicitly not the in-memory pointers.
Compares snapshot vs. WAL vs. both on data loss (RPO), recovery time, disk growth, and steady-state cost, and lands on snapshot+WAL with a clear justification.
Handles the write path and ordering: write-ahead ordering (log → maybe flush → apply), and a concrete fsync strategy with the durability/latency trade-off articulated.
Addresses the get-recency cost trap: chooses approximate recency (don't log touches) by default and explains why warmth beats exact eviction order, with a path to exact order if required.
Specifies a correct recovery flow: load latest valid snapshot, replay only the WAL suffix, tolerate torn-tail records (per-record CRC), enforce capacity, and degrade to a cold start when data is missing.
Names failure modes and atomicity: crash-atomic snapshot replacement (tmp + fsync + rename + dir fsync), bounded WAL via truncation, and keeping the prior snapshot as a fallback.

Follow-up Questions

CPU-bound vs. I/O-bound: the cache logic is CPU-bound and microsecond-fast while persistence ( fsync , snapshot writes) is I/O-bound and millisecond-slow. How do you keep them from blocking each other, and where do you draw the async/background boundary?
How would you take a multi-second, ~1 GB snapshot without stalling live get / put traffic?
If exact LRU eviction order after recovery were a hard requirement, what would you change, and what would it cost?
Extend this to a sharded / replicated cache: how does per-shard persistence plus WAL replication to a follower change the failover story versus a single-node cold start?

Design a Crash-Resilient LRU Cache

Quick Overview

Design a Crash-Resilient LRU Cache

Constraints & Assumptions

Clarifying Questions to Ask

What a Strong Answer Covers

Follow-up Questions

Submit Your Answer to Earn 20XP

Design a Crash-Resilient LRU Cache

Quick Overview

Design a Crash-Resilient LRU Cache

Constraints & Assumptions

Clarifying Questions to Ask

What a Strong Answer Covers

Follow-up Questions

Submit Your Answer to Earn 20XP