PracHub
QuestionsCoachesLearningGuidesInterview Prep
|Home/System Design/Anthropic

Design a Crash-Resilient LRU Cache

Last updated: Jun 24, 2026

Quick Overview

This question evaluates a candidate's competency in designing a crash-resilient LRU cache, including which data and ordering must be persisted, durability and recovery concerns, and trade-offs between latency and persistence.

  • hard
  • Anthropic
  • System Design
  • Software Engineer

Design a Crash-Resilient LRU Cache

Company: Anthropic

Role: Software Engineer

Category: System Design

Difficulty: hard

Interview Round: Technical Screen

You have an in-memory LRU cache with fixed capacity $N$ and the standard `get(key)` / `put(key, value)` operations, both $O(1)$ (a hash map plus a doubly linked list ordered most-recently-used to least-recently-used). Today the cache is fully volatile: when the process crashes or restarts, all entries are lost and every subsequent request is a cold miss until the cache warms up again. You want the cache to **recover useful state after a process crash or restart** — coming back populated with (most of) its prior entries and a usable recency order, instead of starting empty. Design a persistence strategy for this cache. Concretely, address: - **What data must be persisted** to reconstruct the cache. - **How to reconstruct both the key-value entries and the LRU ordering** on restart. - **Whether to use snapshots, a write-ahead log (WAL), or both**, and why. - **The trade-off between durability and latency** on the write path. - **How recovery should work** after a crash, including partial/torn writes. ```hint Reframe the goal first An LRU cache is normally volatile *by design* — entries are usually re-derivable from a source of truth (DB, upstream, recompute). Decide whether persistence is a **warm-start optimization** or whether the cache is the system of record; that single answer sets how strict durability must be. ``` ```hint Two classic primitives Reach for the database-recovery toolkit: a periodic **snapshot/checkpoint** of full state, and an append-only **write-ahead log** of mutations. Think about what each one bounds — data loss vs. recovery time vs. disk growth — before picking one or combining them. ``` ```hint Persist ordering, not pointers You don't persist the linked-list `prev`/`next` pointers or hash buckets — ordering is fully captured by the **sequence of keys, MRU→LRU**. On recovery you rebuild the list by re-inserting keys in that order. ``` ```hint The expensive recency decision `get` reorders recency, and `get` QPS dominates in a cache. Logging a "touch" on every `get` turns a read-heavy cache into a write-heavy log. Consider whether *exact* eviction order or just *cache warmth* (which keys are present) is what actually matters. ``` ```hint Durability knob lives on the flush The latency you pay is the disk flush. Think about `fsync`-per-write vs. batched/group `fsync` vs. page-cache-only, and what failure each survives — process crash (page cache outlives the process) vs. power loss / kernel panic. ``` ### Constraints & Assumptions - Single-node, single-process cache (the classic interview object); note the distributed extension only at the end. - Fixed capacity, e.g. $N \approx 10^6$ entries; `get`/`put` must stay $O(1)$ and microsecond-fast on the hot path. - Illustrative scale for sizing: average value $\approx 1\text{ KB}$, key $\approx 32\text{ B}$, write rate up to $\sim 50\text{k}$ `put`/s, with `get` QPS substantially higher. - Assume a source of truth exists behind the cache unless you argue otherwise — but state your assumption explicitly, since it changes the durability target. - Persistence must add minimal steady-state latency to `get`/`put`, and recovery must be bounded (not "replay all history"). ### Clarifying Questions to Ask - **What is the cache backing?** Is there an authoritative source we can repopulate from, or is the cached value the only copy? This decides warm-start-optimization vs. database-grade durability. - **What is the cost of a cold start / a miss?** A cheap in-memory recompute makes persistence nearly worthless; an expensive cross-region DB call or ML inference makes a warm restart very valuable. - **How exact must the recovered LRU ordering be?** Byte-identical eviction order, or "roughly warm" — this is the single biggest cost lever. - **What durability must we survive?** A plain process crash (kernel intact, page cache survives) vs. power loss / kernel panic (requires `fsync`)? - **What is the scale?** Entry count, value sizes, and write QPS — these size the snapshot and the WAL. - **Is TTL/expiry supported?** If so, expiries must survive the restart gap. ### What a Strong Answer Covers - **Reframes durability by use case:** recognizes the cache is usually a performance optimization over a source of truth, so persistence is best-effort warm-start by default, and durability is a configurable knob rather than always database-strict. - **States what must be persisted:** key→value entries, recency as an MRU→LRU key sequence, and metadata (capacity, format version, a snapshot↔WAL offset, absolute expiries if TTLs exist) — and explicitly *not* the in-memory pointers. - **Compares snapshot vs. WAL vs. both** on data loss (RPO), recovery time, disk growth, and steady-state cost, and lands on snapshot+WAL with a clear justification. - **Handles the write path and ordering:** write-ahead ordering (log → maybe flush → apply), and a concrete `fsync` strategy with the durability/latency trade-off articulated. - **Addresses the `get`-recency cost trap:** chooses approximate recency (don't log touches) by default and explains why warmth beats exact eviction order, with a path to exact order if required. - **Specifies a correct recovery flow:** load latest valid snapshot, replay only the WAL suffix, tolerate torn-tail records (per-record CRC), enforce capacity, and degrade to a cold start when data is missing. - **Names failure modes and atomicity:** crash-atomic snapshot replacement (tmp + fsync + rename + dir fsync), bounded WAL via truncation, and keeping the prior snapshot as a fallback. ### Follow-up Questions - **CPU-bound vs. I/O-bound:** the cache logic is CPU-bound and microsecond-fast while persistence (`fsync`, snapshot writes) is I/O-bound and millisecond-slow. How do you keep them from blocking each other, and where do you draw the async/background boundary? - How would you take a multi-second, ~1 GB snapshot **without stalling** live `get`/`put` traffic? - If exact LRU eviction order after recovery were a hard requirement, what would you change, and what would it cost? - Extend this to a **sharded / replicated** cache: how does per-shard persistence plus WAL replication to a follower change the failover story versus a single-node cold start?

Quick Answer: This question evaluates a candidate's competency in designing a crash-resilient LRU cache, including which data and ordering must be persisted, durability and recovery concerns, and trade-offs between latency and persistence.

Related Interview Questions

  • Design a One-on-One Chat Service - Anthropic (medium)
  • Design a prompt playground - Anthropic (hard)
  • Scale Duplicate File Detection - Anthropic (medium)
  • Design a one-to-one chat system - Anthropic (medium)
  • Design One-to-One Chat - Anthropic (medium)
|Home/System Design/Anthropic

Design a Crash-Resilient LRU Cache

Anthropic logo
Anthropic
Jan 6, 2026, 12:00 AM
hardSoftware EngineerTechnical ScreenSystem Design
74
0

You have an in-memory LRU cache with fixed capacity NNN and the standard get(key) / put(key, value) operations, both O(1)O(1)O(1) (a hash map plus a doubly linked list ordered most-recently-used to least-recently-used). Today the cache is fully volatile: when the process crashes or restarts, all entries are lost and every subsequent request is a cold miss until the cache warms up again.

You want the cache to recover useful state after a process crash or restart — coming back populated with (most of) its prior entries and a usable recency order, instead of starting empty.

Design a persistence strategy for this cache. Concretely, address:

  • What data must be persisted to reconstruct the cache.
  • How to reconstruct both the key-value entries and the LRU ordering on restart.
  • Whether to use snapshots, a write-ahead log (WAL), or both , and why.
  • The trade-off between durability and latency on the write path.
  • How recovery should work after a crash, including partial/torn writes.

Constraints & Assumptions

  • Single-node, single-process cache (the classic interview object); note the distributed extension only at the end.
  • Fixed capacity, e.g. N≈106N \approx 10^6N≈106 entries; get / put must stay O(1)O(1)O(1) and microsecond-fast on the hot path.
  • Illustrative scale for sizing: average value ≈1 KB\approx 1\text{ KB}≈1 KB , key ≈32 B\approx 32\text{ B}≈32 B , write rate up to ∼50k\sim 50\text{k}∼50k put /s, with get QPS substantially higher.
  • Assume a source of truth exists behind the cache unless you argue otherwise — but state your assumption explicitly, since it changes the durability target.
  • Persistence must add minimal steady-state latency to get / put , and recovery must be bounded (not "replay all history").

Clarifying Questions to Ask

  • What is the cache backing? Is there an authoritative source we can repopulate from, or is the cached value the only copy? This decides warm-start-optimization vs. database-grade durability.
  • What is the cost of a cold start / a miss? A cheap in-memory recompute makes persistence nearly worthless; an expensive cross-region DB call or ML inference makes a warm restart very valuable.
  • How exact must the recovered LRU ordering be? Byte-identical eviction order, or "roughly warm" — this is the single biggest cost lever.
  • What durability must we survive? A plain process crash (kernel intact, page cache survives) vs. power loss / kernel panic (requires fsync )?
  • What is the scale? Entry count, value sizes, and write QPS — these size the snapshot and the WAL.
  • Is TTL/expiry supported? If so, expiries must survive the restart gap.

What a Strong Answer Covers

  • Reframes durability by use case: recognizes the cache is usually a performance optimization over a source of truth, so persistence is best-effort warm-start by default, and durability is a configurable knob rather than always database-strict.
  • States what must be persisted: key→value entries, recency as an MRU→LRU key sequence, and metadata (capacity, format version, a snapshot↔WAL offset, absolute expiries if TTLs exist) — and explicitly not the in-memory pointers.
  • Compares snapshot vs. WAL vs. both on data loss (RPO), recovery time, disk growth, and steady-state cost, and lands on snapshot+WAL with a clear justification.
  • Handles the write path and ordering: write-ahead ordering (log → maybe flush → apply), and a concrete fsync strategy with the durability/latency trade-off articulated.
  • Addresses the get-recency cost trap: chooses approximate recency (don't log touches) by default and explains why warmth beats exact eviction order, with a path to exact order if required.
  • Specifies a correct recovery flow: load latest valid snapshot, replay only the WAL suffix, tolerate torn-tail records (per-record CRC), enforce capacity, and degrade to a cold start when data is missing.
  • Names failure modes and atomicity: crash-atomic snapshot replacement (tmp + fsync + rename + dir fsync), bounded WAL via truncation, and keeping the prior snapshot as a fallback.

Follow-up Questions

  • CPU-bound vs. I/O-bound: the cache logic is CPU-bound and microsecond-fast while persistence ( fsync , snapshot writes) is I/O-bound and millisecond-slow. How do you keep them from blocking each other, and where do you draw the async/background boundary?
  • How would you take a multi-second, ~1 GB snapshot without stalling live get / put traffic?
  • If exact LRU eviction order after recovery were a hard requirement, what would you change, and what would it cost?
  • Extend this to a sharded / replicated cache: how does per-shard persistence plus WAL replication to a follower change the failover story versus a single-node cold start?

Submit Your Answer to Earn 20XP

Sign in to leave a comment

Loading comments...

Browse More Questions

More System Design•More Anthropic•More Software Engineer•Anthropic Software Engineer•Anthropic System Design•Software Engineer System Design

Your design canvas — auto-saved

PracHub

Master your tech interviews with 8,000+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • AI Coding Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.