PracHub
QuestionsCoachesLearningGuidesInterview Prep
|Home/System Design/xAI

Design a Durable Key-Value Cache with File-System Persistence

Last updated: Jul 2, 2026

Quick Overview

This question evaluates a candidate's ability to design durable in-memory data structures and persistence strategies, including reasoning about crash recovery, durability guarantees, and performance trade-offs.

  • medium
  • xAI
  • System Design
  • Software Engineer

Design a Durable Key-Value Cache with File-System Persistence

Company: xAI

Role: Software Engineer

Category: System Design

Difficulty: medium

Interview Round: Technical Screen

Design and implement (at a discussion + light-code level — you will sketch the key functions but not run them) a **durable in-memory key-value cache**. The cache lives in a single process on a single machine and exposes two operations: - `put(key, value)` — store or overwrite the value for a key. - `get(key)` — return the current value for a key (or a miss). **Durable** means: if the cache process crashes or is restarted, a fresh instance must be able to **recover the cached data from persistent storage** (the local file system) during initialization, so that no acknowledged `put` is lost. ### Constraints & Assumptions - Single node, single process; the local file system is the only persistent storage available (no external database). - Reads must stay fast — `get` should be served from memory, not from disk. - A `put` is only considered successful once it will survive a process crash. - The keyspace is small enough that all keys (and, initially, all values) fit in memory. - Crash model: the process can die at any instant, including mid-write; the disk itself is assumed healthy. - Interviewer expectation: sketch the data structures and the `put` / `get` / `init` (recovery) code paths in simple code or pseudocode, then reason about trade-offs out loud. ### Clarifying Questions to Ask - What is the durability requirement per write — must every single `put` survive a crash, or is a small window of loss acceptable? - What are the expected sizes: number of distinct keys, typical value size, and total data volume? - What is the read/write ratio, and how frequent are overwrites of the same key? - Is the cache accessed by a single thread, or must `put`/`get` be safe under concurrency? - How fast must recovery (restart) be — is a full scan of the persisted data on startup acceptable? - Do we need delete/expiry (TTL) semantics, or only `get`/`put`? ### Part 1 Design the core cache and its persistence strategy. Describe the in-memory data structure, what exactly gets written to disk on each `put`, and how `init()` reconstructs the cache after a crash. Sketch the code for `put`, `get`, and the recovery path. ```hint Persistence pattern Think **write-through**: apply the write to the in-memory map and *also* persist it to the file system before acknowledging. The disk copy is the source of truth for recovery; memory is just the fast serving layer. ``` ```hint On-disk layout An **append-only log** of `(key, value)` records is far friendlier to crash-safety and write latency than rewriting a data file in place. Recovery is then "replay the log from the beginning; later records win." ``` #### What This Part Should Cover ```premium-lock What This Part Should Cover ``` ### Part 2 Analyze the trade-offs of your design. In particular: what does write-through cost you on the write path, and what happens to the persisted data over time when keys are overwritten many times — how do you handle a large accumulation of **stale key-value records**? ```hint Latency vs. durability Every durable `put` pays disk I/O — and *true* durability requires flushing (`fsync`), not just a buffered write. Consider what you can offer if the caller accepts group/periodic flushing instead of per-write flushing. ``` ```hint Stale data In an append-only log, an overwritten key's old records are dead weight: they inflate the file and slow recovery. The classic remedy is **compaction** — periodically rewrite a new log containing only the latest record per key, then atomically swap it in. ``` #### What This Part Should Cover ```premium-lock What This Part Should Cover ``` ### Part 3 Now suppose the number of keys is small, but each value is **very large** (e.g., many megabytes). The naive design rewrites or re-appends the whole value on every `put`, and recovery replays huge amounts of data. How would you optimize the design for this workload? ```hint Per-key isolation With few keys and huge values, a **file per key** stops one key's churn from bloating a shared log — and makes "find the latest value of key K" a single-file problem instead of a full-log scan. ``` ```hint Identifying the freshest version If each write of a key appends (or writes a new version file) tagged with a **timestamp or monotonically increasing version number**, then both `get`-after-recovery and cleanup reduce to "take the entry with the highest version; everything older is stale." ``` #### What This Part Should Cover ```premium-lock What This Part Should Cover ``` ### What a Strong Answer Covers ```premium-lock What a Strong Answer Covers ``` ### Follow-up Questions - How would you make `put` and `get` safe under concurrent access from multiple threads, and what does that do to your write-path latency? - If a machine crash can corrupt the tail of a file mid-write, how does your recovery distinguish a torn final record from valid data? - When and how would you trigger compaction (size threshold, stale ratio, background thread), and how do you serve reads and writes *during* compaction? - If this cache had to survive whole-machine loss rather than just process crashes, what would you change (replication, remote storage), and what new consistency questions appear?

Quick Answer: This question evaluates a candidate's ability to design durable in-memory data structures and persistence strategies, including reasoning about crash recovery, durability guarantees, and performance trade-offs.

Related Interview Questions

  • Design and Implement a URL Shortening Service - xAI (hard)
  • Build a One-Pass Data Cleaning Pipeline - xAI (medium)
  • Design a backend for an online checkers game - xAI (medium)
  • Design a multi-level API rate limiter - xAI (easy)
  • Design a follower push-notification system - xAI (hard)
|Home/System Design/xAI

Design a Durable Key-Value Cache with File-System Persistence

xAI logo
xAI
Sep 21, 2025, 12:00 AM
mediumSoftware EngineerTechnical ScreenSystem Design
0
0

Design and implement (at a discussion + light-code level — you will sketch the key functions but not run them) a durable in-memory key-value cache.

The cache lives in a single process on a single machine and exposes two operations:

  • put(key, value) — store or overwrite the value for a key.
  • get(key) — return the current value for a key (or a miss).

Durable means: if the cache process crashes or is restarted, a fresh instance must be able to recover the cached data from persistent storage (the local file system) during initialization, so that no acknowledged put is lost.

Constraints & Assumptions

  • Single node, single process; the local file system is the only persistent storage available (no external database).
  • Reads must stay fast — get should be served from memory, not from disk.
  • A put is only considered successful once it will survive a process crash.
  • The keyspace is small enough that all keys (and, initially, all values) fit in memory.
  • Crash model: the process can die at any instant, including mid-write; the disk itself is assumed healthy.
  • Interviewer expectation: sketch the data structures and the put / get / init (recovery) code paths in simple code or pseudocode, then reason about trade-offs out loud.

Clarifying Questions to Ask

  • What is the durability requirement per write — must every single put survive a crash, or is a small window of loss acceptable?
  • What are the expected sizes: number of distinct keys, typical value size, and total data volume?
  • What is the read/write ratio, and how frequent are overwrites of the same key?
  • Is the cache accessed by a single thread, or must put / get be safe under concurrency?
  • How fast must recovery (restart) be — is a full scan of the persisted data on startup acceptable?
  • Do we need delete/expiry (TTL) semantics, or only get / put ?

Part 1

Design the core cache and its persistence strategy. Describe the in-memory data structure, what exactly gets written to disk on each put, and how init() reconstructs the cache after a crash. Sketch the code for put, get, and the recovery path.

What This Part Should Cover Premium

Part 2

Analyze the trade-offs of your design. In particular: what does write-through cost you on the write path, and what happens to the persisted data over time when keys are overwritten many times — how do you handle a large accumulation of stale key-value records?

What This Part Should Cover Premium

Part 3

Now suppose the number of keys is small, but each value is very large (e.g., many megabytes). The naive design rewrites or re-appends the whole value on every put, and recovery replays huge amounts of data. How would you optimize the design for this workload?

What This Part Should Cover Premium

What a Strong Answer Covers Premium

Follow-up Questions

  • How would you make put and get safe under concurrent access from multiple threads, and what does that do to your write-path latency?
  • If a machine crash can corrupt the tail of a file mid-write, how does your recovery distinguish a torn final record from valid data?
  • When and how would you trigger compaction (size threshold, stale ratio, background thread), and how do you serve reads and writes during compaction?
  • If this cache had to survive whole-machine loss rather than just process crashes, what would you change (replication, remote storage), and what new consistency questions appear?

Submit Your Answer to Earn 20XP

Sign in to leave a comment

Loading comments...

Browse More Questions

More System Design•More xAI•More Software Engineer•xAI Software Engineer•xAI System Design•Software Engineer System Design

Your design canvas — auto-saved

PracHub

Master your tech interviews with 8,000+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • AI Coding Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.