How do I approach System Design interview questions?

System Design questions require understanding of core concepts and practice. PracHub provides solutions with explanations to help you master system design interviews.

What difficulty level is this interview question?

This is a medium difficulty System Design question, commonly asked during Technical Screen rounds at xAI.

What role is this question designed for?

This question is commonly asked for Software Engineer candidates at xAI during technical interviews.

Design a Durable Key-Value Cache with File-System Persistence

Company: xAI

Role: Software Engineer

Category: System Design

Difficulty: medium

Interview Round: Technical Screen

Design and implement (at a discussion + light-code level — you will sketch the key functions but not run them) a **durable in-memory key-value cache**. The cache lives in a single process on a single machine and exposes two operations: - `put(key, value)` — store or overwrite the value for a key. - `get(key)` — return the current value for a key (or a miss). **Durable** means: if the cache process crashes or is restarted, a fresh instance must be able to **recover the cached data from persistent storage** (the local file system) during initialization, so that no acknowledged `put` is lost. ### Constraints & Assumptions - Single node, single process; the local file system is the only persistent storage available (no external database). - Reads must stay fast — `get` should be served from memory, not from disk. - A `put` is only considered successful once it will survive a process crash. - The keyspace is small enough that all keys (and, initially, all values) fit in memory. - Crash model: the process can die at any instant, including mid-write; the disk itself is assumed healthy. - Interviewer expectation: sketch the data structures and the `put` / `get` / `init` (recovery) code paths in simple code or pseudocode, then reason about trade-offs out loud. ### Clarifying Questions to Ask - What is the durability requirement per write — must every single `put` survive a crash, or is a small window of loss acceptable? - What are the expected sizes: number of distinct keys, typical value size, and total data volume? - What is the read/write ratio, and how frequent are overwrites of the same key? - Is the cache accessed by a single thread, or must `put`/`get` be safe under concurrency? - How fast must recovery (restart) be — is a full scan of the persisted data on startup acceptable? - Do we need delete/expiry (TTL) semantics, or only `get`/`put`? ### Part 1 Design the core cache and its persistence strategy. Describe the in-memory data structure, what exactly gets written to disk on each `put`, and how `init()` reconstructs the cache after a crash. Sketch the code for `put`, `get`, and the recovery path. ```hint Persistence pattern Think **write-through**: apply the write to the in-memory map and *also* persist it to the file system before acknowledging. The disk copy is the source of truth for recovery; memory is just the fast serving layer. ``` ```hint On-disk layout An **append-only log** of `(key, value)` records is far friendlier to crash-safety and write latency than rewriting a data file in place. Recovery is then "replay the log from the beginning; later records win." ``` #### What This Part Should Cover ```premium-lock What This Part Should Cover ``` ### Part 2 Analyze the trade-offs of your design. In particular: what does write-through cost you on the write path, and what happens to the persisted data over time when keys are overwritten many times — how do you handle a large accumulation of **stale key-value records**? ```hint Latency vs. durability Every durable `put` pays disk I/O — and *true* durability requires flushing (`fsync`), not just a buffered write. Consider what you can offer if the caller accepts group/periodic flushing instead of per-write flushing. ``` ```hint Stale data In an append-only log, an overwritten key's old records are dead weight: they inflate the file and slow recovery. The classic remedy is **compaction** — periodically rewrite a new log containing only the latest record per key, then atomically swap it in. ``` #### What This Part Should Cover ```premium-lock What This Part Should Cover ``` ### Part 3 Now suppose the number of keys is small, but each value is **very large** (e.g., many megabytes). The naive design rewrites or re-appends the whole value on every `put`, and recovery replays huge amounts of data. How would you optimize the design for this workload? ```hint Per-key isolation With few keys and huge values, a **file per key** stops one key's churn from bloating a shared log — and makes "find the latest value of key K" a single-file problem instead of a full-log scan. ``` ```hint Identifying the freshest version If each write of a key appends (or writes a new version file) tagged with a **timestamp or monotonically increasing version number**, then both `get`-after-recovery and cleanup reduce to "take the entry with the highest version; everything older is stale." ``` #### What This Part Should Cover ```premium-lock What This Part Should Cover ``` ### What a Strong Answer Covers ```premium-lock What a Strong Answer Covers ``` ### Follow-up Questions - How would you make `put` and `get` safe under concurrent access from multiple threads, and what does that do to your write-path latency? - If a machine crash can corrupt the tail of a file mid-write, how does your recovery distinguish a torn final record from valid data? - When and how would you trigger compaction (size threshold, stale ratio, background thread), and how do you serve reads and writes *during* compaction? - If this cache had to survive whole-machine loss rather than just process crashes, what would you change (replication, remote storage), and what new consistency questions appear?

Quick Answer: This question evaluates a candidate's ability to design durable in-memory data structures and persistence strategies, including reasoning about crash recovery, durability guarantees, and performance trade-offs.

Design and implement (at a discussion + light-code level — you will sketch the key functions but not run them) a durable in-memory key-value cache.

The cache lives in a single process on a single machine and exposes two operations:

put(key, value) — store or overwrite the value for a key.
get(key) — return the current value for a key (or a miss).

Durable means: if the cache process crashes or is restarted, a fresh instance must be able to recover the cached data from persistent storage (the local file system) during initialization, so that no acknowledged put is lost.

Constraints & Assumptions

Single node, single process; the local file system is the only persistent storage available (no external database).
Reads must stay fast — get should be served from memory, not from disk.
A put is only considered successful once it will survive a process crash.
The keyspace is small enough that all keys (and, initially, all values) fit in memory.
Crash model: the process can die at any instant, including mid-write; the disk itself is assumed healthy.
Interviewer expectation: sketch the data structures and the put / get / init (recovery) code paths in simple code or pseudocode, then reason about trade-offs out loud.

Clarifying Questions to Ask

What is the durability requirement per write — must every single put survive a crash, or is a small window of loss acceptable?
What are the expected sizes: number of distinct keys, typical value size, and total data volume?
What is the read/write ratio, and how frequent are overwrites of the same key?
Is the cache accessed by a single thread, or must put / get be safe under concurrency?
How fast must recovery (restart) be — is a full scan of the persisted data on startup acceptable?
Do we need delete/expiry (TTL) semantics, or only get / put ?

Part 1

Design the core cache and its persistence strategy. Describe the in-memory data structure, what exactly gets written to disk on each put, and how init() reconstructs the cache after a crash. Sketch the code for put, get, and the recovery path.

What This Part Should Cover Premium

Part 2

Analyze the trade-offs of your design. In particular: what does write-through cost you on the write path, and what happens to the persisted data over time when keys are overwritten many times — how do you handle a large accumulation of stale key-value records?

What This Part Should Cover Premium

Part 3

Now suppose the number of keys is small, but each value is very large (e.g., many megabytes). The naive design rewrites or re-appends the whole value on every put, and recovery replays huge amounts of data. How would you optimize the design for this workload?

What This Part Should Cover Premium

What a Strong Answer Covers Premium

Follow-up Questions

How would you make put and get safe under concurrent access from multiple threads, and what does that do to your write-path latency?
If a machine crash can corrupt the tail of a file mid-write, how does your recovery distinguish a torn final record from valid data?
When and how would you trigger compaction (size threshold, stale ratio, background thread), and how do you serve reads and writes during compaction?
If this cache had to survive whole-machine loss rather than just process crashes, what would you change (replication, remote storage), and what new consistency questions appear?

Design a Durable Key-Value Cache with File-System Persistence

Company: xAI

Role: Software Engineer

Category: System Design

Difficulty: medium

Interview Round: Technical Screen

Design and implement (at a discussion + light-code level — you will sketch the key functions but not run them) a durable in-memory key-value cache.

The cache lives in a single process on a single machine and exposes two operations:

put(key, value) — store or overwrite the value for a key.
get(key) — return the current value for a key (or a miss).

Constraints & Assumptions

Single node, single process; the local file system is the only persistent storage available (no external database).
Reads must stay fast — get should be served from memory, not from disk.
A put is only considered successful once it will survive a process crash.
The keyspace is small enough that all keys (and, initially, all values) fit in memory.
Crash model: the process can die at any instant, including mid-write; the disk itself is assumed healthy.
Interviewer expectation: sketch the data structures and the put / get / init (recovery) code paths in simple code or pseudocode, then reason about trade-offs out loud.

Clarifying Questions to Ask

What is the durability requirement per write — must every single put survive a crash, or is a small window of loss acceptable?
What are the expected sizes: number of distinct keys, typical value size, and total data volume?
What is the read/write ratio, and how frequent are overwrites of the same key?
Is the cache accessed by a single thread, or must put / get be safe under concurrency?
How fast must recovery (restart) be — is a full scan of the persisted data on startup acceptable?
Do we need delete/expiry (TTL) semantics, or only get / put ?

Part 1

What This Part Should Cover Premium

Part 2

What This Part Should Cover Premium

Part 3

What This Part Should Cover Premium

What a Strong Answer Covers Premium

Follow-up Questions

How would you make put and get safe under concurrent access from multiple threads, and what does that do to your write-path latency?
If a machine crash can corrupt the tail of a file mid-write, how does your recovery distinguish a torn final record from valid data?
When and how would you trigger compaction (size threshold, stale ratio, background thread), and how do you serve reads and writes during compaction?
If this cache had to survive whole-machine loss rather than just process crashes, what would you change (replication, remote storage), and what new consistency questions appear?

Design a Durable Key-Value Cache with File-System Persistence

Quick Overview

Design a Durable Key-Value Cache with File-System Persistence

Constraints & Assumptions

Clarifying Questions to Ask

Part 1

What This Part Should Cover Premium

Part 2

What This Part Should Cover Premium

Part 3

What This Part Should Cover Premium

What a Strong Answer Covers Premium

Follow-up Questions

Submit Your Answer to Earn 20XP

Design a Durable Key-Value Cache with File-System Persistence

Quick Overview

Design a Durable Key-Value Cache with File-System Persistence

Constraints & Assumptions

Clarifying Questions to Ask

Part 1

What This Part Should Cover Premium

Part 2

What This Part Should Cover Premium

Part 3

What This Part Should Cover Premium

What a Strong Answer Covers Premium

Follow-up Questions

Submit Your Answer to Earn 20XP