Design a disk-backed KV store under contention
Company: Snowflake
Role: Software Engineer
Category: System Design
Difficulty: easy
Interview Round: Onsite
## Design a Disk-Backed Key–Value Store Under Contention
Design a single-node **key–value store** with the following two defining characteristics:
- **Large values.** Values can be very large (think hundreds of KB to multiple MB each), so the value data must live primarily on **disk** rather than in RAM.
- **High contention.** Many threads/clients call the store concurrently, and the access pattern includes **hot keys** (a small number of keys receiving a disproportionate share of reads and writes). The interviewer cares most about *how you keep this design correct and fast under contention*, including how you would concretely implement `acquire`/`release` of locks (or argue for an alternative).
Your design should specify and justify:
- **APIs:** `Put(key, value)`, `Get(key)`, `Delete(key)`, and optionally `CompareAndSet(key, expectedVersion, value)` (CAS).
- **Durability guarantees:** precisely what is guaranteed once `Put` returns success.
- **Correctness under concurrency:** what a client may observe when many threads operate on the same key simultaneously.
### Constraints & Assumptions
- **Single node** to start; the design should sketch how it would scale out, but the core problem is local concurrency, not distributed consensus.
- **Per-key linearizability** is the target consistency model (a `Get` reflects the most recently acknowledged `Put`/`Delete` for that key) unless you explicitly argue for something weaker.
- **Durability:** once `Put` returns success, the value must survive a process crash (and a power loss, if `fsync` is honored).
- **Workload skew:** assume reads outnumber writes, but a handful of keys are extremely hot for both. Value sizes range from small to multi-MB.
- You control the storage format on local disk (SSD); you may assume POSIX file APIs (`pread`, `pwrite`, `fsync`).
### The Problem
Walk through the full design: on-disk layout and indexing, caching, the read and write paths (including durability), and—most importantly—the concurrency control that keeps it correct and contention-tolerant.
```hint Where to start
Values are large; loading them fully into memory to look up a key would be wasteful. Think about what the *minimal* in-memory structure is that tells you *where* on disk to find a given key's value — and whether that structure and the raw value bytes need the same concurrency protection.
```
```hint Durability
When `Put` returns success, the write must survive a crash. Think about what you need to persist *before* acknowledging, and what the cost of that persistence step is at high write throughput. Is there a way to amortize that cost across concurrent writers?
```
```hint Attacking contention
A single lock around all shared state is the obvious starting point — but it limits you to one operation at a time system-wide. What properties of the keyspace could you exploit to let unrelated keys proceed independently? And once you've done that, what additional techniques address the case where *related* operations on the same key still contend?
```
```hint Acquire/release sketch
Be ready to actually write `acquire(key)` / `release(key)`. Think carefully about which operations must be **atomic with respect to each other** — specifically, what could go wrong if two concurrent writers each read the current state of a key and then independently try to update it? What is the minimum the critical section must include to prevent that?
```
### Clarifying Questions to Ask
- Single node only, or do I need to design for sharding across machines and replication?
- What consistency does the caller expect—per-key linearizability, or is eventual consistency acceptable for higher throughput?
- On `Put` success, must the data survive power loss (real `fsync`), or just a process crash?
- What is the read/write ratio, the value-size distribution, and how skewed is the hot-key access?
- Is there a maximum value size, or must I support arbitrarily large blobs (streaming)?
- Do I need range scans / ordered iteration, or is point access (get by exact key) sufficient?
### What a Strong Answer Covers
- A clear separation of **in-memory index** from **on-disk value storage**, with a concrete record format (length-prefixed, checksummed, versioned, tombstones for deletes).
- A **durability story** that names WAL + an explicit `fsync` policy and explains the crash-recovery path (replay WAL / rebuild index).
- The **read path** using positional reads (`pread`) plus a layered cache strategy (OS page cache + an application cache for hot values).
- A **contention plan** that goes beyond one lock: sharding, lock striping, reader-writer locks, optimistic/versioned CAS, and a specific mechanism for the single-hot-key case.
- A correct **acquire/release locking** sketch with the critical section drawn precisely, and a justified choice of whether the disk append sits inside or outside the lock.
- **Compaction/GC** for reclaiming space from overwritten values without blocking readers (epoch/RCU or refcounting).
- Honest **trade-offs** (latency vs. throughput, simplicity vs. concurrency) and a sketch of how to scale out.
### Follow-up Questions
- A single key is being hammered by thousands of concurrent writers. Striping doesn't help because they all hash to the same stripe—how do you keep that key's throughput high without losing per-key linearizability?
- The background compactor is rewriting a segment while a reader holds an offset into it. How do you guarantee the reader never reads freed or relocated bytes?
- You crash *after* the WAL `fsync` but *before* the value segment append and index update. Walk through exactly how recovery reconstructs correct state.
- How would you support a value larger than RAM (e.g., a 5 GB blob) on both the write and read paths?
Quick Answer: This question evaluates systems-design and concurrency engineering skills, specifically the ability to reason about durable on-disk storage, indexing and caching for large values, and contention-tolerant concurrency control for hot keys.