This question evaluates understanding of durable storage and crash-consistency, including on-disk data layout, filesystem durability semantics, concurrency control, and performance trade-offs for single-node embeddable key–value engines.
Design a durable key-value store that supports put(key, value), get(key), and delete(key). Specify how you will achieve durability with a write-ahead log (WAL), lay out on-disk data structures, and perform crash recovery. Describe compaction/garbage collection, fsync/batching strategy, and handling of torn writes. Explain your concurrency model: single-writer/multi-reader vs. multi-writer, lock-based vs. optimistic/lock-free approaches, and how you ensure correctness under concurrent access. Analyze consistency guarantees (e.g., read-after-write), latency/throughput trade-offs, and back-of-the-envelope resource estimates. Provide a minimal testing plan covering failure injection and recovery.
Quick Answer: This question evaluates understanding of durable storage and crash-consistency, including on-disk data layout, filesystem durability semantics, concurrency control, and performance trade-offs for single-node embeddable key–value engines.
Design a single-node, embeddable key–value store library with a simple API that must remain correct and durable across process crashes and power failures.
This is a single, cohesive system-design problem. Treat the numbered items under What to Cover as the areas your design must address, not as independent sub-problems — they all describe one storage engine.
Environment & Assumptions
A POSIX-like filesystem on
SSD/NVMe
with
4 KiB sectors
.
Single process, embedded as a library (no network/RPC layer to design).
API
put(key, value)
get(key)
delete(key)
Keys and values are
byte arrays
.
No secondary indexes
and
no transactions
beyond single-key operations.
Clarifying Questions to Ask
Before designing, scope the problem with the interviewer:
Durability contract
— must every acknowledged
put
survive a power loss (synchronous fsync-before-ack), or is a small, bounded loss window acceptable for higher throughput?
Workload shape
— is this write-heavy, read-heavy, or mixed? What is the read:write ratio, and is the access pattern point lookups only or do we anticipate range scans later?
Data sizes
— typical and maximum key/value sizes, total dataset size, and whether values can exceed available RAM (forcing on-disk structures vs. an in-memory map).
Concurrency expectations
— single application thread, or multiple threads issuing operations concurrently? Is multi-writer required, or is single-writer/multi-reader sufficient?
Consistency requirement
— is read-after-write (your own writes visible immediately) enough, or is full linearizability across threads expected?
Operational limits
— target throughput/latency SLOs, available disk headroom for compaction, and whether crashes/restarts are frequent enough that recovery time matters.
Constraints & Assumptions
Anchor the design with concrete numbers (state these as your assumptions if the interviewer leaves them open):
Storage:
SSD/NVMe, 4 KiB sector size; sequential writes are far cheaper than random writes.
Scale to size for:
order of
107
–
108
keys, ~16 B keys, ~1 KB values, dataset larger than RAM.
Throughput target:
on the order of tens of thousands of durable
put
/s (e.g. ~50k/s) on a single node.
Durability:
an acknowledged write in SYNC mode must survive process crash
and
power failure; any relaxed mode must make its loss window explicit and bounded.
Single node, single process:
no replication, no network layer, no distributed consensus in scope.
What to Cover
Walk through your design across the following areas. Hints sit under the areas where most candidates get stuck — open them only if you want a nudge.
Durability via a write-ahead log (WAL)
— record/segment format, rotation, and fsync policy.
On-disk data structures and layout
— how data files and metadata are organized.
Crash recovery
— how you detect and recover from incomplete operations.
Compaction / garbage collection
— reclaiming stale versions and tombstones.
fsync and batching strategy
— the throughput vs. latency trade-offs.
Torn writes and partial files
— how you detect and handle them.
Concurrency model
— single-writer/multi-reader vs. multi-writer; lock-based vs. optimistic/lock-free; and how you ensure correctness under concurrent access.
Consistency guarantees
— e.g., read-after-write, and any linearizability options.
Performance analysis
— latency/throughput trade-offs and back-of-the-envelope resource estimates.
Minimal testing plan
— including failure injection and recovery validation.
What a Strong Answer Covers
The interviewer is checking for these dimensions (this is a checklist of signals, not the answers):
Durability discipline
— a clear "nothing acked until durable" contract, the right fsync semantics, and an explicit (bounded) loss window if a relaxed mode is offered.
Crash-safety reasoning
— a crash-point-by-crash-point argument that a consistent state is always recoverable, anchored on atomic file installs and a single authoritative metadata version.
Data-structure fit
— a layout chosen to match SSD economics and the stated workload, with a justification (and awareness of the alternative, e.g. LSM vs. B+ tree, and when each wins).
Correctness under concurrency
— a defensible read path and write path, explicit ordering/sequence-number scheme, and safe memory reclamation; honesty about lock-based vs. lock-free trade-offs.
Quantitative grounding
— back-of-the-envelope throughput, write amplification, and memory estimates that identify the real bottleneck.
Testability
— a failure-injection and recovery-validation plan, not just happy-path unit tests.
Trade-off communication
— recommending a sensible default and naming the escape hatch, rather than over-engineering.
Follow-up Questions
Expect the interviewer to probe deeper after the main design:
Scaling writes:
"Your single-committer design is fsync-bound. How would you scale concurrent writers — and why might sharding the keyspace beat a lock-free multi-writer inside one engine?"
What breaks first at 100×?
"Ingest 100× higher: what saturates first — fsync rate, compaction write amplification, or block-cache misses — and how do you apply backpressure before L0 explodes?"
Range scans:
"The API is point-only today. What changes if we must support ordered range scans? Walk through a merging iterator over memtable + on-disk files and what it costs."
Adversarial durability:
"Prove a deleted key can never resurrect after a crash mid-compaction. Which exact invariant guarantees it?"