Design a Distributed Key-Value Store
Company: Microsoft
Role: Software Engineer
Category: System Design
Difficulty: medium
Interview Round: Onsite
Design a horizontally scalable distributed key-value store. The system stores opaque values keyed by a string key and supports `get`, `put`, and `delete` on a single key. It must scale across many commodity nodes, stay available when individual nodes fail, and redistribute data smoothly as nodes are added or removed. Focus your design on data partitioning (sharding), replication, rebalancing, and failure handling.
### Constraints & Assumptions
- Billions of keys; values range from a few bytes up to a few megabytes.
- Core operations are single-key `get`, `put`, and `delete`; range scans are out of scope unless asked.
- Target low single-digit-millisecond p99 latency for single-key operations and high availability (think 99.9%+).
- Runs on many commodity nodes spread across racks; multi-datacenter is an extension, not the baseline.
- Workload is read-heavy but with a meaningful write rate; data must be durable.
### Clarifying Questions to Ask
- What consistency model is required: strong consistency, or eventual consistency with read-your-writes? How much staleness is tolerable?
- What is the value-size distribution, and are accesses strictly point lookups or do we need range queries?
- What durability and replication guarantees are required? Single datacenter or multi-DC?
- What are the throughput and latency targets, and the read/write ratio?
- Do we need multi-key atomic operations or transactions, or is single-key atomicity enough?
### Part 1 — Partitioning
How do you map billions of keys across many nodes so that adding or removing a node moves as little data as possible, and how does a client find the node responsible for a key?
```hint Avoid full remaps
Plain `hash(key) % N` remaps almost everything when `N` changes. Use a scheme where membership changes only affect a small slice of the keyspace — consistent hashing.
```
```hint Smooth out the ring
A single hash position per node creates uneven load and lumpy data movement. Give each physical node many positions on the ring (virtual nodes) so load and rebalancing are evenly distributed.
```
#### What This Part Should Cover
```premium-lock What This Part Should Cover
```
### Part 2 — Replication and consistency
Each key is stored on more than one node for durability and availability. How do you replicate, keep replicas consistent, and resolve concurrent or conflicting writes?
```hint Quorums
With `N` replicas, require `R` nodes to confirm a read and `W` to confirm a write. Choosing `R + W > N` guarantees a read overlaps the latest write. Tuning `R` and `W` trades latency against consistency.
```
```hint Reconciling concurrent writes
Decide up front how two concurrent writes to the same key are reconciled: last-writer-wins by timestamp (simple, can lose updates) or version vectors / vector clocks that detect conflicts for the application to merge.
```
#### What This Part Should Cover
```premium-lock What This Part Should Cover
```
### Part 3 — Rebalancing and failure handling
What happens when a node is added, removed, or fails? Cover both transient and permanent failures, and how the cluster heals itself.
```hint Only the neighbors move
On the hash ring, adding or removing a node should only shift data between that node and its ring neighbors — not the whole cluster. Quantify which ranges move.
```
```hint Transient vs permanent
A node that is briefly unreachable is different from one that is gone for good. Use hinted handoff to absorb temporary failures, and a background anti-entropy process to repair permanent ones.
```
#### What This Part Should Cover
```premium-lock What This Part Should Cover
```
### What a Strong Answer Covers
```premium-lock What a Strong Answer Covers
```
### Follow-up Questions
- Walk through a `put` with `N = 3`, `W = 2` while one of the three replicas is down. What happens to the write, and what will a later `R = 2` read observe?
- How do you handle a hotspot where a single key (or a small range) receives a disproportionate share of traffic?
- How would you add multi-key transactions or atomic batches, and what does that cost in latency and availability?
- How would you treat a 1 MB value differently from a 10-byte value to keep latency predictable?
Quick Answer: This system design question evaluates the ability to architect a horizontally scalable distributed key-value store, covering data partitioning, replication, and failure handling. It tests conceptual understanding of consistent hashing, quorum-based consistency, and rebalancing, skills commonly probed in system design interviews to assess architectural reasoning about scalability and fault tolerance.