System Design: Distributed Key-Value Store
Design a distributed key-value (KV) storage service for a large-scale backend system.
The service should expose a simple interface for clients to:
- Put a value by key
- Get a value by key
- Optionally delete a key
Assume:
- Keys are strings up to 1 KB
- Values are blobs up to 1 MB
- Total data size can reach tens of TB
- Target read and write latency at the 95th percentile is under 10 ms within a region
- The system must be highly available and able to tolerate machine failures
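A back-of-envelope sizing helper can make the data-size assumption concrete. This is only a sketch: the replication factor, the per-node usable capacity, and the 50 TB figure used below are illustrative assumptions, not values given in the prompt, and the function name nodes_needed is hypothetical.

```python
import math


def nodes_needed(logical_tb: float, replication_factor: int,
                 usable_tb_per_node: float) -> int:
    """Return the minimum node count that can hold every replica of the data."""
    raw_tb = logical_tb * replication_factor  # total bytes stored across replicas
    return math.ceil(raw_tb / usable_tb_per_node)


# Illustrative inputs only (assumptions, not part of the prompt):
# 50 TB of logical data, 3 replicas, 2 TB of usable disk per node.
estimate = nodes_needed(50, 3, 2.0)
print(estimate)
```

A real estimate would also leave headroom for compaction, rebalancing, and growth, so the result is best read as a lower bound.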
Requirements
Clarify and then design for the following functional and non-functional requirements:
Functional requirements
- API:
  - put(key, value)
  - get(key) -> value or key-not-found
  - delete(key)
- Support for basic conditional updates is a plus (for example, put-if-absent or compare-and-set).
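The API above could be sketched as a single-process, in-memory class to pin down the expected semantics before distributing it. The class name InMemoryKV, the KeyNotFound exception, the method names put_if_absent and compare_and_set, and the reading of "1 KB" and "1 MB" as 1024 and 1024 * 1024 bytes are all assumptions made for illustration.

```python
import threading
from typing import Dict, Optional


class KeyNotFound(Exception):
    """Raised by get() when the key is absent (hypothetical error type)."""


class InMemoryKV:
    MAX_KEY_BYTES = 1024             # assumed reading of "1 KB"
    MAX_VALUE_BYTES = 1024 * 1024    # assumed reading of "1 MB"

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}
        self._lock = threading.Lock()  # makes conditional updates atomic

    def _check(self, key: str, value: Optional[bytes] = None) -> None:
        if len(key.encode("utf-8")) > self.MAX_KEY_BYTES:
            raise ValueError("key too large")
        if value is not None and len(value) > self.MAX_VALUE_BYTES:
            raise ValueError("value too large")

    def put(self, key: str, value: bytes) -> None:
        self._check(key, value)
        with self._lock:
            self._data[key] = value

    def get(self, key: str) -> bytes:
        self._check(key)
        with self._lock:
            if key not in self._data:
                raise KeyNotFound(key)
            return self._data[key]

    def delete(self, key: str) -> bool:
        """Remove the key; return True if it existed."""
        with self._lock:
            return self._data.pop(key, None) is not None

    def put_if_absent(self, key: str, value: bytes) -> bool:
        """Store the value only if the key is not present."""
        self._check(key, value)
        with self._lock:
            if key in self._data:
                return False
            self._data[key] = value
            return True

    def compare_and_set(self, key: str, expected: Optional[bytes],
                        new: bytes) -> bool:
        """Replace the value only if the current value equals `expected`.

        `expected=None` means "the key must be absent".
        """
        self._check(key, new)
        with self._lock:
            if self._data.get(key) != expected:
                return False
            self._data[key] = new
            return True
```

In a distributed design the lock would be replaced by whatever mechanism serializes writes to a key, for example a per-partition leader or versioned conditional writes.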
Non-functional requirements
- High availability: service should keep working despite node failures
- Horizontal scalability: must handle growth in traffic and data volume by adding machines
- Durability: once acknowledged, writes should not be lost after a single-node failure
- Reasonable consistency: you can choose strong or eventual consistency but must justify the trade-offs
- Low latency for read and write operations
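One common way to back the durability requirement on a single node is a write-ahead log that is forced to disk before the write is acknowledged. The sketch below assumes a simple length-prefixed record format and the hypothetical names WriteAheadLog and replay. It covers recovery after a process crash only; surviving the loss of the whole machine still requires replication to other nodes.

```python
import os
import struct
from typing import Dict


class WriteAheadLog:
    """Append-only log; a write is acknowledged only after fsync returns."""

    def __init__(self, path: str) -> None:
        self._file = open(path, "ab")

    def append(self, key: str, value: bytes) -> None:
        k = key.encode("utf-8")
        # Record layout (assumed): 4-byte key length, 4-byte value length,
        # then the key bytes and the value bytes.
        record = struct.pack(">II", len(k), len(value)) + k + value
        self._file.write(record)
        self._file.flush()                 # push Python's buffer to the OS
        os.fsync(self._file.fileno())      # ask the OS to persist to disk

    def close(self) -> None:
        self._file.close()


def replay(path: str) -> Dict[str, bytes]:
    """Rebuild the in-memory state by reading the log from the start."""
    state: Dict[str, bytes] = {}
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if len(header) < 8:
                break                      # clean end or torn header
            klen, vlen = struct.unpack(">II", header)
            body = f.read(klen + vlen)
            if len(body) < klen + vlen:
                break                      # torn record from a crash mid-write
            state[body[:klen].decode("utf-8")] = body[klen:]
    return state
```

Calling fsync on every write adds latency, so designs with a tight latency target often batch several writes per fsync (group commit) and acknowledge them together.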
What to cover in your design
Provide a system design that addresses:
- Data model and external API
- High-level architecture and main components (client, stateless frontends, storage nodes, metadata service, etc.)
- Data partitioning and placement strategy across nodes (for example, consistent hashing, sharding)
- Replication strategy for fault tolerance (for example, primary-replica, quorum-based replication)
- Consistency model choice and how reads and writes flow through the system
- Storage layer design (in-memory vs on-disk, log-structured storage, write-ahead log, compaction)
- Handling node failures and recovery (re-replication, leader election, rebalancing shards)
- Scaling strategies (adding or removing nodes, rebalancing partitions)
- Optional optimizations such as:
  - Caching strategies
  - Hot key handling
  - Multi-data-center replication and disaster recovery
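As one possible illustration of the partitioning and placement item, the sketch below implements consistent hashing with virtual nodes. The class name HashRing, the choice of SHA-256, the default of 64 virtual nodes per machine, and the default of 3 replicas are assumptions for the example, not requirements of the prompt.

```python
import bisect
import hashlib
from typing import Iterable, List, Tuple


class HashRing:
    """Consistent-hash ring with virtual nodes (illustrative sketch)."""

    def __init__(self, nodes: Iterable[str], vnodes: int = 64) -> None:
        self._vnodes = vnodes
        self._ring: List[Tuple[int, str]] = []  # sorted (hash, node) pairs
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(s: str) -> int:
        digest = hashlib.sha256(s.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add_node(self, node: str) -> None:
        # Each physical node owns several points on the ring to even out load.
        for i in range(self._vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove_node(self, node: str) -> None:
        self._ring = [entry for entry in self._ring if entry[1] != node]

    def replicas_for(self, key: str, n: int = 3) -> List[str]:
        """Walk clockwise from the key's position and return the first
        n distinct physical nodes."""
        if not self._ring:
            return []
        # Index of the first virtual node at or after the key's hash.
        start = bisect.bisect_right(self._ring, (self._hash(key), ""))
        owners: List[str] = []
        for i in range(len(self._ring)):
            node = self._ring[(start + i) % len(self._ring)][1]
            if node not in owners:
                owners.append(node)
                if len(owners) == n:
                    break
        return owners
```

For example, HashRing(["node-a", "node-b", "node-c", "node-d"]).replicas_for("user:42") returns three distinct node names. Removing a node that is not in that list leaves the placement of "user:42" unchanged, which is the property that keeps data movement small when machines join or leave.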
Explain your assumptions and walk through read and write paths in your design. Highlight key trade-offs (for example, consistency vs availability) and how your design meets the stated requirements.
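If a quorum-based design is chosen, the read and write paths and the consistency vs availability trade-off can be illustrated with a small simulation. This is a sketch under stated assumptions: the names Replica and QuorumCoordinator are hypothetical, versions are supplied by the caller rather than generated by the system, the coordinator contacts every live replica, and there is no rollback, read repair, or hinted handoff.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Replica:
    up: bool = True
    # key -> (version, value); a higher version wins
    store: Dict[str, Tuple[int, bytes]] = field(default_factory=dict)


class QuorumCoordinator:
    def __init__(self, replicas: List[Replica], w: int, r: int) -> None:
        if w + r <= len(replicas):
            raise ValueError("need R + W > N so read and write quorums overlap")
        self.replicas = replicas
        self.w = w
        self.r = r

    def put(self, key: str, value: bytes, version: int) -> bool:
        """Send the write to every live replica; succeed if at least W ack.

        A failed write is not rolled back in this sketch.
        """
        acks = 0
        for rep in self.replicas:
            if rep.up:
                current = rep.store.get(key)
                if current is None or current[0] < version:
                    rep.store[key] = (version, value)
                acks += 1
        return acks >= self.w

    def get(self, key: str) -> Optional[bytes]:
        """Read from every live replica; require at least R responses and
        return the value with the highest version, or None if absent."""
        responses = [rep.store.get(key) for rep in self.replicas if rep.up]
        if len(responses) < self.r:
            raise RuntimeError("read quorum unavailable")
        found = [resp for resp in responses if resp is not None]
        if not found:
            return None
        return max(found, key=lambda resp: resp[0])[1]
```

With N = 3, W = 2, and R = 2, the service keeps accepting reads and writes while one replica is down, and a read still sees the latest acknowledged write because every read quorum overlaps every write quorum in at least one replica. Lowering W or R improves latency and availability at the cost of possibly stale reads.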