Design a Distributed Key–Value Store (Technical Screen)
Context
You're designing a cloud-native, multi-tenant key–value (KV) storage service for internal ML/analytics platforms. The service must support billions of keys with low-latency reads/writes and be highly available across availability zones (AZs). Some workloads require conditional updates (CAS) and per-key read-after-write consistency; others occasionally need range scans (e.g., prefix scans for model features).
Functional Requirements
- API supports: Get, Put/Upsert, Delete, Batch, Compare-And-Set (CAS), per-key TTL, and optional range scans (an interface sketch follows this list).
- Data model: binary values; metadata includes TTL, version, and optional attributes.
- Per-key read-after-write consistency.
- Optional range scans (prefix or ordered key ranges).
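As a starting point, a candidate might sketch the API surface as a Go interface like the one below. All names, types, and the `Consistency` knob are illustrative assumptions, not a prescribed API; the point is to make error semantics (e.g., CAS version mismatch vs. not-found) and per-request consistency explicit.

```go
package kv

import (
	"context"
	"time"
)

// Consistency lets a caller trade latency for stronger guarantees per request.
type Consistency int

const (
	Eventual       Consistency = iota
	ReadAfterWrite             // per-key read-after-write
)

// Item is the unit of storage: an opaque value plus metadata.
type Item struct {
	Key     []byte
	Value   []byte
	Version uint64        // monotonically increasing per key; used by CAS
	TTL     time.Duration // zero means no expiry
}

// Store is the client-facing surface. Errors should be typed so callers can
// distinguish "not found" from "version conflict" from transient failures.
type Store interface {
	Get(ctx context.Context, key []byte, c Consistency) (*Item, error)
	Put(ctx context.Context, item Item) (uint64, error) // returns the new version
	Delete(ctx context.Context, key []byte) error
	// CAS writes only if the stored version equals expectedVersion.
	CAS(ctx context.Context, item Item, expectedVersion uint64) (uint64, error)
	Batch(ctx context.Context, items []Item) error
	// Scan returns up to limit items whose keys begin with prefix, in key order.
	Scan(ctx context.Context, prefix []byte, limit int) ([]Item, error)
}
```

Returning the new version from Put and CAS gives clients the token they need for subsequent conditional updates.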
Non-Functional Requirements
- High availability across AZs.
- Horizontal scalability to billions of keys.
- Low latency: read/write p99 within tight SLOs (assume single-digit milliseconds within an AZ on cache hits; low tens of milliseconds on stable-storage access).
- Durability.
- Hot-key mitigation.
- Observability (metrics, tracing, alerts).
Design Tasks
- Define the API and data model, including error semantics and consistency options.
- Choose a sharding strategy: consistent hashing vs. range partitioning; justify how range scans are supported (see the ring sketch after this list).
- Choose a replication model: leader–follower vs. leaderless. Define read/write paths, quorum choices, and conflict resolution (see the quorum-read sketch below).
- Select on-disk structures (e.g., LSM vs. B+Tree), compaction strategy, indexing, TTL handling, and caching (see the TTL sketch below).
- Explain hot-key mitigation strategies (see the key-salting sketch below).
- Explain rebalancing, failure detection, recovery, backups, and disaster recovery (see the heartbeat-detector sketch below).
- Provide an observability plan (metrics, tracing, alerting).
- Discuss CAP trade-offs and tunable consistency.
- Outline a testing methodology for correctness, performance, and resilience.
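For the sharding task, a minimal consistent-hash ring with virtual nodes is sketched below; the names and the FNV hash choice are assumptions. Virtual nodes smooth load across the ring, and adding or removing a node remaps only roughly 1/N of the key space, which keeps rebalancing cheap.

```go
package shard

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// Ring is a consistent-hash ring with virtual nodes. Illustrative only.
type Ring struct {
	points []uint32          // sorted hash points on the ring
	owner  map[uint32]string // hash point -> node ID
}

func hashKey(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// NewRing places vnodes virtual points per node so load spreads evenly.
func NewRing(nodes []string, vnodes int) *Ring {
	r := &Ring{owner: make(map[uint32]string)}
	for _, n := range nodes {
		for v := 0; v < vnodes; v++ {
			p := hashKey(fmt.Sprintf("%s#%d", n, v))
			r.points = append(r.points, p)
			r.owner[p] = n
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// Owner returns the first virtual point clockwise from the key's hash.
func (r *Ring) Owner(key string) string {
	h := hashKey(key)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around the ring
	}
	return r.owner[r.points[i]]
}
```

The trade-off to draw out: hashed placement destroys key order, so range scans become scatter-gather across all shards; range partitioning keeps scans local to a few partitions but requires explicit splitting of hot ranges.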
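For the replication task, a leaderless quorum read is sketched below under the standard condition R + W > N; for example, with N = 3 and R = W = 2, every read set intersects every write set in at least one replica. The `Replica` interface and version-based conflict resolution (highest version wins) are assumptions for illustration; real systems may instead use vector clocks or last-write-wins timestamps, and would add read repair.

```go
package quorum

import (
	"context"
	"errors"
)

// Versioned pairs a value with its version for conflict resolution.
type Versioned struct {
	Value   []byte
	Version uint64
}

// Replica is an assumed interface over one storage node.
type Replica interface {
	Get(ctx context.Context, key []byte) (Versioned, error)
}

// QuorumGet queries all replicas, waits for r successes, and returns the
// highest-versioned value seen. Read repair of stale replicas is omitted.
func QuorumGet(ctx context.Context, replicas []Replica, key []byte, r int) (Versioned, error) {
	type res struct {
		v   Versioned
		err error
	}
	ch := make(chan res, len(replicas))
	for _, rep := range replicas {
		go func(rep Replica) {
			v, err := rep.Get(ctx, key)
			ch <- res{v, err}
		}(rep)
	}
	var best Versioned
	ok := 0
	for i := 0; i < len(replicas); i++ {
		out := <-ch
		if out.err != nil {
			continue
		}
		ok++
		if out.v.Version > best.Version {
			best = out.v
		}
		if ok >= r {
			return best, nil // R + W > N guarantees overlap with the last write
		}
	}
	return Versioned{}, errors.New("quorum not reached")
}
```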
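For TTL handling on an LSM-style engine, a common pattern is lazy expiry: persist an absolute expiry timestamp with each record, filter expired entries on the read path, and reclaim the space during compaction. A minimal sketch, with assumed field and function names:

```go
package ttl

import "time"

// Record carries an absolute expiry; the zero time means no TTL.
type Record struct {
	Value     []byte
	ExpiresAt time.Time
}

// Live reports whether the record should still be visible to readers.
func (r Record) Live(now time.Time) bool {
	return r.ExpiresAt.IsZero() || now.Before(r.ExpiresAt)
}

// Compact filters records in place, keeping only live ones. A real LSM
// compaction would also merge duplicate keys and drop tombstones that are
// older than the garbage-collection horizon.
func Compact(in []Record, now time.Time) []Record {
	out := in[:0]
	for _, rec := range in {
		if rec.Live(now) {
			out = append(out, rec)
		}
	}
	return out
}
```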
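For hot keys, one standard mitigation is salting: split a hot key into a fixed number of sub-keys so traffic fans out across partitions. This suits commutative updates such as counters; read-heavy hot keys are usually better served by client/edge caching and request coalescing. The key format and shard count below are assumptions:

```go
package hotkey

import (
	"fmt"
	"math/rand"
)

const shardCount = 16

// WriteKey spreads writes for a hot key across shardCount partitions by
// appending a random salt; works when updates are commutative (counters).
func WriteKey(key string) string {
	return fmt.Sprintf("%s#%d", key, rand.Intn(shardCount))
}

// ReadKeys returns every salted sub-key; the caller fetches all of them
// and aggregates (e.g., sums shard counters) to reconstruct the value.
func ReadKeys(key string) []string {
	keys := make([]string, shardCount)
	for i := range keys {
		keys[i] = fmt.Sprintf("%s#%d", key, i)
	}
	return keys
}
```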
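For failure detection, a simple heartbeat timeout is sketched below; the type and method names are assumptions, and a phi-accrual detector is a common production refinement. Suspected nodes would trigger re-replication of their ranges and, after a grace period, rebalancing.

```go
package detect

import (
	"sync"
	"time"
)

// Detector marks a node as suspected once it has missed heartbeats
// for longer than the configured timeout.
type Detector struct {
	mu       sync.Mutex
	lastSeen map[string]time.Time
	timeout  time.Duration
}

func New(timeout time.Duration) *Detector {
	return &Detector{lastSeen: make(map[string]time.Time), timeout: timeout}
}

// Heartbeat records that the node was alive at the current time.
func (d *Detector) Heartbeat(node string) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.lastSeen[node] = time.Now()
}

// Suspected returns nodes whose last heartbeat is older than the timeout.
func (d *Detector) Suspected() []string {
	d.mu.Lock()
	defer d.mu.Unlock()
	var out []string
	for n, t := range d.lastSeen {
		if time.Since(t) > d.timeout {
			out = append(out, n)
		}
	}
	return out
}
```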