Design a Distributed Key–Value Store (Technical Screen)
Context
You're designing a cloud-native, multi-tenant key–value (KV) storage service for internal ML/analytics platforms. The service must support billions of keys with low-latency reads/writes and be highly available across availability zones (AZs). Some workloads require conditional updates (CAS) and per-key read-after-write consistency; others occasionally need range scans (e.g., prefix scans for model features).
Functional Requirements
- API supports: Get, Put/Upsert, Delete, Batch, Compare-And-Set (CAS), per-key TTL, and optional range scans (an interface sketch follows this list).
- Data model: binary values; metadata includes TTL, version, and optional attributes.
- Per-key read-after-write consistency.
- Optional range scans (prefix or ordered key ranges).
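As a starting point, a candidate might sketch the API surface as a Go interface like the one below. All names, types, and the `Consistency` knob are illustrative assumptions, not a prescribed API; the point is to make error semantics (e.g., CAS version mismatch vs. not-found) and per-request consistency explicit.

```go
package kv

import (
	"context"
	"time"
)

// Consistency lets a caller trade latency for stronger guarantees per request.
type Consistency int

const (
	Eventual       Consistency = iota
	ReadAfterWrite             // per-key read-after-write
)

// Item is the unit of storage: an opaque value plus metadata.
type Item struct {
	Key     []byte
	Value   []byte
	Version uint64        // monotonically increasing per key; used by CAS
	TTL     time.Duration // zero means no expiry
}

// Store is the client-facing surface. Errors should be typed so callers can
// distinguish "not found" from "version conflict" from transient failures.
type Store interface {
	Get(ctx context.Context, key []byte, c Consistency) (*Item, error)
	Put(ctx context.Context, item Item) (uint64, error) // returns the new version
	Delete(ctx context.Context, key []byte) error
	// CAS writes only if the stored version equals expectedVersion.
	CAS(ctx context.Context, item Item, expectedVersion uint64) (uint64, error)
	Batch(ctx context.Context, items []Item) error
	// Scan returns up to limit items whose keys begin with prefix, in key order.
	Scan(ctx context.Context, prefix []byte, limit int) ([]Item, error)
}
```

Returning the new version from Put and CAS gives clients the token they need for subsequent conditional updates.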
Non-Functional Requirements
- High availability across AZs.
- Horizontal scalability to billions of keys.
- Low latency: read/write p99 within tight SLOs (assume single-digit milliseconds within an AZ on cache hits; low tens of milliseconds on stable-storage access).
- Durability.
- Hot-key mitigation.
- Observability (metrics, tracing, alerts).
Design Tasks
- Define the API and data model, including error semantics and consistency options.
- Choose a sharding strategy: consistent hashing vs. range partitioning; justify how range scans are supported (see the ring sketch after this list).
- Choose a replication model: leader–follower vs. leaderless. Define read/write paths, quorum choices, and conflict resolution (see the quorum-read sketch below).
- Select on-disk structures (e.g., LSM vs. B+Tree), compaction strategy, indexing, TTL handling, and caching (see the TTL sketch below).
- Explain hot-key mitigation strategies (see the key-salting sketch below).
- Explain rebalancing, failure detection, recovery, backups, and disaster recovery (see the heartbeat-detector sketch below).
- Provide an observability plan (metrics, tracing, alerting).
- Discuss CAP trade-offs and tunable consistency.
- Outline a testing methodology for correctness, performance, and resilience.
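For the sharding task, a minimal consistent-hash ring with virtual nodes is sketched below; the names and the FNV hash choice are assumptions. Virtual nodes smooth load across the ring, and adding or removing a node remaps only roughly 1/N of the key space, which keeps rebalancing cheap.

```go
package shard

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// Ring is a consistent-hash ring with virtual nodes. Illustrative only.
type Ring struct {
	points []uint32          // sorted hash points on the ring
	owner  map[uint32]string // hash point -> node ID
}

func hashKey(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// NewRing places vnodes virtual points per node so load spreads evenly.
func NewRing(nodes []string, vnodes int) *Ring {
	r := &Ring{owner: make(map[uint32]string)}
	for _, n := range nodes {
		for v := 0; v < vnodes; v++ {
			p := hashKey(fmt.Sprintf("%s#%d", n, v))
			r.points = append(r.points, p)
			r.owner[p] = n
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// Owner returns the first virtual point clockwise from the key's hash.
func (r *Ring) Owner(key string) string {
	h := hashKey(key)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around the ring
	}
	return r.owner[r.points[i]]
}
```

The trade-off to draw out: hashed placement destroys key order, so range scans become scatter-gather across all shards; range partitioning keeps scans local to a few partitions but requires explicit splitting of hot ranges.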
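For the replication task, a leaderless quorum read is sketched below under the standard condition R + W > N; for example, with N = 3 and R = W = 2, every read set intersects every write set in at least one replica. The `Replica` interface and version-based conflict resolution (highest version wins) are assumptions for illustration; real systems may instead use vector clocks or last-write-wins timestamps, and would add read repair.

```go
package quorum

import (
	"context"
	"errors"
)

// Versioned pairs a value with its version for conflict resolution.
type Versioned struct {
	Value   []byte
	Version uint64
}

// Replica is an assumed interface over one storage node.
type Replica interface {
	Get(ctx context.Context, key []byte) (Versioned, error)
}

// QuorumGet queries all replicas, waits for r successes, and returns the
// highest-versioned value seen. Read repair of stale replicas is omitted.
func QuorumGet(ctx context.Context, replicas []Replica, key []byte, r int) (Versioned, error) {
	type res struct {
		v   Versioned
		err error
	}
	ch := make(chan res, len(replicas))
	for _, rep := range replicas {
		go func(rep Replica) {
			v, err := rep.Get(ctx, key)
			ch <- res{v, err}
		}(rep)
	}
	var best Versioned
	ok := 0
	for i := 0; i < len(replicas); i++ {
		out := <-ch
		if out.err != nil {
			continue
		}
		ok++
		if out.v.Version > best.Version {
			best = out.v
		}
		if ok >= r {
			return best, nil // R + W > N guarantees overlap with the last write
		}
	}
	return Versioned{}, errors.New("quorum not reached")
}
```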
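For TTL handling on an LSM-style engine, a common pattern is lazy expiry: persist an absolute expiry timestamp with each record, filter expired entries on the read path, and reclaim the space during compaction. A minimal sketch, with assumed field and function names:

```go
package ttl

import "time"

// Record carries an absolute expiry; the zero time means no TTL.
type Record struct {
	Value     []byte
	ExpiresAt time.Time
}

// Live reports whether the record should still be visible to readers.
func (r Record) Live(now time.Time) bool {
	return r.ExpiresAt.IsZero() || now.Before(r.ExpiresAt)
}

// Compact filters records in place, keeping only live ones. A real LSM
// compaction would also merge duplicate keys and drop tombstones that are
// older than the garbage-collection horizon.
func Compact(in []Record, now time.Time) []Record {
	out := in[:0]
	for _, rec := range in {
		if rec.Live(now) {
			out = append(out, rec)
		}
	}
	return out
}
```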
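For hot keys, one standard mitigation is salting: split a hot key into a fixed number of sub-keys so traffic fans out across partitions. This suits commutative updates such as counters; read-heavy hot keys are usually better served by client/edge caching and request coalescing. The key format and shard count below are assumptions:

```go
package hotkey

import (
	"fmt"
	"math/rand"
)

const shardCount = 16

// WriteKey spreads writes for a hot key across shardCount partitions by
// appending a random salt; works when updates are commutative (counters).
func WriteKey(key string) string {
	return fmt.Sprintf("%s#%d", key, rand.Intn(shardCount))
}

// ReadKeys returns every salted sub-key; the caller fetches all of them
// and aggregates (e.g., sums shard counters) to reconstruct the value.
func ReadKeys(key string) []string {
	keys := make([]string, shardCount)
	for i := range keys {
		keys[i] = fmt.Sprintf("%s#%d", key, i)
	}
	return keys
}
```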
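For failure detection, a simple heartbeat timeout is sketched below; the type and method names are assumptions, and a phi-accrual detector is a common production refinement. Suspected nodes would trigger re-replication of their ranges and, after a grace period, rebalancing.

```go
package detect

import (
	"sync"
	"time"
)

// Detector marks a node as suspected once it has missed heartbeats
// for longer than the configured timeout.
type Detector struct {
	mu       sync.Mutex
	lastSeen map[string]time.Time
	timeout  time.Duration
}

func New(timeout time.Duration) *Detector {
	return &Detector{lastSeen: make(map[string]time.Time), timeout: timeout}
}

// Heartbeat records that the node was alive at the current time.
func (d *Detector) Heartbeat(node string) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.lastSeen[node] = time.Now()
}

// Suspected returns nodes whose last heartbeat is older than the timeout.
func (d *Detector) Suspected() []string {
	d.mu.Lock()
	defer d.mu.Unlock()
	var out []string
	for n, t := range d.lastSeen {
		if time.Since(t) > d.timeout {
			out = append(out, n)
		}
	}
	return out
}
```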