Design: Scale a Single-Node LRU Cache to a Distributed Cache
Assume you are upgrading a single-node, in-memory LRU cache to a distributed cache to support high read throughput (10^5–10^6 QPS), moderate writes, millions of keys, and sub-millisecond p50 latency within a single region. Keys and values are arbitrary byte strings; items have TTLs and are evicted by LRU when memory is full. Availability target is 99.9%+.
Design the system and address the following:
- Sharding: Describe how to shard keys across nodes using consistent hashing with virtual nodes (vnodes). Include how you map a key to its primary node and to its replicas, and how you handle weighted nodes.
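A strong answer can anchor this in a concrete mapping. Below is a minimal sketch of a consistent-hash ring with vnodes, where weight simply scales a node's vnode count and replicas are the next distinct nodes clockwise; the class name, vnode count, and MD5-based hash are illustrative choices, not a prescribed implementation.

```python
import bisect
import hashlib

def _hash(s: str) -> int:
    # Stable 64-bit position on the ring (any well-mixed hash works).
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

class ConsistentHashRing:
    """Hash ring with virtual nodes; a node's weight scales its vnode count."""

    def __init__(self, vnodes_per_weight: int = 100):
        self.vnodes_per_weight = vnodes_per_weight
        self._hashes = []   # sorted vnode positions, for bisect
        self._ring = []     # parallel list of (position, node)

    def add_node(self, node: str, weight: int = 1) -> None:
        for i in range(self.vnodes_per_weight * weight):
            h = _hash(f"{node}#vnode{i}")
            idx = bisect.bisect(self._hashes, h)
            self._hashes.insert(idx, h)
            self._ring.insert(idx, (h, node))

    def remove_node(self, node: str) -> None:
        kept = [(h, n) for h, n in self._ring if n != node]
        self._ring = kept
        self._hashes = [h for h, _ in kept]

    def nodes_for_key(self, key: str, replicas: int = 1) -> list:
        """First distinct node clockwise of the key is the primary;
        subsequent distinct nodes are its replicas."""
        if not self._ring:
            return []
        idx = bisect.bisect(self._hashes, _hash(key)) % len(self._ring)
        owners = []
        for i in range(len(self._ring)):
            _, node = self._ring[(idx + i) % len(self._ring)]
            if node not in owners:
                owners.append(node)
                if len(owners) == replicas:
                    break
        return owners
```

A useful property to call out: removing a node reassigns only the keys that node owned, leaving every other key's primary unchanged.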
- Rebalancing: Explain what happens during node joins and departures, both graceful and due to failure. Include data movement, ownership changes, and how to rate-limit rebalancing traffic to protect tail latency.
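The rate-limiting part is often hand-waved; one concrete option is a token bucket capping rebalancing transfer bandwidth so migrations cannot crowd out foreground traffic. A minimal sketch, with illustrative names and an injectable clock for testing:

```python
import time

class TokenBucket:
    """Caps rebalancing transfer rate (bytes/sec) so background data
    movement yields to foreground requests when the budget is spent."""

    def __init__(self, rate_bytes_per_sec: float, burst_bytes: float,
                 clock=time.monotonic):
        self.rate = rate_bytes_per_sec
        self.capacity = burst_bytes
        self.tokens = burst_bytes     # start with a full burst allowance
        self.clock = clock
        self.last = clock()

    def try_send(self, nbytes: float) -> bool:
        # Refill proportionally to elapsed time, capped at burst size.
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False    # caller should back off and retry later
```

In practice the sender would sleep until the deficit refills rather than spin on `try_send`.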
- Hot Keys: Propose techniques to mitigate hot keys and request stampedes (e.g., heavy-hitter detection, thundering-herd protection).
-
Replication
-
Specify replication factor, placement, write propagation (sync/async), conflict resolution/versioning, and replica read preferences.
-
Failure Detection
-
Describe how nodes detect membership changes (e.g., gossip, heartbeats, thresholds) and how that integrates with routing.
-
Request Routing
-
Compare client-side routing vs proxy/sidecar. Explain how a client locates the right node(s) and handles failures/blacklisting.
-
Read/Write Paths
-
Provide read and write flows for: cache hit, miss + fill, and write-through/write-back/write-around policies.
-
Consistency & Invalidation
-
Discuss TTL handling (soft/hard TTL), write-through/back, cache invalidations (push vs pull), and dogpile prevention.
-
Fault Tolerance & Partitions
-
Explain behavior under node failures and network partitions. State the consistency-availability trade-offs you choose and why.
-
Monitoring & Capacity Planning
-
List the key metrics, SLOs, and an approach to plan capacity (memory, network, QPS). Include simple sizing formulas.
-
Trade-offs & Performance
-
Summarize trade-offs among consistency, availability, latency, and cost. Provide expected latency/throughput numbers and rebalancing cost at scale.