Scale your LRU cache to a distributed cache. Describe how you would shard keys across nodes using consistent hashing with virtual nodes, how rebalancing works during node joins/leaves, and how you would mitigate hot keys. Specify replication strategy, failure detection, request routing, and read/write paths. Discuss cache consistency (TTL, write-through/back, invalidation), fault tolerance under partitions, and monitoring/capacity planning. Provide trade-offs and expected performance.

This question evaluates a candidate's ability to design and scale a distributed caching system, testing competencies in sharding and consistent hashing, replication and consistency models, request routing, failure detection, cache invalidation, capacity planning, and operational trade-offs for high-throughput, low-latency workloads.


What difficulty level is this interview question?

This is a hard difficulty System Design question, commonly asked during Technical Screen rounds at DoorDash.

What role is this question designed for?

This question is commonly asked for Software Engineer candidates at DoorDash during technical interviews.

Scale the cache to a distributed system | DoorDash Interview Question

Design: Scale a Single-Node LRU Cache to a Distributed Cache

Assume you are upgrading a single-node, in-memory LRU cache to a distributed cache that must support high read throughput (10^5–10^6 QPS), a moderate write rate, millions of keys, and sub-millisecond p50 latency within a single region. Keys and values are arbitrary byte strings; items have TTLs and are evicted by LRU when memory is full. The availability target is at least 99.9%.

Design the system and address the following:

Sharding

Describe how to shard keys across nodes using consistent hashing with virtual nodes (vnodes). Include how you map a key to its primary node and to replicas, and how you handle weighted nodes.
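
As a possible starting point for this part, the sketch below shows one way a consistent-hash ring with vnodes could be structured. The class name `HashRing`, the `vnodes_per_weight` parameter, and the use of MD5 for ring positions are illustrative assumptions, not part of the question. Weighted nodes are modeled by giving a node proportionally more vnodes, and replicas are taken as the next distinct nodes clockwise from the primary.

```python
import bisect
import hashlib


def _hash(data: str) -> int:
    """Map a string to a 64-bit ring position (MD5 chosen for spread, not security)."""
    return int.from_bytes(hashlib.md5(data.encode()).digest()[:8], "big")


class HashRing:
    """Hypothetical consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, vnodes_per_weight: int = 100):
        self.vnodes_per_weight = vnodes_per_weight
        self._points = []  # sorted ring positions
        self._owner = {}   # ring position -> physical node name

    def add_node(self, node: str, weight: int = 1) -> None:
        # A node with weight w places w times as many vnodes on the ring.
        for i in range(self.vnodes_per_weight * weight):
            point = _hash(f"{node}#{i}")
            if point in self._owner:
                continue  # position already taken; skip this vnode
            bisect.insort(self._points, point)
            self._owner[point] = node

    def remove_node(self, node: str) -> None:
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def nodes_for(self, key: str, replicas: int = 1) -> list:
        """Return the primary, then the next distinct nodes clockwise."""
        if not self._points:
            return []
        start = bisect.bisect_right(self._points, _hash(key))
        found = []
        for offset in range(len(self._points)):
            point = self._points[(start + offset) % len(self._points)]
            node = self._owner[point]
            if node not in found:
                found.append(node)
                if len(found) == replicas:
                    break
        return found


ring = HashRing()
ring.add_node("cache-a")
ring.add_node("cache-b")
ring.add_node("cache-c", weight=2)  # roughly twice the share of keys
primary, replica = ring.nodes_for("user:42", replicas=2)
```

Walking clockwise and skipping vnodes of already-chosen nodes keeps replicas on distinct physical nodes; a real deployment would likely also want rack or zone awareness in that walk.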

Rebalancing

Explain what happens during node joins and leaves (graceful and failures). Include data movement, ownership changes, and how to rate-limit rebalancing to protect tail latency.
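
One common way to rate-limit rebalancing traffic is a token bucket on bytes transferred. The sketch below is a minimal version under that assumption; `TokenBucket`, its parameters, and the explicit `now` argument (passed in so the logic is easy to test) are hypothetical names chosen for illustration.

```python
class TokenBucket:
    """Hypothetical byte-rate limiter for key migration (illustrative only)."""

    def __init__(self, rate_bytes_per_s: float, burst_bytes: float, now: float = 0.0):
        self.rate = rate_bytes_per_s
        self.capacity = burst_bytes
        self.tokens = burst_bytes  # start with a full bucket
        self.last = now

    def try_consume(self, nbytes: float, now: float) -> bool:
        """Return True if nbytes may be sent now; otherwise the caller should wait."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False


# Example: allow migration at 100 bytes/s with bursts of up to 200 bytes.
bucket = TokenBucket(rate_bytes_per_s=100, burst_bytes=200)
```

A migration loop would call `try_consume` before sending each batch and back off when it returns False, which bounds the bandwidth that rebalancing can take from foreground requests.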

Hot Keys

Propose techniques to mitigate hot keys (heavy hitters) and cache stampedes (thundering herds).
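
Request coalescing (often called "single flight") is one technique an answer might mention for stampedes: concurrent misses on the same key share a single backend load. The sketch below assumes a threaded Python client; `SingleFlight` and `_Call` are hypothetical names, and results are not cached here, only shared among callers that overlap in time.

```python
import threading


class _Call:
    """Holds the shared outcome of one in-flight load."""

    def __init__(self):
        self.done = threading.Event()
        self.value = None
        self.error = None


class SingleFlight:
    """Hypothetical helper: coalesce concurrent loads of the same key."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> _Call

    def do(self, key, loader):
        with self._lock:
            call = self._inflight.get(key)
            leader = call is None
            if leader:
                call = _Call()
                self._inflight[key] = call
        if leader:
            try:
                call.value = loader()  # only the leader hits the backend
            except Exception as exc:
                call.error = exc
            finally:
                with self._lock:
                    del self._inflight[key]
                call.done.set()  # wake all followers
        else:
            call.done.wait()  # followers block until the leader finishes
        if call.error is not None:
            raise call.error
        return call.value
```

This addresses stampedes on a miss; it does not by itself spread read load for a key that is hot while cached, which usually calls for other measures such as reading from replicas or a small client-local cache.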

Replication

Specify replication factor, placement, write propagation (sync/async), conflict resolution/versioning, and replica read preferences.

Failure Detection

Describe how nodes detect membership changes (e.g., gossip, heartbeats, thresholds) and how that integrates with routing.
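
A heartbeat-with-thresholds detector can be sketched in a few lines. The version below is an assumption-laden illustration: `HeartbeatDetector`, the `suspect_after` and `dead_after` thresholds, and the three status strings are hypothetical, and timestamps are passed in explicitly rather than read from a clock.

```python
class HeartbeatDetector:
    """Hypothetical timeout-based failure detector (illustrative only)."""

    def __init__(self, suspect_after: float = 1.0, dead_after: float = 3.0):
        self.suspect_after = suspect_after
        self.dead_after = dead_after
        self._last_seen = {}  # node -> timestamp of last heartbeat

    def heartbeat(self, node: str, now: float) -> None:
        self._last_seen[node] = now

    def status(self, node: str, now: float) -> str:
        if node not in self._last_seen:
            return "unknown"
        silence = now - self._last_seen[node]
        if silence >= self.dead_after:
            return "dead"
        if silence >= self.suspect_after:
            return "suspect"
        return "alive"

    def routable(self, now: float) -> list:
        """Nodes the router should still send traffic to."""
        return sorted(n for n in self._last_seen if self.status(n, now) == "alive")
```

Keeping a "suspect" state between "alive" and "dead" lets routing stop sending new requests to a slow node without immediately triggering ownership changes and data movement.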

Request Routing

Compare client-side routing vs proxy/sidecar. Explain how a client locates the right node(s) and handles failures/blacklisting.

Read/Write Paths

Provide read and write flows for: cache hit, miss + fill, and write-through/write-back/write-around policies.

Consistency & Invalidation

Discuss TTL handling (soft/hard TTL), write-through/back, cache invalidations (push vs pull), and dogpile prevention.
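
To make the soft/hard TTL distinction concrete, here is a minimal sketch under one common interpretation: before the soft expiry an entry is fresh; between the soft and hard expiry it may be served while a refresh runs in the background; after the hard expiry it is treated as a miss. The names `Entry`, `make_entry`, and `classify` are hypothetical.

```python
from dataclasses import dataclass

FRESH, STALE, EXPIRED = "fresh", "stale", "expired"


@dataclass
class Entry:
    value: bytes
    soft_expiry: float  # after this, serve but trigger a refresh
    hard_expiry: float  # after this, never serve


def make_entry(value: bytes, now: float, soft_ttl: float, hard_ttl: float) -> Entry:
    if soft_ttl > hard_ttl:
        raise ValueError("soft TTL must not exceed hard TTL")
    return Entry(value, now + soft_ttl, now + hard_ttl)


def classify(entry: Entry, now: float) -> str:
    """Decide how a read should treat the entry at time `now`."""
    if now >= entry.hard_expiry:
        return EXPIRED  # treat as a miss and reload synchronously
    if now >= entry.soft_expiry:
        return STALE    # serve the value, refresh asynchronously
    return FRESH


entry = make_entry(b"profile-bytes", now=100.0, soft_ttl=30.0, hard_ttl=60.0)
state = classify(entry, now=140.0)
```

Serving stale values during the soft window trades bounded staleness for fewer synchronous reloads, which also helps with dogpile prevention because only the refresher contacts the backing store.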

Fault Tolerance & Partitions

Explain behavior under node failures and network partitions. State the consistency-availability trade-offs you choose and why.

Monitoring & Capacity Planning

List the key metrics, SLOs, and an approach to plan capacity (memory, network, QPS). Include simple sizing formulas.
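
The sizing formulas can be as simple as taking the maximum of the node counts required by memory, QPS, and network. The sketch below assumes a fixed per-item overhead, a headroom factor (the fraction of each resource you are willing to use), and that every request returns roughly one value's worth of bytes; `plan_nodes` and all of its parameters are hypothetical, and the example numbers are placeholders rather than measurements.

```python
import math


def plan_nodes(num_keys, key_bytes, value_bytes, overhead_bytes,
               replication_factor, node_mem_bytes,
               peak_qps, node_qps, nic_bytes_per_s, headroom=0.7):
    """Rough node count: the tightest of the memory, QPS, and network bounds."""
    item_bytes = key_bytes + value_bytes + overhead_bytes
    dataset_bytes = num_keys * item_bytes * replication_factor
    by_memory = math.ceil(dataset_bytes / (node_mem_bytes * headroom))
    by_qps = math.ceil(peak_qps / (node_qps * headroom))
    # Assumes each request returns about one value's worth of bytes.
    by_network = math.ceil(peak_qps * value_bytes / (nic_bytes_per_s * headroom))
    return {
        "by_memory": by_memory,
        "by_qps": by_qps,
        "by_network": by_network,
        "nodes": max(by_memory, by_qps, by_network),
    }


# Placeholder inputs: 10M keys, ~1 KB values, RF=2, 16 GiB nodes,
# 1M peak QPS, 100k QPS per node, 10 Gbit/s NICs.
plan = plan_nodes(
    num_keys=10_000_000, key_bytes=40, value_bytes=1000, overhead_bytes=60,
    replication_factor=2, node_mem_bytes=16 * 2**30,
    peak_qps=1_000_000, node_qps=100_000, nic_bytes_per_s=1_250_000_000,
)
```

With these placeholder inputs the QPS bound dominates, which illustrates why a read-heavy cache is often sized by throughput rather than by memory.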

Trade-offs & Performance

Summarize trade-offs among consistency, availability, latency, and cost. Provide expected latency/throughput numbers and rebalancing cost at scale.
