
Debug a cache incident end-to-end

Last updated: Mar 29, 2026

Quick Overview

This question evaluates a candidate's ability to diagnose and debug production cache incidents, testing competencies in distributed caching, observability, incident response, and root-cause analysis, along with an understanding of how cache health drives microservice latency.


Company: DoorDash

Role: Software Engineer

Category: Other / Miscellaneous

Difficulty: Hard

Interview Round: Technical Screen

You are paged for elevated latency and a surge in cache misses. Walk through your end-to-end debugging plan: what clarifying questions you ask the on-call/senior partner, which dashboards and metrics you inspect (hit ratio, eviction rate, QPS, p99 latency, connection pools, CPU/memory, network), and how you form and test hypotheses (TTL misconfiguration, capacity pressure, hot keys, thundering herd, hash-ring skew, replication lag, network issues). Propose immediate mitigations and a longer-term fix, then outline post-incident actions (blameless RCA, alerts, runbooks).



Incident Scenario: Elevated Latency and Surge in Cache Misses

You are paged for elevated tail latency and a spike in cache misses in a microservice that uses a distributed cache (e.g., Redis or Memcached) in front of a database. Assume a typical read-through pattern and that cache health strongly affects service latency and downstream DB load.

Walk through your end-to-end debugging plan:

1) Clarifying Questions to Ask Immediately

  • Scope and timing:
    • When did the issue start? Is it ongoing or intermittent? One region/shard or global?
    • Any recent deploys, config/feature-flag changes, or infra changes (cache cluster scaling, failover)?
  • Traffic and impact:
    • Any traffic spikes (marketing, batch jobs, experiments)? Which endpoints are affected?
    • What’s the business impact and SLO/SLA at risk?
  • Cache details:
    • Cache type/version, topology (cluster/shards/replicas), client library, eviction policy, memory limits.
    • Key-format or TTL policy changes, negative caching usage, and whether an L1 in-process cache is present.
  • Dependency signals:
    • Downstream DB health and load, replica lag, network changes (cross-AZ routing), connection pool limits.

2) Dashboards and Metrics to Inspect

  • Service-level (Golden Signals):
    • QPS, p50/p95/p99 latency, error rate, timeouts.
    • Thread/concurrency queues, connection-pool utilization (waiting threads), CPU, memory, GC pauses.
  • Cache-level:
    • Hit ratio = hits / (hits + misses) (see the sketch after this list).
    • Miss-reason breakdown, if available (expired, not found, key changed, timeouts).
    • Get/set QPS, p95/p99 command latency, timeouts.
    • Eviction rate, memory used vs. limit, fragmentation.
    • Per-shard/replica load, replication lag, blocked clients, connection count.
    • Top/hot keys, big keys, key TTL distribution.
  • Downstream DB:
    • QPS, CPU, slow queries, lock waits, pool saturation, replica lag.
  • Network/Infra:
    • Cross-AZ/region latency, packet loss/retransmits, recent failovers/route changes.
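
To make the cache-level checks concrete, here is a minimal sketch that pulls the counters behind the hit-ratio, eviction, and memory signals above. It assumes Redis with the redis-py client and a placeholder endpoint; Memcached exposes equivalent counters via its stats command.

    import redis

    # Hypothetical endpoint; substitute your cluster's address.
    r = redis.Redis(host="cache.internal", port=6379)

    info = r.info()  # one INFO call returns the stats, memory, and clients sections

    hits, misses = info["keyspace_hits"], info["keyspace_misses"]
    hit_ratio = hits / (hits + misses) if (hits + misses) else 0.0

    print(f"hit ratio:         {hit_ratio:.2%}")
    print(f"evicted keys:      {info['evicted_keys']}")  # cumulative since restart
    print(f"used/max memory:   {info['used_memory']} / {info.get('maxmemory', 0)}")
    print(f"connected clients: {info['connected_clients']}")

Note that INFO counters are cumulative since the last restart: sample twice and diff to get a rate, and compare against a pre-incident baseline rather than absolute values.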

3) Hypotheses and How You Would Test Them

Consider and prioritize based on signals; test with targeted checks.

  • H1: TTL misconfiguration or key schema change
    • Signals: Sudden drop in hit ratio; spike in expired misses; increased set rate; recent config/deploy.
    • Tests: Diff configs/flags; sample key TTLs (see the TTL-sampling sketch after this list); compare old vs. new key prefixes; dual-read a sample to check whether old keys are still present.
  • H2: Capacity pressure causing evictions
    • Signals: Memory near limit, eviction rate spike, cache CPU high; per-shard imbalance.
    • Tests: Inspect memory/evictions per shard; check item size trends; confirm eviction policy.
  • H3: Hot keys
    • Signals: A few keys dominate QPS; one shard saturated while others are fine.
    • Tests: Top-keys/slowlog/monitor; per-key hit/miss; shard-level QPS.
  • H4: Thundering herd (cache stampede)
    • Signals: Miss spikes in waves aligned with TTL expiry; many concurrent rebuilds per key.
    • Tests: Check TTL distribution alignment; concurrency per key; logs for repeated recomputes.
  • H5: Hash-ring skew / poor partitioning
    • Signals: One shard overloaded (CPU/latency/misses) while others idle.
    • Tests: Compare per-shard metrics; review consistent hashing weights/slot assignments.
  • H6: Replication lag or read-from-replica issues
    • Signals: High replica lag, stale reads, inconsistent availability by node.
    • Tests: Check lag/offsets; compare primary vs replica latencies and miss rates.
  • H7: Network issues (cross-AZ, packet loss)
    • Signals: Elevated cache command latency across services; correlated network alerts.
    • Tests: Compare same-AZ vs cross-AZ RTT; packet loss/retransmits; traceroute.
  • H8: Client connection pool saturation or timeouts
    • Signals: High pool wait time, timeouts to cache, rising p99 latency.
    • Tests: Pool usage vs max; thread dumps; client timeout errors.
  • H9: Serialization/big keys or payload bloat
    • Signals: Increased command latency, network saturation, memory churn.
    • Tests: Sample key sizes, payload profiler, compression ratio changes.
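
Many of the tests above reduce to sampling live keys. For H1, for example, a quick TTL-distribution check with redis-py (the key pattern, bucket edges, and endpoint are placeholder assumptions):

    import redis
    from collections import Counter

    r = redis.Redis(host="cache.internal", port=6379)  # hypothetical endpoint

    def sample_ttls(pattern="user:*", limit=1000):
        """Bucket TTLs of a key sample; TTL -1 means no expiry, -2 means gone."""
        buckets = Counter()
        for i, key in enumerate(r.scan_iter(match=pattern, count=500)):
            if i >= limit:
                break
            ttl = r.ttl(key)
            if ttl == -2:
                buckets["missing"] += 1
            elif ttl == -1:
                buckets["no-ttl"] += 1
            elif ttl < 60:
                buckets["<1m"] += 1
            elif ttl < 3600:
                buckets["<1h"] += 1
            else:
                buckets[">=1h"] += 1
        return buckets

    print(sample_ttls())

A pile-up in a far shorter bucket than policy intends points at H1; a heavy "missing" share under the old key prefix points at a key-schema change.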

Small capacity example: if service QPS is 10k and the hit ratio drops from 92% to 60%, DB QPS rises from 800 to 4,000. If DB capacity is 2,000 QPS, the database saturates and latency spikes further.
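
That projection is just the recap formula (Expected DB QPS ≈ Service QPS × Miss rate); a one-liner makes it easy to replay with live numbers:

    def expected_db_qps(service_qps: float, hit_ratio: float) -> float:
        # In a read-through pattern, every miss falls through to the database.
        return service_qps * (1.0 - hit_ratio)

    print(expected_db_qps(10_000, 0.92))  # 800.0  (healthy)
    print(expected_db_qps(10_000, 0.60))  # 4000.0 (double a 2,000-QPS DB capacity)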

4) Immediate Mitigations (Stabilize First)

Pick a minimal-risk set based on observed signals; avoid a full cache flush.

  • Protect downstream:
    • Enable circuit breakers/backpressure; rate-limit non-critical endpoints.
    • Serve-stale-on-error and stale-while-revalidate for critical lookups.
  • Reduce misses and stampedes:
    • Temporarily increase TTLs (with jitter) for most-used keys; pre-warm critical keys.
    • Enable request coalescing/single-flight per key (sketched after this list); add soft TTL + background refresh.
    • Reinstate previous key schema or TTL if a recent change caused the drop.
  • Relieve capacity hot spots:
    • Scale cache cluster (add memory/nodes); rebalance shards; raise connection limits.
    • For hot keys: add an L1 in-process cache; micro-shard the hot key (replicate value across N suffixes) and randomize reads.
  • Client-side:
    • Increase connection pool size moderately; ensure timeouts and retries are sane; prefer same-AZ endpoints.
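
To illustrate the coalescing mitigation referenced above: on a miss, one caller per key recomputes while the rest wait for that result. A minimal in-process sketch (a distributed variant would use a short-TTL lock key in the cache itself):

    import threading

    class SingleFlight:
        """Collapse concurrent loads of the same key into one upstream call."""

        def __init__(self):
            self._lock = threading.Lock()
            self._inflight = {}  # key -> (done Event, one-element result holder)

        def do(self, key, load_fn):
            with self._lock:
                entry = self._inflight.get(key)
                leader = entry is None
                if leader:
                    entry = (threading.Event(), [])
                    self._inflight[key] = entry
            done, holder = entry
            if leader:
                try:
                    holder.append(load_fn())  # only the leader hits the database
                finally:
                    with self._lock:
                        del self._inflight[key]
                    done.set()
                return holder[0]
            done.wait()
            return holder[0] if holder else load_fn()  # leader failed: retry locally

Combined with serve-stale and TTL jitter, this caps concurrent rebuilds at one per key, which is usually enough to break a stampede.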

Guardrails:

  • Canary changes; roll out gradually; watch p99 latency, hit ratio, eviction rate, and DB QPS.
  • Don’t FLUSHALL in production; warm before shifting traffic.

5) Targeted Tests to Confirm/Refute Hypotheses

  • Compare current vs baseline: hit ratio, eviction rate, per-shard QPS/latency.
  • Sample keys: TTLs, presence under old/new names; compute miss reason distribution.
  • For herd: measure concurrent recomputes per key before/after enabling coalescing.
  • For skew: temporarily rebalance weights or disable an overweight shard to verify shift.
  • For replica lag: route reads to primary on a small canary; check latency/miss improvement.

6) Longer-Term Fixes

  • Design patterns:
    • Soft TTL + hard TTL; stale-while-revalidate; request coalescing per key (see the sketch after this list).
    • TTL jitter to prevent synchronized expiry; negative caching with short TTLs.
    • L1 in-process cache (e.g., Caffeine) for hot or small entries.
    • Hot-key defenses: micro-sharding, write fanout to multiple keys, or dedicated replicated cache group.
    • Consistent hashing with bounded load; automate shard rebalancing.
    • Prefer same-AZ cache routing; graceful failover; hedged requests for tail latency (use sparingly).
  • Observability:
    • Track miss reasons, per-shard metrics, top-keys, key size histograms, replication lag.
    • SLOs for cache hit ratio, cache p99, eviction rate; burn-rate alerts.
  • Safety and change management:
    • Config gating and canaries for TTL/eviction policy changes.
    • Backward-compatible key schema rollouts with dual-read/write and automated backfill.
  • Efficiency:
    • Compress large values; cap payload sizes; structured encoding.
    • Review eviction policy suitability (e.g., allkeys-lru vs volatile-lru) for workload.
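
To make the first two design patterns concrete, here is a minimal soft-TTL/stale-while-revalidate sketch with jitter. The cache client is assumed to be redis-py-compatible, and refresh_async is a hypothetical hook (e.g., a thread-pool submit) for background rebuilds:

    import json
    import random
    import time

    SOFT_TTL = 300   # seconds before a background refresh kicks in (assumed policy)
    HARD_TTL = 900   # actual cache expiry
    JITTER = 60      # de-synchronizes expiries to avoid herd-on-expiry

    def put(cache, key, value):
        entry = {"value": value, "soft_expiry": time.time() + SOFT_TTL}
        # Hard TTL carries the jitter so keys written together don't expire together.
        cache.setex(key, HARD_TTL + random.randint(0, JITTER), json.dumps(entry))

    def get(cache, key, load_fn, refresh_async):
        raw = cache.get(key)
        if raw is None:                  # hard miss: load synchronously
            value = load_fn()
            put(cache, key, value)
            return value
        entry = json.loads(raw)
        if time.time() > entry["soft_expiry"]:
            refresh_async(key, load_fn)  # serve stale now, rebuild off the hot path
        return entry["value"]

Pairing this with the single-flight wrapper from section 4 keeps the background refresh itself from stampeding.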

7) Post-Incident Actions

  • Blameless RCA:
    • Timeline, contributing factors, detection effectiveness, and clear corrective actions.
    • Quantify impact on latency, error rate, and downstream DB.
  • Alerts and dashboards:
    • Add alerts for hit ratio drops, eviction spikes, per-shard imbalance, replica lag, pool saturation.
    • Dashboards with miss reasons, TTL distribution, hot keys, and cross-AZ latency.
  • Runbooks and readiness:
    • Step-by-step triage for cache incidents; safe mitigations; rollback steps.
    • Load tests and game days for stampede/hot-key/eviction scenarios.
    • Pre-warm procedures and traffic ramp plans.

Formula recap:

  • Hit ratio = hits / (hits + misses)
  • Expected DB QPS ≈ Service QPS × Miss rate

In the interview, deliver a concise narrative that ties signals to hypotheses, hypotheses to targeted tests, and tests to mitigations, emphasizing protection of downstream systems, minimal-risk changes, and canarying every change.

