Incident Scenario: Elevated Latency and Surge in Cache Misses
You are paged for elevated tail latency and a spike in cache misses in a microservice that uses a distributed cache (e.g., Redis or Memcached) in front of a database. Assume a typical read-through pattern and that cache health strongly affects service latency and downstream DB load.
Walk through your end-to-end debugging plan:
1) Clarifying Questions to Ask Immediately
- Scope and timing:
  - When did the issue start? Is it ongoing or intermittent? One region/shard or global?
  - Any recent deploys, config/feature-flag changes, or infra changes (cache cluster scaling, failover)?
- Traffic and impact:
  - Any traffic spikes (marketing campaigns, batch jobs, experiments)? Which endpoints are affected?
  - What is the business impact, and which SLO/SLA is at risk?
- Cache details:
  - Cache type/version, topology (cluster/shards/replicas), client library, eviction policy, memory limits.
  - Key-format changes, TTL policy changes, negative caching usage, any L1 in-process cache?
- Dependency signals:
  - Downstream DB load, replica lag, network changes (cross-AZ routing), connection pool limits.
2) Dashboards and Metrics to Inspect
- Service-level (Golden Signals):
  - QPS, p50/p95/p99 latency, error rate, timeouts.
  - Thread/concurrency queues, connection-pool utilization (waiting threads), CPU, memory, GC pauses.
- Cache-level:
  - Hit ratio = hits / (hits + misses); see the sketch after this list for pulling the raw counters.
  - Miss-reason breakdown, if available (expired, not-found, key-changed, timeouts).
  - Get/set QPS, p95/p99 command latency, timeouts.
  - Eviction rate, memory used vs. limit, fragmentation.
  - Per-shard/replica load, replication lag, blocked clients, connection count.
  - Top/hot keys, big keys, key TTL distribution.
- Downstream DB:
  - QPS, CPU, slow queries, lock waits, pool saturation, replica lag.
- Network/Infra:
  - Cross-AZ/region latency, packet loss/retransmits, recent failovers or route changes.
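If the cache is Redis, the raw counters behind several of these metrics are available from INFO. A minimal sketch using the redis-py client (the endpoint is a placeholder; the counters are cumulative since restart, so during an incident compare deltas between two samples or rely on your metrics pipeline):

```python
# Sketch: pull headline cache-health counters from Redis INFO (assumes redis-py).
import redis

r = redis.Redis(host="cache.internal", port=6379)  # placeholder endpoint

stats = r.info("stats")      # keyspace_hits, keyspace_misses, evicted_keys, ...
memory = r.info("memory")    # used_memory, maxmemory, mem_fragmentation_ratio, ...

hits, misses = stats["keyspace_hits"], stats["keyspace_misses"]
hit_ratio = hits / (hits + misses) if (hits + misses) else 0.0

print(f"hit ratio:       {hit_ratio:.1%}")
print(f"evicted keys:    {stats['evicted_keys']}")
print(f"memory used/max: {memory['used_memory']} / {memory.get('maxmemory', 0)}")
print(f"fragmentation:   {memory.get('mem_fragmentation_ratio')}")
```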
3) Hypotheses and How You Would Test Them
Consider and prioritize these hypotheses based on the signals above; test each with targeted checks.
- H1: TTL misconfiguration or key-schema change
  - Signals: sudden drop in hit ratio; spike in expired misses; increased set rate; a recent config change or deploy.
  - Tests: diff config/flags; sample key TTLs; compare old vs. new key prefixes; dual-read a sample to see whether old keys are still present.
- H2: Capacity pressure causing evictions
  - Signals: memory near the limit, eviction-rate spike, high cache CPU, per-shard imbalance.
  - Tests: inspect memory and evictions per shard; check item-size trends; confirm the eviction policy.
- H3: Hot keys
  - Signals: a few keys dominate QPS; one shard saturated while others are fine.
  - Tests: top-keys/slowlog/monitor tooling; per-key hit/miss; shard-level QPS.
- H4: Thundering herd (cache stampede)
  - Signals: miss spikes in waves aligned with TTL expiry; many concurrent rebuilds of the same key.
  - Tests: check TTL-distribution alignment; measure per-key rebuild concurrency; scan logs for repeated recomputes.
- H5: Hash-ring skew / poor partitioning
  - Signals: one shard overloaded (CPU/latency/misses) while others sit idle.
  - Tests: compare per-shard metrics; review consistent-hashing weights or slot assignments.
- H6: Replication lag or read-from-replica issues
  - Signals: high replica lag, stale reads, availability that varies by node.
  - Tests: check lag/offsets; compare primary vs. replica latencies and miss rates.
- H7: Network issues (cross-AZ, packet loss)
  - Signals: elevated cache command latency across services; correlated network alerts.
  - Tests: compare same-AZ vs. cross-AZ RTT; check packet loss/retransmits; traceroute.
- H8: Client connection-pool saturation or timeouts
  - Signals: high pool wait time, timeouts to the cache, rising p99 latency.
  - Tests: pool usage vs. max; thread dumps; client timeout errors.
- H9: Serialization, big keys, or payload bloat
  - Signals: increased command latency, network saturation, memory churn.
  - Tests: sample key sizes; profile payloads; check for compression-ratio changes.
Small capacity example: if service QPS is 10k and the hit ratio drops from 92% to 60%, expected DB QPS rises from 800 to 4,000. With a DB capacity of 2,000 QPS, the database saturates and latency spikes further (see the sketch below).
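A tiny calculation makes the back-of-envelope check above reusable; the QPS, hit-ratio, and DB-capacity figures are the ones from the example:

```python
# Sketch: estimate DB load from the cache miss rate (figures from the example above).
SERVICE_QPS = 10_000
DB_CAPACITY_QPS = 2_000          # DB headroom from the example

def expected_db_qps(service_qps: float, hit_ratio: float) -> float:
    """Expected DB QPS ≈ Service QPS × miss rate (read-through pattern)."""
    return service_qps * (1.0 - hit_ratio)

for hit_ratio in (0.92, 0.60):
    db_qps = expected_db_qps(SERVICE_QPS, hit_ratio)
    status = "within capacity" if db_qps <= DB_CAPACITY_QPS else "SATURATED"
    print(f"hit ratio {hit_ratio:.0%}: DB QPS ≈ {db_qps:,.0f} ({status})")

# hit ratio 92%: DB QPS ≈ 800 (within capacity)
# hit ratio 60%: DB QPS ≈ 4,000 (SATURATED)
```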
4) Immediate Mitigations (Stabilize First)
Pick a minimal-risk set based on observed signals; avoid a full cache flush.
- Protect downstream:
  - Enable circuit breakers and backpressure; rate-limit non-critical endpoints.
  - Serve stale on error and use stale-while-revalidate for critical lookups.
- Reduce misses and stampedes:
  - Temporarily increase TTLs (with jitter) for the most-used keys; pre-warm critical keys.
  - Enable request coalescing (single-flight) per key; add a soft TTL with background refresh (see the sketch after this list).
  - Reinstate the previous key schema or TTL if a recent change caused the drop.
- Relieve capacity hot spots:
  - Scale the cache cluster (add memory/nodes); rebalance shards; raise connection limits.
  - For hot keys: add an L1 in-process cache; micro-shard the hot key (replicate the value across N suffixes) and randomize reads.
- Client-side:
  - Increase connection pool size moderately; ensure timeouts and retries are sane; prefer same-AZ endpoints.
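A minimal sketch of the per-key coalescing plus TTL-jitter mitigation, assuming a redis-py client and a caller-supplied loader function. It coalesces misses only within one process; a multi-instance fleet would still want a distributed guard (for example a short-lived SET NX lock) or soft-TTL background refresh:

```python
# Sketch: per-key single-flight read-through with TTL jitter (assumes redis-py).
import random
import threading
from typing import Callable

import redis

r = redis.Redis(host="cache.internal", port=6379)  # placeholder endpoint

_locks: dict[str, threading.Lock] = {}
_locks_guard = threading.Lock()

def _lock_for(key: str) -> threading.Lock:
    # One lock per key; fine for a sketch, a bounded map would be used in practice.
    with _locks_guard:
        return _locks.setdefault(key, threading.Lock())

def get_with_coalescing(key: str, loader: Callable[[], bytes],
                        base_ttl_s: int = 300, jitter_frac: float = 0.1) -> bytes:
    """Read-through get where only one local caller rebuilds a missing key,
    and TTLs are jittered so a cohort of keys does not expire in lockstep."""
    value = r.get(key)
    if value is not None:
        return value
    with _lock_for(key):        # coalesce concurrent misses for this key
        value = r.get(key)      # re-check: another caller may have refilled it
        if value is not None:
            return value
        value = loader()        # single DB/origin fetch instead of a stampede
        ttl = int(base_ttl_s * (1 + random.uniform(-jitter_frac, jitter_frac)))
        r.set(key, value, ex=ttl)
        return value
```

Usage would look like `get_with_coalescing("user:42", lambda: load_user_from_db(42))`, where `load_user_from_db` is a hypothetical loader hitting the database.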
Guardrails:
- Canary changes and roll them out gradually; watch p99 latency, hit ratio, eviction rate, and DB QPS.
- Don't FLUSHALL in production; warm the cache before shifting traffic.
5) Targeted Tests to Confirm/Refute Hypotheses
- Compare current vs. baseline: hit ratio, eviction rate, per-shard QPS/latency.
- Sample keys: TTLs, presence under old vs. new names; compute the miss-reason distribution (see the sketch after this list).
- For a suspected herd: measure concurrent recomputes per key before and after enabling coalescing.
- For skew: temporarily rebalance weights or drain the overweight shard and verify that load shifts.
- For replica lag: route reads to the primary for a small canary; check for latency/miss improvement.
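For the key-sampling test, one quick way to look at TTL distribution and old-vs-new prefix presence, assuming redis-py; the prefixes and bucket boundaries are illustrative, and SCAN-based sampling avoids the blocking behaviour of KEYS:

```python
# Sketch: sample TTLs for a key prefix without blocking the server (assumes redis-py).
from collections import Counter

import redis

r = redis.Redis(host="cache.internal", port=6379)  # placeholder endpoint

def sample_ttls(prefix: str, sample_size: int = 1000) -> Counter:
    """Bucket TTLs for up to sample_size keys matching the prefix."""
    buckets: Counter = Counter()
    for i, key in enumerate(r.scan_iter(match=f"{prefix}*", count=500)):
        if i >= sample_size:
            break
        ttl = r.ttl(key)                  # -1: no expiry set, -2: key gone mid-scan
        if ttl < 0:
            buckets["no-ttl-or-missing"] += 1
        elif ttl < 60:
            buckets["<1m"] += 1
        elif ttl < 3600:
            buckets["<1h"] += 1
        else:
            buckets[">=1h"] += 1
    return buckets

# Hypothetical old vs. new key schemas: a healthy rollout should show the new
# prefix populated with sensible TTLs while the old prefix drains.
print("old schema:", sample_ttls("user:v1:"))
print("new schema:", sample_ttls("user:v2:"))
```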
6) Longer-Term Fixes
- Design patterns:
  - Soft TTL + hard TTL; stale-while-revalidate; request coalescing per key.
  - TTL jitter to prevent synchronized expiry; negative caching with short TTLs.
  - An L1 in-process cache (e.g., Caffeine) for hot or small entries.
  - Hot-key defenses: micro-sharding, write fan-out to multiple keys, or a dedicated replicated cache group (see the sketch after this list).
  - Consistent hashing with bounded load; automated shard rebalancing.
  - Prefer same-AZ cache routing; graceful failover; hedged requests for tail latency (use sparingly).
- Observability:
  - Track miss reasons, per-shard metrics, top keys, key-size histograms, and replication lag.
  - SLOs for cache hit ratio, cache p99 latency, and eviction rate; burn-rate alerts.
- Safety and change management:
  - Config gating and canaries for TTL and eviction-policy changes.
  - Backward-compatible key-schema rollouts with dual read/write and automated backfill.
- Efficiency:
  - Compress large values; cap payload sizes; use structured encoding.
  - Review eviction-policy suitability for the workload (e.g., allkeys-lru vs. volatile-lru).
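One way to realize the hot-key micro-sharding defense: write the value to N suffixed copies and have reads pick a replica at random, so a single hot key spreads across cache shards. The key naming, replica count, and TTL below are illustrative assumptions, and the client is assumed to be redis-py:

```python
# Sketch: micro-shard a hot key across N suffixed replicas (assumes redis-py).
# Writes fan out to every replica; reads pick one at random, so no single cache
# shard absorbs the full read QPS for that key.
import random
from typing import Optional

import redis

r = redis.Redis(host="cache.internal", port=6379)  # placeholder endpoint
N_REPLICAS = 8                                     # tune to the observed hot-key QPS

def hot_set(key: str, value: bytes, ttl_s: int = 300) -> None:
    pipe = r.pipeline()
    for i in range(N_REPLICAS):
        pipe.set(f"{key}:#{i}", value, ex=ttl_s)   # write fan-out
    pipe.execute()

def hot_get(key: str) -> Optional[bytes]:
    i = random.randrange(N_REPLICAS)               # randomized read
    return r.get(f"{key}:#{i}")
```

The cost is N× write amplification and extra memory for that key, and replicas can briefly diverge after an update, so this fits read-heavy hot keys with short TTLs.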
7) Post-Incident Actions
- Blameless RCA:
  - Timeline, contributing factors, detection effectiveness, and clear corrective actions.
  - Quantify the impact on latency, error rate, and the downstream DB.
- Alerts and dashboards:
  - Add alerts for hit-ratio drops, eviction spikes, per-shard imbalance, replica lag, and pool saturation.
  - Build dashboards showing miss reasons, TTL distribution, hot keys, and cross-AZ latency.
- Runbooks and readiness:
  - Step-by-step triage for cache incidents, safe mitigations, and rollback steps.
  - Load tests and game days for stampede, hot-key, and eviction scenarios.
  - Pre-warm procedures and traffic ramp plans.
Formula recap:
- Hit ratio = hits / (hits + misses)
- Expected DB QPS ≈ Service QPS × Miss rate
During the interview, provide a concise narrative tying signals to hypotheses, tests, and mitigations, emphasizing protection of downstream systems, minimal-risk changes, and canaried rollouts.