Incident Scenario: Elevated Latency and Surge in Cache Misses
You are paged for elevated tail latency and a spike in cache misses in a microservice that uses a distributed cache (e.g., Redis or Memcached) in front of a database. Assume a typical read-through pattern and that cache health strongly affects service latency and downstream DB load.
Walk through your end-to-end debugging plan:
1) Clarifying Questions to Ask Immediately
- Scope and timing:
  - When did the issue start? Is it ongoing or intermittent? One region/shard or global?
  - Any recent deploys, config/feature-flag changes, or infra changes (cache cluster scaling, failover)?
- Traffic and impact:
  - Any traffic spikes (marketing campaigns, batch jobs, experiments)? Which endpoints are affected?
  - What is the business impact, and which SLO/SLA is at risk?
- Cache details:
  - Cache type/version, topology (cluster/shards/replicas), client library, eviction policy, memory limits.
  - Key-format changes, TTL policy changes, negative caching usage, any L1 in-process cache?
- Dependency signals:
  - Downstream DB load, replica lag, network changes (cross-AZ routing), connection pool limits.
2) Dashboards and Metrics to Inspect
- Service-level (Golden Signals):
  - QPS, p50/p95/p99 latency, error rate, timeouts.
  - Thread/concurrency queues, connection-pool utilization (waiting threads), CPU, memory, GC pauses.
- Cache-level:
  - Hit ratio = hits / (hits + misses); see the sketch after this list for pulling the raw counters.
  - Miss-reason breakdown, if available (expired, not-found, key-changed, timeouts).
  - Get/set QPS, p95/p99 command latency, timeouts.
  - Eviction rate, memory used vs. limit, fragmentation.
  - Per-shard/replica load, replication lag, blocked clients, connection count.
  - Top/hot keys, big keys, key TTL distribution.
- Downstream DB:
  - QPS, CPU, slow queries, lock waits, pool saturation, replica lag.
- Network/Infra:
  - Cross-AZ/region latency, packet loss/retransmits, recent failovers or route changes.
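If the cache is Redis, the raw counters behind several of these metrics are available from INFO. A minimal sketch using the redis-py client (the endpoint is a placeholder; the counters are cumulative since restart, so during an incident compare deltas between two samples or rely on your metrics pipeline):

```python
# Sketch: pull headline cache-health counters from Redis INFO (assumes redis-py).
import redis

r = redis.Redis(host="cache.internal", port=6379)  # placeholder endpoint

stats = r.info("stats")      # keyspace_hits, keyspace_misses, evicted_keys, ...
memory = r.info("memory")    # used_memory, maxmemory, mem_fragmentation_ratio, ...

hits, misses = stats["keyspace_hits"], stats["keyspace_misses"]
hit_ratio = hits / (hits + misses) if (hits + misses) else 0.0

print(f"hit ratio:       {hit_ratio:.1%}")
print(f"evicted keys:    {stats['evicted_keys']}")
print(f"memory used/max: {memory['used_memory']} / {memory.get('maxmemory', 0)}")
print(f"fragmentation:   {memory.get('mem_fragmentation_ratio')}")
```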
3) Hypotheses and How You Would Test Them
Consider and prioritize these hypotheses based on the signals above; test each with targeted checks.
- H1: TTL misconfiguration or key-schema change
  - Signals: sudden drop in hit ratio; spike in expired misses; increased set rate; a recent config change or deploy.
  - Tests: diff config/flags; sample key TTLs; compare old vs. new key prefixes; dual-read a sample to see whether old keys are still present.
- H2: Capacity pressure causing evictions
  - Signals: memory near the limit, eviction-rate spike, high cache CPU, per-shard imbalance.
  - Tests: inspect memory and evictions per shard; check item-size trends; confirm the eviction policy.
- H3: Hot keys
  - Signals: a few keys dominate QPS; one shard saturated while others are fine.
  - Tests: top-keys/slowlog/monitor tooling; per-key hit/miss; shard-level QPS.
- H4: Thundering herd (cache stampede)
  - Signals: miss spikes in waves aligned with TTL expiry; many concurrent rebuilds of the same key.
  - Tests: check TTL-distribution alignment; measure per-key rebuild concurrency; scan logs for repeated recomputes.
- H5: Hash-ring skew / poor partitioning
  - Signals: one shard overloaded (CPU/latency/misses) while others sit idle.
  - Tests: compare per-shard metrics; review consistent-hashing weights or slot assignments.
- H6: Replication lag or read-from-replica issues
  - Signals: high replica lag, stale reads, availability that varies by node.
  - Tests: check lag/offsets; compare primary vs. replica latencies and miss rates.
- H7: Network issues (cross-AZ, packet loss)
  - Signals: elevated cache command latency across services; correlated network alerts.
  - Tests: compare same-AZ vs. cross-AZ RTT; check packet loss/retransmits; traceroute.
- H8: Client connection-pool saturation or timeouts
  - Signals: high pool wait time, timeouts to the cache, rising p99 latency.
  - Tests: pool usage vs. max; thread dumps; client timeout errors.
- H9: Serialization, big keys, or payload bloat
  - Signals: increased command latency, network saturation, memory churn.
  - Tests: sample key sizes; profile payloads; check for compression-ratio changes.
Small capacity example: if service QPS is 10k and the hit ratio drops from 92% to 60%, expected DB QPS rises from 800 to 4,000. With a DB capacity of 2,000 QPS, the database saturates and latency spikes further (see the sketch below).
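A tiny calculation makes the back-of-envelope check above reusable; the QPS, hit-ratio, and DB-capacity figures are the ones from the example:

```python
# Sketch: estimate DB load from the cache miss rate (figures from the example above).
SERVICE_QPS = 10_000
DB_CAPACITY_QPS = 2_000          # DB headroom from the example

def expected_db_qps(service_qps: float, hit_ratio: float) -> float:
    """Expected DB QPS ≈ Service QPS × miss rate (read-through pattern)."""
    return service_qps * (1.0 - hit_ratio)

for hit_ratio in (0.92, 0.60):
    db_qps = expected_db_qps(SERVICE_QPS, hit_ratio)
    status = "within capacity" if db_qps <= DB_CAPACITY_QPS else "SATURATED"
    print(f"hit ratio {hit_ratio:.0%}: DB QPS ≈ {db_qps:,.0f} ({status})")

# hit ratio 92%: DB QPS ≈ 800 (within capacity)
# hit ratio 60%: DB QPS ≈ 4,000 (SATURATED)
```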
4) Immediate Mitigations (Stabilize First)
Pick a minimal-risk set based on observed signals; avoid a full cache flush.
- Protect downstream:
  - Enable circuit breakers and backpressure; rate-limit non-critical endpoints.
  - Serve stale on error and use stale-while-revalidate for critical lookups.
- Reduce misses and stampedes:
  - Temporarily increase TTLs (with jitter) for the most-used keys; pre-warm critical keys.
  - Enable request coalescing (single-flight) per key; add a soft TTL with background refresh (see the sketch after this list).
  - Reinstate the previous key schema or TTL if a recent change caused the drop.
- Relieve capacity hot spots:
  - Scale the cache cluster (add memory/nodes); rebalance shards; raise connection limits.
  - For hot keys: add an L1 in-process cache; micro-shard the hot key (replicate the value across N suffixes) and randomize reads.
- Client-side:
  - Increase connection pool size moderately; ensure timeouts and retries are sane; prefer same-AZ endpoints.
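A minimal sketch of the per-key coalescing plus TTL-jitter mitigation, assuming a redis-py client and a caller-supplied loader function. It coalesces misses only within one process; a multi-instance fleet would still want a distributed guard (for example a short-lived SET NX lock) or soft-TTL background refresh:

```python
# Sketch: per-key single-flight read-through with TTL jitter (assumes redis-py).
import random
import threading
from typing import Callable

import redis

r = redis.Redis(host="cache.internal", port=6379)  # placeholder endpoint

_locks: dict[str, threading.Lock] = {}
_locks_guard = threading.Lock()

def _lock_for(key: str) -> threading.Lock:
    # One lock per key; fine for a sketch, a bounded map would be used in practice.
    with _locks_guard:
        return _locks.setdefault(key, threading.Lock())

def get_with_coalescing(key: str, loader: Callable[[], bytes],
                        base_ttl_s: int = 300, jitter_frac: float = 0.1) -> bytes:
    """Read-through get where only one local caller rebuilds a missing key,
    and TTLs are jittered so a cohort of keys does not expire in lockstep."""
    value = r.get(key)
    if value is not None:
        return value
    with _lock_for(key):        # coalesce concurrent misses for this key
        value = r.get(key)      # re-check: another caller may have refilled it
        if value is not None:
            return value
        value = loader()        # single DB/origin fetch instead of a stampede
        ttl = int(base_ttl_s * (1 + random.uniform(-jitter_frac, jitter_frac)))
        r.set(key, value, ex=ttl)
        return value
```

Usage would look like `get_with_coalescing("user:42", lambda: load_user_from_db(42))`, where `load_user_from_db` is a hypothetical loader hitting the database.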
Guardrails:
- Canary changes and roll them out gradually; watch p99 latency, hit ratio, eviction rate, and DB QPS.
- Don't FLUSHALL in production; warm the cache before shifting traffic.
5) Targeted Tests to Confirm/Refute Hypotheses
- Compare current vs. baseline: hit ratio, eviction rate, per-shard QPS/latency.
- Sample keys: TTLs, presence under old vs. new names; compute the miss-reason distribution (see the sketch after this list).
- For a suspected herd: measure concurrent recomputes per key before and after enabling coalescing.
- For skew: temporarily rebalance weights or drain the overweight shard and verify that load shifts.
- For replica lag: route reads to the primary for a small canary; check for latency/miss improvement.
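For the key-sampling test, one quick way to look at TTL distribution and old-vs-new prefix presence, assuming redis-py; the prefixes and bucket boundaries are illustrative, and SCAN-based sampling avoids the blocking behaviour of KEYS:

```python
# Sketch: sample TTLs for a key prefix without blocking the server (assumes redis-py).
from collections import Counter

import redis

r = redis.Redis(host="cache.internal", port=6379)  # placeholder endpoint

def sample_ttls(prefix: str, sample_size: int = 1000) -> Counter:
    """Bucket TTLs for up to sample_size keys matching the prefix."""
    buckets: Counter = Counter()
    for i, key in enumerate(r.scan_iter(match=f"{prefix}*", count=500)):
        if i >= sample_size:
            break
        ttl = r.ttl(key)                  # -1: no expiry set, -2: key gone mid-scan
        if ttl < 0:
            buckets["no-ttl-or-missing"] += 1
        elif ttl < 60:
            buckets["<1m"] += 1
        elif ttl < 3600:
            buckets["<1h"] += 1
        else:
            buckets[">=1h"] += 1
    return buckets

# Hypothetical old vs. new key schemas: a healthy rollout should show the new
# prefix populated with sensible TTLs while the old prefix drains.
print("old schema:", sample_ttls("user:v1:"))
print("new schema:", sample_ttls("user:v2:"))
```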
6) Longer-Term Fixes
- Design patterns:
  - Soft TTL + hard TTL; stale-while-revalidate; request coalescing per key.
  - TTL jitter to prevent synchronized expiry; negative caching with short TTLs.
  - An L1 in-process cache (e.g., Caffeine) for hot or small entries.
  - Hot-key defenses: micro-sharding, write fan-out to multiple keys, or a dedicated replicated cache group (see the sketch after this list).
  - Consistent hashing with bounded load; automated shard rebalancing.
  - Prefer same-AZ cache routing; graceful failover; hedged requests for tail latency (use sparingly).
- Observability:
  - Track miss reasons, per-shard metrics, top keys, key-size histograms, and replication lag.
  - SLOs for cache hit ratio, cache p99 latency, and eviction rate; burn-rate alerts.
- Safety and change management:
  - Config gating and canaries for TTL and eviction-policy changes.
  - Backward-compatible key-schema rollouts with dual read/write and automated backfill.
- Efficiency:
  - Compress large values; cap payload sizes; use structured encoding.
  - Review eviction-policy suitability for the workload (e.g., allkeys-lru vs. volatile-lru).
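One way to realize the hot-key micro-sharding defense: write the value to N suffixed copies and have reads pick a replica at random, so a single hot key spreads across cache shards. The key naming, replica count, and TTL below are illustrative assumptions, and the client is assumed to be redis-py:

```python
# Sketch: micro-shard a hot key across N suffixed replicas (assumes redis-py).
# Writes fan out to every replica; reads pick one at random, so no single cache
# shard absorbs the full read QPS for that key.
import random
from typing import Optional

import redis

r = redis.Redis(host="cache.internal", port=6379)  # placeholder endpoint
N_REPLICAS = 8                                     # tune to the observed hot-key QPS

def hot_set(key: str, value: bytes, ttl_s: int = 300) -> None:
    pipe = r.pipeline()
    for i in range(N_REPLICAS):
        pipe.set(f"{key}:#{i}", value, ex=ttl_s)   # write fan-out
    pipe.execute()

def hot_get(key: str) -> Optional[bytes]:
    i = random.randrange(N_REPLICAS)               # randomized read
    return r.get(f"{key}:#{i}")
```

The cost is N× write amplification and extra memory for that key, and replicas can briefly diverge after an update, so this fits read-heavy hot keys with short TTLs.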
7) Post-Incident Actions
- Blameless RCA:
  - Timeline, contributing factors, detection effectiveness, and clear corrective actions.
  - Quantify the impact on latency, error rate, and the downstream DB.
- Alerts and dashboards:
  - Add alerts for hit-ratio drops, eviction spikes, per-shard imbalance, replica lag, and pool saturation.
  - Build dashboards showing miss reasons, TTL distribution, hot keys, and cross-AZ latency.
- Runbooks and readiness:
  - Step-by-step triage for cache incidents, safe mitigations, and rollback steps.
  - Load tests and game days for stampede, hot-key, and eviction scenarios.
  - Pre-warm procedures and traffic ramp plans.
Formula recap:
- Hit ratio = hits / (hits + misses)
- Expected DB QPS ≈ Service QPS × Miss rate
During the interview, provide a concise narrative tying signals to hypotheses, tests, and mitigations, emphasizing protection of downstream systems, minimal-risk changes, and canaried rollouts.