
Debug a cache incident end-to-end

Last updated: Mar 29, 2026

Quick Overview

This question evaluates a candidate's ability to diagnose and debug production cache incidents, testing competencies in distributed caching, observability, incident response, and root-cause analysis, along with an understanding of how cache health drives microservice latency.


Company: DoorDash

Role: Software Engineer

Category: Other / Miscellaneous

Difficulty: Hard

Interview Round: Technical Screen

You are paged for elevated latency and a surge in cache misses. Walk through your end-to-end debugging plan: what clarifying questions you ask the on-call/senior partner, which dashboards and metrics you inspect (hit ratio, eviction rate, QPS, p99 latency, connection pools, CPU/memory, network), and how you form and test hypotheses (TTL misconfiguration, capacity pressure, hot keys, thundering herd, hash-ring skew, replication lag, network issues). Propose immediate mitigations and a longer-term fix, then outline post-incident actions (blameless RCA, alerts, runbooks).



Incident Scenario: Elevated Latency and Surge in Cache Misses

You are paged for elevated tail latency and a spike in cache misses in a microservice that uses a distributed cache (e.g., Redis or Memcached) in front of a database. Assume a typical read-through pattern and that cache health strongly affects service latency and downstream DB load.

Walk through your end-to-end debugging plan:

1) Clarifying Questions to Ask Immediately

  • Scope and timing:
    • When did the issue start? Is it ongoing or intermittent? One region/shard or global?
    • Any recent deploys, config/feature-flag changes, or infra changes (cache cluster scaling, failover)?
  • Traffic and impact:
    • Any traffic spikes (marketing, batch jobs, experiments)? Which endpoints are affected?
    • What’s the business impact and SLO/SLA at risk?
  • Cache details:
    • Cache type/version, topology (cluster/shards/replicas), client library, eviction policy, memory limits.
    • Key-format or TTL policy changes, negative caching usage, and whether an L1 in-process cache is present.
  • Dependency signals:
    • Downstream DB health and load, replica lag, network changes (cross-AZ routing), connection pool limits.

2) Dashboards and Metrics to Inspect

  • Service-level (Golden Signals):
    • QPS, p50/p95/p99 latency, error rate, timeouts.
    • Thread/concurrency queues, connection-pool utilization (waiting threads), CPU, memory, GC pauses.
  • Cache-level:
    • Hit ratio = hits / (hits + misses) (see the sketch after this list).
    • Miss-reason breakdown, if available (expired, not found, key changed, timeouts).
    • Get/set QPS, p95/p99 command latency, timeouts.
    • Eviction rate, memory used vs. limit, fragmentation.
    • Per-shard/replica load, replication lag, blocked clients, connection count.
    • Top/hot keys, big keys, key TTL distribution.
  • Downstream DB:
    • QPS, CPU, slow queries, lock waits, pool saturation, replica lag.
  • Network/Infra:
    • Cross-AZ/region latency, packet loss/retransmits, recent failovers/route changes.
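
To make the cache-level checks concrete, here is a minimal sketch that pulls the counters behind the hit-ratio, eviction, and memory signals above. It assumes Redis with the redis-py client and a placeholder endpoint; Memcached exposes equivalent counters via its stats command.

    import redis

    # Hypothetical endpoint; substitute your cluster's address.
    r = redis.Redis(host="cache.internal", port=6379)

    info = r.info()  # one INFO call returns the stats, memory, and clients sections

    hits, misses = info["keyspace_hits"], info["keyspace_misses"]
    hit_ratio = hits / (hits + misses) if (hits + misses) else 0.0

    print(f"hit ratio:         {hit_ratio:.2%}")
    print(f"evicted keys:      {info['evicted_keys']}")  # cumulative since restart
    print(f"used/max memory:   {info['used_memory']} / {info.get('maxmemory', 0)}")
    print(f"connected clients: {info['connected_clients']}")

Note that INFO counters are cumulative since the last restart: sample twice and diff to get a rate, and compare against a pre-incident baseline rather than absolute values.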

3) Hypotheses and How You Would Test Them

Consider and prioritize based on signals; test with targeted checks.

  • H1: TTL misconfiguration or key schema change
    • Signals: Sudden drop in hit ratio; spike in expired misses; increased set rate; recent config/deploy.
    • Tests: Diff configs/flags; sample key TTLs (see the TTL-sampling sketch after this list); compare old vs. new key prefixes; dual-read a sample to check whether old keys are still present.
  • H2: Capacity pressure causing evictions
    • Signals: Memory near limit, eviction rate spike, cache CPU high; per-shard imbalance.
    • Tests: Inspect memory/evictions per shard; check item size trends; confirm eviction policy.
  • H3: Hot keys
    • Signals: A few keys dominate QPS; one shard saturated while others are fine.
    • Tests: Top-keys/slowlog/monitor; per-key hit/miss; shard-level QPS.
  • H4: Thundering herd (cache stampede)
    • Signals: Miss spikes in waves aligned with TTL expiry; many concurrent rebuilds per key.
    • Tests: Check TTL distribution alignment; concurrency per key; logs for repeated recomputes.
  • H5: Hash-ring skew / poor partitioning
    • Signals: One shard overloaded (CPU/latency/misses) while others idle.
    • Tests: Compare per-shard metrics; review consistent hashing weights/slot assignments.
  • H6: Replication lag or read-from-replica issues
    • Signals: High replica lag, stale reads, inconsistent availability by node.
    • Tests: Check lag/offsets; compare primary vs replica latencies and miss rates.
  • H7: Network issues (cross-AZ, packet loss)
    • Signals: Elevated cache command latency across services; correlated network alerts.
    • Tests: Compare same-AZ vs cross-AZ RTT; packet loss/retransmits; traceroute.
  • H8: Client connection pool saturation or timeouts
    • Signals: High pool wait time, timeouts to cache, rising p99 latency.
    • Tests: Pool usage vs max; thread dumps; client timeout errors.
  • H9: Serialization/big keys or payload bloat
    • Signals: Increased command latency, network saturation, memory churn.
    • Tests: Sample key sizes, payload profiler, compression ratio changes.
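
Many of the tests above reduce to sampling live keys. For H1, for example, a quick TTL-distribution check with redis-py (the key pattern, bucket edges, and endpoint are placeholder assumptions):

    import redis
    from collections import Counter

    r = redis.Redis(host="cache.internal", port=6379)  # hypothetical endpoint

    def sample_ttls(pattern="user:*", limit=1000):
        """Bucket TTLs of a key sample; TTL -1 means no expiry, -2 means gone."""
        buckets = Counter()
        for i, key in enumerate(r.scan_iter(match=pattern, count=500)):
            if i >= limit:
                break
            ttl = r.ttl(key)
            if ttl == -2:
                buckets["missing"] += 1
            elif ttl == -1:
                buckets["no-ttl"] += 1
            elif ttl < 60:
                buckets["<1m"] += 1
            elif ttl < 3600:
                buckets["<1h"] += 1
            else:
                buckets[">=1h"] += 1
        return buckets

    print(sample_ttls())

A pile-up in a far shorter bucket than policy intends points at H1; a heavy "missing" share under the old key prefix points at a key-schema change.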

Small capacity example: if service QPS is 10k and the hit ratio drops from 92% to 60%, DB QPS rises from 800 to 4,000. If DB capacity is 2,000 QPS, the database saturates and latency spikes further.
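
That projection is just the recap formula (Expected DB QPS ≈ Service QPS × Miss rate); a one-liner makes it easy to replay with live numbers:

    def expected_db_qps(service_qps: float, hit_ratio: float) -> float:
        # In a read-through pattern, every miss falls through to the database.
        return service_qps * (1.0 - hit_ratio)

    print(expected_db_qps(10_000, 0.92))  # 800.0  (healthy)
    print(expected_db_qps(10_000, 0.60))  # 4000.0 (double a 2,000-QPS DB capacity)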

4) Immediate Mitigations (Stabilize First)

Pick a minimal-risk set based on observed signals; avoid a full cache flush.

  • Protect downstream:
    • Enable circuit breakers/backpressure; rate-limit non-critical endpoints.
    • Serve-stale-on-error and stale-while-revalidate for critical lookups.
  • Reduce misses and stampedes:
    • Temporarily increase TTLs (with jitter) for most-used keys; pre-warm critical keys.
    • Enable request coalescing/single-flight per key (sketched after this list); add soft TTL + background refresh.
    • Reinstate previous key schema or TTL if a recent change caused the drop.
  • Relieve capacity hot spots:
    • Scale cache cluster (add memory/nodes); rebalance shards; raise connection limits.
    • For hot keys: add an L1 in-process cache; micro-shard the hot key (replicate value across N suffixes) and randomize reads.
  • Client-side:
    • Increase connection pool size moderately; ensure timeouts and retries are sane; prefer same-AZ endpoints.
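
To illustrate the coalescing mitigation referenced above: on a miss, one caller per key recomputes while the rest wait for that result. A minimal in-process sketch (a distributed variant would use a short-TTL lock key in the cache itself):

    import threading

    class SingleFlight:
        """Collapse concurrent loads of the same key into one upstream call."""

        def __init__(self):
            self._lock = threading.Lock()
            self._inflight = {}  # key -> (done Event, one-element result holder)

        def do(self, key, load_fn):
            with self._lock:
                entry = self._inflight.get(key)
                leader = entry is None
                if leader:
                    entry = (threading.Event(), [])
                    self._inflight[key] = entry
            done, holder = entry
            if leader:
                try:
                    holder.append(load_fn())  # only the leader hits the database
                finally:
                    with self._lock:
                        del self._inflight[key]
                    done.set()
                return holder[0]
            done.wait()
            return holder[0] if holder else load_fn()  # leader failed: retry locally

Combined with serve-stale and TTL jitter, this caps concurrent rebuilds at one per key, which is usually enough to break a stampede.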

Guardrails:

  • Canary changes; roll out gradually; watch p99 latency, hit ratio, eviction rate, and DB QPS.
  • Don’t FLUSHALL in production; warm before shifting traffic.

5) Targeted Tests to Confirm/Refute Hypotheses

  • Compare current vs baseline: hit ratio, eviction rate, per-shard QPS/latency.
  • Sample keys: TTLs, presence under old/new names; compute miss reason distribution.
  • For herd: measure concurrent recomputes per key before/after enabling coalescing.
  • For skew: temporarily rebalance weights or disable an overweight shard to verify shift.
  • For replica lag: route reads to primary on a small canary; check latency/miss improvement.

6) Longer-Term Fixes

  • Design patterns:
    • Soft TTL + hard TTL; stale-while-revalidate; request coalescing per key (see the sketch after this list).
    • TTL jitter to prevent synchronized expiry; negative caching with short TTLs.
    • L1 in-process cache (e.g., Caffeine) for hot or small entries.
    • Hot-key defenses: micro-sharding, write fanout to multiple keys, or dedicated replicated cache group.
    • Consistent hashing with bounded load; automate shard rebalancing.
    • Prefer same-AZ cache routing; graceful failover; hedged requests for tail latency (use sparingly).
  • Observability:
    • Track miss reasons, per-shard metrics, top-keys, key size histograms, replication lag.
    • SLOs for cache hit ratio, cache p99, eviction rate; burn-rate alerts.
  • Safety and change management:
    • Config gating and canaries for TTL/eviction policy changes.
    • Backward-compatible key schema rollouts with dual-read/write and automated backfill.
  • Efficiency:
    • Compress large values; cap payload sizes; structured encoding.
    • Review eviction policy suitability (e.g., allkeys-lru vs volatile-lru) for workload.
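
To make the first two design patterns concrete, here is a minimal soft-TTL/stale-while-revalidate sketch with jitter. The cache client is assumed to be redis-py-compatible, and refresh_async is a hypothetical hook (e.g., a thread-pool submit) for background rebuilds:

    import json
    import random
    import time

    SOFT_TTL = 300   # seconds before a background refresh kicks in (assumed policy)
    HARD_TTL = 900   # actual cache expiry
    JITTER = 60      # de-synchronizes expiries to avoid herd-on-expiry

    def put(cache, key, value):
        entry = {"value": value, "soft_expiry": time.time() + SOFT_TTL}
        # Hard TTL carries the jitter so keys written together don't expire together.
        cache.setex(key, HARD_TTL + random.randint(0, JITTER), json.dumps(entry))

    def get(cache, key, load_fn, refresh_async):
        raw = cache.get(key)
        if raw is None:                  # hard miss: load synchronously
            value = load_fn()
            put(cache, key, value)
            return value
        entry = json.loads(raw)
        if time.time() > entry["soft_expiry"]:
            refresh_async(key, load_fn)  # serve stale now, rebuild off the hot path
        return entry["value"]

Pairing this with the single-flight wrapper from section 4 keeps the background refresh itself from stampeding.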

7) Post-Incident Actions

  • Blameless RCA:
    • Timeline, contributing factors, detection effectiveness, and clear corrective actions.
    • Quantify impact on latency, error rate, and downstream DB.
  • Alerts and dashboards:
    • Add alerts for hit ratio drops, eviction spikes, per-shard imbalance, replica lag, pool saturation.
    • Dashboards with miss reasons, TTL distribution, hot keys, and cross-AZ latency.
  • Runbooks and readiness:
    • Step-by-step triage for cache incidents; safe mitigations; rollback steps.
    • Load tests and game days for stampede/hot-key/eviction scenarios.
    • Pre-warm procedures and traffic ramp plans.

Formula recap:

  • Hit ratio = hits / (hits + misses)
  • Expected DB QPS ≈ Service QPS × Miss rate

In the interview, deliver a concise narrative that ties signals to hypotheses, hypotheses to targeted tests, and tests to mitigations, emphasizing protection of downstream systems, minimal-risk changes, and canarying every change.

