Distributed Database Incident: Inconsistent Replicas
You are on-call for a distributed database that serves production traffic. Dashboards show data inconsistencies across replicas (stale reads, divergent row versions, missing writes). Assume a general setup that could be either leader-based (e.g., Raft/Paxos) or leaderless (e.g., Dynamo-style) with N replicas per shard.
Tasks
Likely Causes

- Enumerate plausible root causes for replica divergence (a probe for confirming and scoping divergence is sketched after this list), such as:
  - Replication lag and backpressure
  - Failed or flapping leader election / split brain
  - Clock skew / lease/timestamp issues
  - Network partition or asymmetric packet loss
  - Write conflicts and concurrent updates
  - Anti-entropy/repair pipeline failures (e.g., Merkle trees, hinted handoff)
  - Storage/WAL corruption, fsync/config issues, disk I/O saturation
  - Client read-consistency misconfiguration (reading from followers without safeguards)
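
Before chasing a specific cause, it helps to confirm and scope the divergence itself. Below is a minimal Python sketch, assuming a hypothetical `read_from_replica(key, replica)` debug read that hits one replica directly and bypasses quorum/leader coordination; it groups replicas by a digest of what they return for a sample of keys, so more than one group localizes the divergence.

```python
import hashlib
from collections import defaultdict

def replica_digests(keys, replicas, read_from_replica):
    """Map each replica to a digest of the (key, version, value) triples it returns.

    read_from_replica(key, replica) is a hypothetical debug read that hits one
    replica directly, bypassing quorum/leader coordination.
    """
    digests = {}
    for replica in replicas:
        h = hashlib.sha256()
        for key in sorted(keys):
            value, version = read_from_replica(key, replica)
            h.update(f"{key}|{version}|{value}".encode())
        digests[replica] = h.hexdigest()
    return digests

def diverging_replicas(digests):
    """Group replicas by digest; more than one group means divergence."""
    groups = defaultdict(list)
    for replica, digest in digests.items():
        groups[digest].append(replica)
    return groups
```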

Investigation Plan (Step-by-Step)

- Present a prioritized, time-bounded plan to triage, scope, and diagnose the incident while minimizing blast radius. Include a decision tree for leader-based vs leaderless designs (one possible shape is sketched below).
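
As one possible shape for that decision tree, here is a hedged Python sketch; every signal name and threshold (`leaders_seen`, `max_clock_skew_ms`, the 30 s lag cutoff, and so on) is a hypothetical placeholder rather than a metric from any particular database.

```python
from dataclasses import dataclass

@dataclass
class ShardSignals:
    """Hypothetical triage inputs gathered from dashboards for one shard."""
    leader_based: bool            # Raft/Paxos-style vs. Dynamo-style
    leaders_seen: int             # distinct nodes claiming leadership (leader-based only)
    max_replication_lag_s: float
    max_clock_skew_ms: float
    hinted_handoff_backlog: int   # leaderless only: undelivered hints
    repair_job_healthy: bool      # leaderless only: is anti-entropy running?

def triage(s: ShardSignals) -> str:
    """Return the first hypothesis to investigate, ordered roughly by blast radius."""
    if s.max_clock_skew_ms > 500:
        return "clock skew: check NTP/chrony before trusting timestamps or leases"
    if s.leader_based:
        if s.leaders_seen > 1:
            return "possible split brain: compare terms/epochs, fence the stale leader"
        if s.max_replication_lag_s > 30:
            return "follower lag: inspect apply rate, disk I/O, and snapshot transfer"
        return "check the client read path: follower reads enabled without read-index/lease checks?"
    else:
        if not s.repair_job_healthy:
            return "anti-entropy/repair pipeline down: divergence will not self-heal"
        if s.hinted_handoff_backlog > 0:
            return "hinted handoff backlog: writes missing on some replicas until replay"
        return "check R/W quorum settings: is R + W > N actually enforced?"
```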

Telemetry and Logs to Inspect

- Specify the metrics, traces, and logs you would examine at each layer (an apply-lag and clock-skew calculation is sketched after this list):
  - Consensus/coordination layer (e.g., term/epoch, commit/applied index)
  - Replication layer (lag, queue depth, apply rate)
  - Storage/engine layer (WAL/LSN, compaction, fsync latency)
  - Network/host (loss, latency, CPU, GC pauses, disk)
  - Time sync (NTP/chrony skew)
- Include what you’d look for in client-side telemetry (error rates, read staleness).
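
Metric names differ across systems, so the following is an illustration only: a sketch that computes the two derived numbers most often needed in triage, per-replica apply lag (commit index minus applied index) and worst-case pairwise clock skew, from hypothetical per-node samples.

```python
from typing import Dict

def apply_lag(commit_index: Dict[str, int], applied_index: Dict[str, int]) -> Dict[str, int]:
    """Entries committed by the consensus layer but not yet applied locally.

    A steadily growing value on one replica points at the apply/storage path
    (slow fsync, compaction stalls) rather than the network.
    """
    return {node: commit_index[node] - applied_index[node] for node in commit_index}

def max_pairwise_skew_ms(node_clocks_ms: Dict[str, float]) -> float:
    """Worst-case clock disagreement across nodes, from one synchronized sample.

    Compare this against lease durations and any timestamp-ordering assumptions.
    """
    return max(node_clocks_ms.values()) - min(node_clocks_ms.values())

# Example with made-up samples:
lag = apply_lag({"r1": 1200, "r2": 1200, "r3": 1200},
                {"r1": 1200, "r2": 1199, "r3": 950})
skew = max_pairwise_skew_ms({"r1": 1_000.0, "r2": 1_003.5, "r3": 980.0})
print(lag)   # {'r1': 0, 'r2': 1, 'r3': 250}  -> r3's apply path is behind
print(skew)  # 23.5 ms
```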

Verify Consistency Guarantees

- Describe how you would verify:
  - Read-after-write behavior
  - Quorum read/write behavior (R/W quorums with N replicas)
  - Monotonic reads and linearizability (when applicable)
- Outline small, targeted experiments/tests you would run in prod or a canary (one such probe is sketched below).
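
One concrete experiment is a canary probe against a test keyspace. The sketch below assumes a hypothetical client handle exposing `write(key, value)` and `read(key)` (not any real driver API); it measures the stale-read rate for read-after-write and includes the R + W > N overlap check for the quorum configuration.

```python
import time
import uuid

def quorum_overlaps(n: int, r: int, w: int) -> bool:
    """R + W > N guarantees every read quorum intersects the latest write quorum."""
    return r + w > n

def read_after_write_probe(client, attempts: int = 100) -> float:
    """Write a unique value, read it back immediately, and report the stale-read rate.

    `client` is a hypothetical handle with write(key, value) and read(key) methods;
    wire this to your system's canary/test keyspace, not production data.
    """
    stale = 0
    for _ in range(attempts):
        key = f"canary:{uuid.uuid4()}"
        token = str(uuid.uuid4())
        client.write(key, token)
        if client.read(key) != token:
            stale += 1
        time.sleep(0.05)  # avoid hammering the cluster during an incident
    return stale / attempts

# Quorum sanity check for N = 3:
assert quorum_overlaps(3, r=2, w=2)       # overlapping quorums
assert not quorum_overlaps(3, r=1, w=1)   # stale reads are allowed by design
```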

Mitigation and Repair Playbook

- Propose immediate mitigations and longer-term repairs, such as:
  - Adjust read/write quorums; pin reads to the leader; disable follower reads
  - Fencing tokens/epochs; tighten lease checks; disable elections on flapping nodes
  - NTP checks and remediations
  - Backfill/repair jobs (anti-entropy, snapshot restore, rebuild)
  - Throttling/backpressure and isolation of lagging replicas
- Include validation/guardrails to avoid further data loss or downtime (a guarded repair sketch follows).
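
For the backfill/repair step, here is a guarded sketch of a key-by-key repair loop. All hooks (`read_versioned`, `write_to_replica`) are hypothetical placeholders for your system's repair interfaces, and picking the "newest" copy by version assumes versions are totally ordered (e.g., LSN-like), which does not hold for concurrent conflicting writes. The guardrails are a dry-run default and a hard cap on repairs per run.

```python
def repair_keys(keys, replicas, read_versioned, write_to_replica, dry_run=True):
    """Backfill stale replicas from the newest visible version of each key.

    read_versioned(key, replica) -> (value, version) and
    write_to_replica(key, value, version, replica) are hypothetical hooks.
    Guardrails: default to dry_run, log every intended write, cap the batch size.
    """
    MAX_REPAIRS = 1000  # hard cap so a bad run cannot rewrite the whole keyspace
    planned = []
    for key in keys:
        copies = {r: read_versioned(key, r) for r in replicas}
        newest_value, newest_version = max(copies.values(), key=lambda vv: vv[1])
        for replica, (value, version) in copies.items():
            if version < newest_version:
                planned.append((key, replica, version, newest_version))
                if not dry_run:
                    write_to_replica(key, newest_value, newest_version, replica)
        if len(planned) >= MAX_REPAIRS:
            break
    for key, replica, old_v, new_v in planned:
        print(f"{'PLAN' if dry_run else 'REPAIRED'} {key} on {replica}: {old_v} -> {new_v}")
    return planned
```

Running it once with `dry_run=True` and reviewing the planned writes before repairing limits the blast radius if the version comparison turns out to be wrong for your data model.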

Trade-offs During Incident Response

- Discuss how choices affect consistency, availability, and latency (CAP, quorum sizes, read paths), and how you’d communicate and choose among them during the incident (a worked quorum example for N = 3 follows).
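
To make the quorum trade-off tangible in incident comms, here is a small worked example for N = 3; the notes are illustrative generalizations, not measurements from any system.

```python
# Quorum options for N = 3 replicas and their rough effect during the incident.
N = 3
options = [
    # (R, W, overlapping quorums?, rough effect)
    (1, 1, 1 + 1 > N, "fastest reads/writes; stale reads are possible by design"),
    (1, 3, 1 + 3 > N, "fast reads, but writes block if any replica is down"),
    (2, 2, 2 + 2 > N, "overlapping quorums; tolerates one slow or failed replica"),
    (3, 1, 3 + 1 > N, "reads block on the slowest or partitioned replica"),
]
for r, w, overlaps, note in options:
    print(f"R={r} W={w} R+W>N={overlaps}: {note}")
```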