Distributed Database Incident: Inconsistent Replicas
You are on-call for a distributed database that serves production traffic. Dashboards show data inconsistencies across replicas (stale reads, divergent row versions, missing writes). Assume a general setup that could be either leader-based (e.g., Raft/Paxos) or leaderless (e.g., Dynamo-style) with N replicas per shard.
Tasks
Likely Causes

- Enumerate plausible root causes for replica divergence (a probe for confirming and scoping divergence is sketched after this list), such as:
  - Replication lag and backpressure
  - Failed or flapping leader election / split brain
  - Clock skew / lease/timestamp issues
  - Network partition or asymmetric packet loss
  - Write conflicts and concurrent updates
  - Anti-entropy/repair pipeline failures (e.g., Merkle trees, hinted handoff)
  - Storage/WAL corruption, fsync/config issues, disk I/O saturation
  - Client read-consistency misconfiguration (reading from followers without safeguards)
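
Before chasing a specific cause, it helps to confirm and scope the divergence itself. Below is a minimal Python sketch, assuming a hypothetical `read_from_replica(key, replica)` debug read that hits one replica directly and bypasses quorum/leader coordination; it groups replicas by a digest of what they return for a sample of keys, so more than one group localizes the divergence.

```python
import hashlib
from collections import defaultdict

def replica_digests(keys, replicas, read_from_replica):
    """Map each replica to a digest of the (key, version, value) triples it returns.

    read_from_replica(key, replica) is a hypothetical debug read that hits one
    replica directly, bypassing quorum/leader coordination.
    """
    digests = {}
    for replica in replicas:
        h = hashlib.sha256()
        for key in sorted(keys):
            value, version = read_from_replica(key, replica)
            h.update(f"{key}|{version}|{value}".encode())
        digests[replica] = h.hexdigest()
    return digests

def diverging_replicas(digests):
    """Group replicas by digest; more than one group means divergence."""
    groups = defaultdict(list)
    for replica, digest in digests.items():
        groups[digest].append(replica)
    return groups
```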

Investigation Plan (Step-by-Step)

- Present a prioritized, time-bounded plan to triage, scope, and diagnose the incident while minimizing blast radius. Include a decision tree for leader-based vs leaderless designs (one possible shape is sketched below).
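
As one possible shape for that decision tree, here is a hedged Python sketch; every signal name and threshold (`leaders_seen`, `max_clock_skew_ms`, the 30 s lag cutoff, and so on) is a hypothetical placeholder rather than a metric from any particular database.

```python
from dataclasses import dataclass

@dataclass
class ShardSignals:
    """Hypothetical triage inputs gathered from dashboards for one shard."""
    leader_based: bool            # Raft/Paxos-style vs. Dynamo-style
    leaders_seen: int             # distinct nodes claiming leadership (leader-based only)
    max_replication_lag_s: float
    max_clock_skew_ms: float
    hinted_handoff_backlog: int   # leaderless only: undelivered hints
    repair_job_healthy: bool      # leaderless only: is anti-entropy running?

def triage(s: ShardSignals) -> str:
    """Return the first hypothesis to investigate, ordered roughly by blast radius."""
    if s.max_clock_skew_ms > 500:
        return "clock skew: check NTP/chrony before trusting timestamps or leases"
    if s.leader_based:
        if s.leaders_seen > 1:
            return "possible split brain: compare terms/epochs, fence the stale leader"
        if s.max_replication_lag_s > 30:
            return "follower lag: inspect apply rate, disk I/O, and snapshot transfer"
        return "check the client read path: follower reads enabled without read-index/lease checks?"
    else:
        if not s.repair_job_healthy:
            return "anti-entropy/repair pipeline down: divergence will not self-heal"
        if s.hinted_handoff_backlog > 0:
            return "hinted handoff backlog: writes missing on some replicas until replay"
        return "check R/W quorum settings: is R + W > N actually enforced?"
```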

Telemetry and Logs to Inspect

- Specify the metrics, traces, and logs you would examine at each layer (an apply-lag and clock-skew calculation is sketched after this list):
  - Consensus/coordination layer (e.g., term/epoch, commit/applied index)
  - Replication layer (lag, queue depth, apply rate)
  - Storage/engine layer (WAL/LSN, compaction, fsync latency)
  - Network/host (loss, latency, CPU, GC pauses, disk)
  - Time sync (NTP/chrony skew)
- Include what you’d look for in client-side telemetry (error rates, read staleness).
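
Metric names differ across systems, so the following is an illustration only: a sketch that computes the two derived numbers most often needed in triage, per-replica apply lag (commit index minus applied index) and worst-case pairwise clock skew, from hypothetical per-node samples.

```python
from typing import Dict

def apply_lag(commit_index: Dict[str, int], applied_index: Dict[str, int]) -> Dict[str, int]:
    """Entries committed by the consensus layer but not yet applied locally.

    A steadily growing value on one replica points at the apply/storage path
    (slow fsync, compaction stalls) rather than the network.
    """
    return {node: commit_index[node] - applied_index[node] for node in commit_index}

def max_pairwise_skew_ms(node_clocks_ms: Dict[str, float]) -> float:
    """Worst-case clock disagreement across nodes, from one synchronized sample.

    Compare this against lease durations and any timestamp-ordering assumptions.
    """
    return max(node_clocks_ms.values()) - min(node_clocks_ms.values())

# Example with made-up samples:
lag = apply_lag({"r1": 1200, "r2": 1200, "r3": 1200},
                {"r1": 1200, "r2": 1199, "r3": 950})
skew = max_pairwise_skew_ms({"r1": 1_000.0, "r2": 1_003.5, "r3": 980.0})
print(lag)   # {'r1': 0, 'r2': 1, 'r3': 250}  -> r3's apply path is behind
print(skew)  # 23.5 ms
```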

Verify Consistency Guarantees

- Describe how you would verify:
  - Read-after-write behavior
  - Quorum read/write behavior (R/W quorums with N replicas)
  - Monotonic reads and linearizability (when applicable)
- Outline small, targeted experiments/tests you would run in prod or a canary (one such probe is sketched below).
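
One concrete experiment is a canary probe against a test keyspace. The sketch below assumes a hypothetical client handle exposing `write(key, value)` and `read(key)` (not any real driver API); it measures the stale-read rate for read-after-write and includes the R + W > N overlap check for the quorum configuration.

```python
import time
import uuid

def quorum_overlaps(n: int, r: int, w: int) -> bool:
    """R + W > N guarantees every read quorum intersects the latest write quorum."""
    return r + w > n

def read_after_write_probe(client, attempts: int = 100) -> float:
    """Write a unique value, read it back immediately, and report the stale-read rate.

    `client` is a hypothetical handle with write(key, value) and read(key) methods;
    wire this to your system's canary/test keyspace, not production data.
    """
    stale = 0
    for _ in range(attempts):
        key = f"canary:{uuid.uuid4()}"
        token = str(uuid.uuid4())
        client.write(key, token)
        if client.read(key) != token:
            stale += 1
        time.sleep(0.05)  # avoid hammering the cluster during an incident
    return stale / attempts

# Quorum sanity check for N = 3:
assert quorum_overlaps(3, r=2, w=2)       # overlapping quorums
assert not quorum_overlaps(3, r=1, w=1)   # stale reads are allowed by design
```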

Mitigation and Repair Playbook

- Propose immediate mitigations and longer-term repairs, such as:
  - Adjust read/write quorums; pin reads to the leader; disable follower reads
  - Fencing tokens/epochs; tighten lease checks; disable elections on flapping nodes
  - NTP checks and remediations
  - Backfill/repair jobs (anti-entropy, snapshot restore, rebuild)
  - Throttling/backpressure and isolation of lagging replicas
- Include validation/guardrails to avoid further data loss or downtime (a guarded repair sketch follows).
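
For the backfill/repair step, here is a guarded sketch of a key-by-key repair loop. All hooks (`read_versioned`, `write_to_replica`) are hypothetical placeholders for your system's repair interfaces, and picking the "newest" copy by version assumes versions are totally ordered (e.g., LSN-like), which does not hold for concurrent conflicting writes. The guardrails are a dry-run default and a hard cap on repairs per run.

```python
def repair_keys(keys, replicas, read_versioned, write_to_replica, dry_run=True):
    """Backfill stale replicas from the newest visible version of each key.

    read_versioned(key, replica) -> (value, version) and
    write_to_replica(key, value, version, replica) are hypothetical hooks.
    Guardrails: default to dry_run, log every intended write, cap the batch size.
    """
    MAX_REPAIRS = 1000  # hard cap so a bad run cannot rewrite the whole keyspace
    planned = []
    for key in keys:
        copies = {r: read_versioned(key, r) for r in replicas}
        newest_value, newest_version = max(copies.values(), key=lambda vv: vv[1])
        for replica, (value, version) in copies.items():
            if version < newest_version:
                planned.append((key, replica, version, newest_version))
                if not dry_run:
                    write_to_replica(key, newest_value, newest_version, replica)
        if len(planned) >= MAX_REPAIRS:
            break
    for key, replica, old_v, new_v in planned:
        print(f"{'PLAN' if dry_run else 'REPAIRED'} {key} on {replica}: {old_v} -> {new_v}")
    return planned
```

Running it once with `dry_run=True` and reviewing the planned writes before repairing limits the blast radius if the version comparison turns out to be wrong for your data model.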

Trade-offs During Incident Response

- Discuss how choices affect consistency, availability, and latency (CAP, quorum sizes, read paths), and how you’d communicate and choose among them during the incident (a worked quorum example for N = 3 follows).
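
To make the quorum trade-off tangible in incident comms, here is a small worked example for N = 3; the notes are illustrative generalizations, not measurements from any system.

```python
# Quorum options for N = 3 replicas and their rough effect during the incident.
N = 3
options = [
    # (R, W, overlapping quorums?, rough effect)
    (1, 1, 1 + 1 > N, "fastest reads/writes; stale reads are possible by design"),
    (1, 3, 1 + 3 > N, "fast reads, but writes block if any replica is down"),
    (2, 2, 2 + 2 > N, "overlapping quorums; tolerates one slow or failed replica"),
    (3, 1, 3 + 1 > N, "reads block on the slowest or partitioned replica"),
]
for r, w, overlaps, note in options:
    print(f"R={r} W={w} R+W>N={overlaps}: {note}")
```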