
Diagnose distributed database inconsistency

Last updated: Mar 29, 2026

Quick Overview

This question evaluates understanding of distributed-systems consistency, replication and replica divergence, consensus and replication protocols (leader-based and leaderless), and operational incident-response skills such as telemetry interpretation and repair prioritization.



Company: Google

Role: Software Engineer

Category: System Design

Difficulty: hard

Interview Round: Onsite

A distributed database is exhibiting data inconsistencies across replicas. Enumerate likely causes (e.g., replication lag, failed or flapping leader election, clock skew, network partition, write conflicts, anti-entropy failures) and provide a step-by-step investigation plan. Describe what telemetry and logs you would inspect, how you would verify consistency guarantees (read-after-write, quorum behavior), and how you would mitigate and repair (read/write quorums, fencing tokens, NTP checks, backfill/repair jobs, throttling). Discuss trade-offs among consistency, availability, and latency during incident response.
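Fencing tokens are one of the mitigations named above. A minimal sketch of the idea, assuming a storage node that remembers the highest token it has accepted and rejects writes from holders of older tokens (for example a deposed leader). The class name `FencedStore` and its methods are hypothetical and not taken from any particular database.

```python
class FencedStore:
    """Toy storage node that rejects writes carrying a stale fencing token."""

    def __init__(self):
        self.highest_token = 0  # highest fencing token accepted so far
        self.data = {}

    def write(self, token: int, key: str, value: str) -> bool:
        # A token lower than one already seen means the writer holds an
        # expired lease or belongs to an old leadership epoch.
        if token < self.highest_token:
            return False
        self.highest_token = token
        self.data[key] = value
        return True


store = FencedStore()
accepted_new = store.write(34, "row:1", "from-new-leader")
accepted_old = store.write(33, "row:1", "from-old-leader")
```

In this sketch the write with token 33 is rejected because token 34 was already accepted, so the old leader cannot overwrite newer data after a flapping election.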

Sep 6, 2025, 12:00 AM

Distributed Database Incident: Inconsistent Replicas

You are on-call for a distributed database that serves production traffic. Dashboards show data inconsistencies across replicas (stale reads, divergent row versions, missing writes). Assume a general setup that could be either leader-based (e.g., Raft/Paxos) or leaderless (e.g., Dynamo-style) with N replicas per shard.
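To scope symptoms like divergent row versions and missing writes, one option is to diff per-replica version maps for a sample of keys. The sketch below assumes each replica can export a mapping from key to an integer version. The function name `diff_replicas` and the input shape are illustrative assumptions, not a real database API.

```python
def diff_replicas(replicas):
    """Report keys whose versions differ across replicas.

    replicas: dict of replica name -> dict of key -> integer version.
    A key absent from a replica is treated as a missing write there.
    """
    all_keys = set().union(*(rows.keys() for rows in replicas.values()))
    report = {}
    for key in sorted(all_keys):
        versions = {name: rows.get(key) for name, rows in replicas.items()}
        if len(set(versions.values())) > 1:
            missing = sorted(n for n, v in versions.items() if v is None)
            report[key] = {"versions": versions, "missing_on": missing}
    return report


sample = {
    "replica-a": {"k1": 3, "k2": 1},
    "replica-b": {"k1": 3},
    "replica-c": {"k1": 2, "k2": 1},
}
report = diff_replicas(sample)
```

Here `k1` is flagged as divergent (replica-c is behind) and `k2` is flagged as missing on replica-b. In a real incident the sample would be bounded in size to avoid adding load to already lagging replicas.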

Tasks

  1. Likely Causes
    • Enumerate plausible root causes for replica divergence, such as:
      • Replication lag and backpressure
      • Failed or flapping leader election / split brain
      • Clock skew and lease/timestamp issues
      • Network partition or asymmetric packet loss
      • Write conflicts and concurrent updates
      • Anti-entropy/repair pipeline failures (e.g., Merkle trees, hinted handoff)
      • Storage/WAL corruption, fsync/config issues, disk I/O saturation
      • Client read-consistency misconfiguration (reading from followers without safeguards)
  2. Investigation Plan (Step-by-step)
    • Present a prioritized, time-bounded plan to triage, scope, and diagnose the incident while minimizing blast radius. Include a decision tree for leader-based vs leaderless designs.
  3. Telemetry and Logs to Inspect
    • Specify the metrics, traces, and logs you would examine at:
      • Consensus/coordination layer (e.g., term/epoch, commit/applied index)
      • Replication layer (lag, queue depth, apply rate)
      • Storage/engine layer (WAL/LSN, compaction, fsync latency)
      • Network/host (loss, latency, CPU, GC pauses, disk)
      • Time sync (NTP/Chrony skew)
    • Include what you’d look for in client-side telemetry (error rates, read staleness).
  4. Verify Consistency Guarantees
    • Describe how you would verify:
      • Read-after-write behavior
      • Quorum read/write behavior (R/W quorums with N replicas)
      • Monotonic reads and linearizability (when applicable)
    • Outline small, targeted experiments/tests you would run in prod or a canary.
  5. Mitigation and Repair Playbook
    • Propose immediate mitigations and longer-term repairs, such as:
      • Adjust read/write quorums; pin reads to leader; disable follower reads
      • Fencing tokens/epochs; lease check tightening; disable elections on flapping nodes
      • NTP checks and remediations
      • Backfill/repair jobs (anti-entropy, snapshot restore, rebuild)
      • Throttling/backpressure and isolation of lagging replicas
    • Include validation/guardrails to avoid further data loss or downtime.
  6. Trade-offs During Incident Response
    • Discuss how choices affect consistency, availability, and latency (CAP, quorum sizes, read paths), and how you’d communicate and choose among them during the incident.
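The quorum trade-off in the tasks above can be made concrete with the usual overlap rule for Dynamo-style systems: every read quorum intersects every write quorum when R + W > N. A minimal sketch, assuming N replicas per shard with read quorum R and write quorum W. The helper names are hypothetical.

```python
def quorums_overlap(n: int, r: int, w: int) -> bool:
    """True when every read quorum intersects every write quorum (R + W > N)."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must each be between 1 and n")
    return r + w > n


def tolerated_failures(n: int, r: int, w: int) -> int:
    """Replicas that can be down while both reads and writes still reach quorum."""
    return n - max(r, w)


# N=3 with R=2, W=2: overlapping quorums, one replica may be down.
strict = (quorums_overlap(3, 2, 2), tolerated_failures(3, 2, 2))
# N=3 with R=1, W=1: lowest latency, but reads may miss the latest write.
loose = (quorums_overlap(3, 1, 1), tolerated_failures(3, 1, 1))
```

Raising R or W during an incident restores overlap at the cost of higher latency and lower availability, since fewer replica failures can be tolerated. Overlap alone does not guarantee linearizability, so it is one check among several.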
