Distributed Storage, Replication, and Consistency

What's being tested

Interviewers are probing whether you can design stateful distributed systems that preserve correctness under failures, concurrency, and scale. The common thread is deciding what must be strongly consistent, what can be eventually consistent, and how replication, partitioning, ordering, retries, and recovery interact. Google cares because many SWE systems are not just stateless services behind load balancers; they maintain durable state, coordinate work, migrate data, or serve logs/files where silent corruption is worse than downtime. A strong answer makes explicit guarantees, failure assumptions, and tradeoffs instead of saying “use `Kafka`” or “use consensus” as a black box.

Core knowledge

Replication improves durability and availability by storing multiple copies of data, commonly via leader-follower replication, multi-leader replication, or leaderless replication. Leader-follower is simpler for ordering writes; leaderless systems like `Dynamo`-style stores rely on quorums and conflict resolution.
Quorum consistency uses read quorum $R$ , write quorum $W$ , and replica count $N$ . If $R + W > N$ , reads intersect the latest successful write quorum, giving stronger consistency assuming no sloppy quorum or stale conflict resolution. Typical settings are $N=3$ , $W=2$ , $R=2$ .
Consensus protocols such as Raft, Paxos, and systems like `ZooKeeper`/`etcd` elect leaders and replicate logs with majority agreement. Use consensus for metadata, leases, membership, and shard leadership; avoid putting every high-volume data write through a single global consensus group.
Consistency models must be named precisely: linearizability means operations appear instantaneous in real-time order; serializability means transactions are equivalent to some serial order; snapshot isolation gives reads from a consistent version but allows write skew; eventual consistency converges if writes stop.
Partitioning splits data by key range, hash, tenant, or time. Hash partitioning balances load but hurts range scans; range partitioning supports ordered reads but risks hot partitions. For append-heavy logs, partition by topic and partition id, then segment by offset/time.
Ordering guarantees are usually scoped, not global. Per-key ordering or per-partition ordering is achievable with a single leader per partition; global ordering across millions of keys requires a sequencer or timestamp oracle and becomes a throughput bottleneck.
Append-only logs are the backbone of systems like `Kafka`, `Pulsar`, and database WALs. They assign monotonically increasing offsets, batch fsyncs, replicate segments, and serve consumers by offset. Retention, compaction, and indexing decide whether the log is an event stream, durable queue, or changelog.
Idempotency is mandatory whenever clients retry after timeouts. Stripe-style idempotency keys, deterministic object names, compare-and-swap versions, and dedupe tables let servers safely convert “at least once” delivery into effectively-once side effects for a scoped operation.
Snapshots and change data capture matter in migrations. A consistent snapshot records state at time $T$ , while CDC streams changes after $T$ from a database log such as `MySQL` binlog or `Postgres` WAL. Correctness requires applying changes in commit order, at least per primary key.
Distributed transactions coordinate atomic changes across services or shards. Two-phase commit gives atomicity but can block if the coordinator fails; sagas trade atomicity for compensating actions; outbox/inbox patterns persist messages with local state to avoid dual-write bugs.
Failure recovery is designed around explicit failure modes: leader crash, follower lag, disk loss, network partition, duplicate request, partial write, and clock skew. Durable systems use write-ahead logs, checksums, fencing tokens, epochs, and replay from last committed offset.
Garbage collection and retention need correctness rules. In deduplicated storage, deleting bytes requires reference counts or mark-and-sweep over metadata; in logs, deleting segments must respect consumer lag, legal retention, compaction checkpoints, and replicated durability.

Worked example

For Design distributed log storage service, start by framing the API and guarantees: “Are logs append-only? Do producers require acknowledged durability? Is ordering per topic-partition sufficient? What retention and throughput targets should I assume?” A strong candidate would declare assumptions such as millions of writes per second, records addressed by (topic, partition, offset), replication factor 3, and per-partition ordering rather than global ordering. The answer can be organized around four pillars: partitioning and routing, write path and replication, read path and consumer offsets, and recovery/retention.

The write path would send each partition to a leader broker, append records to segment files, batch fsyncs, replicate to followers, and acknowledge when the configured in-sync replica quorum has persisted the batch. The read path would serve sequential reads by offset, using sparse indexes from offset to file position, with consumers storing committed offsets externally or in an internal compacted topic. One explicit tradeoff is latency versus durability: acknowledging after leader append is fast but risks data loss on leader failure, while `acks=all` with a min in-sync replica count raises latency but preserves committed records through single-node failure. Recovery should mention leader epochs or fencing so an old leader cannot accept writes after a partition. Close by saying that, with more time, you would detail rebalancing partitions, cross-zone replication, tiered storage, and exactly how retention interacts with slow consumers.

A second angle

For Design relational-to-NoSQL migration pipeline, the same consistency ideas appear but the goal is data equivalence rather than serving an append API. The core design is a consistent relational snapshot plus CDC replay, with idempotent writes into the NoSQL target and verification by counts, checksums, or sampled row comparisons. Ordering is scoped by primary key or aggregate root: if updates to the same user arrive out of order, the target may regress to stale state. Unlike the log service, where the log itself is the product, the migration pipeline uses a log as a correctness mechanism to bridge two storage systems. The interviewer may push on cutover: dual writes are tempting but dangerous, so a safer plan is snapshot, CDC catch-up, read shadowing, validation, then controlled traffic switch with rollback.

Common pitfalls

Pitfall: Treating replication as automatic correctness.

Saying “replicate to three nodes” is incomplete unless you define when a write is committed, how reads choose replicas, and what happens during leader failover. A better answer states the consistency target, quorum or leader rules, and how stale leaders are fenced with epochs, terms, or leases.

Pitfall: Confusing at-least-once delivery with exactly-once effects.

Retries, CDC reconnects, worker crashes, and coordinator timeouts all create duplicates. The right move is not to promise magical exactly-once networking; use idempotency keys, sequence numbers, version checks, or transactional outbox/inbox tables to make repeated processing safe.

Pitfall: Over-indexing on one named system.

“Use `Kafka`, ”use `Spanner`, or “use `S3` can sound practical but shallow if you cannot explain the underlying mechanics. Interviewers want to see partitioning, replication, ordering, compaction, failure recovery, and consistency decisions, even if you eventually map them to a real managed service.

Connections

Expect pivots into CAP/PACELC, distributed locking and leases, database isolation levels, consistent hashing, backpressure, and disaster recovery. For scheduler-style designs, the same principles show up as task state machines, lease-based worker ownership, idempotent execution, retry queues, and dead-letter handling.

What's being tested

Core knowledge

Worked example

A second angle

Common pitfalls

Connections

Further reading

Practice questions

Related concepts