Distributed Systems Consistency, Reliability, And Observability
Asked of: Software Engineer
Last updated
What's being tested
Interviewers are probing whether you can design distributed systems that remain correct and debuggable under concurrency, partial failure, retries, replication lag, and uneven load. For Google SWE interviews, the bar is not “draw boxes,” but reasoning clearly about consistency guarantees, failure semantics, operational reliability, and observability tradeoffs at large scale. Strong answers name the guarantee they provide, the guarantee they intentionally do not provide, and how they would detect when the system violates expectations. Expect follow-ups around hot keys, duplicate delivery, clock skew, regional outages, `p99` latency, and what happens when dependencies degrade.
Core knowledge
-
Consistency models define what clients may observe after reads and writes. Know the difference between linearizability, sequential consistency, read-your-writes, eventual consistency, and monotonic reads. In design interviews, explicitly choose the weakest model that satisfies product correctness.
-
CAP theorem is useful only if made concrete: during a network partition, a replicated service must choose between availability and strong consistency. For example, a distributed rate limiter may prefer availability with bounded over-admission, while account balance updates usually require consistency.
-
Quorums are the standard replication tradeoff: with replicas, write quorum , and read quorum , overlapping reads/writes require . For
`N=3`,`W=2`,`R=2`tolerates one replica failure while preserving quorum reads. -
Consensus algorithms like Raft and Paxos provide a replicated log for leader election, configuration changes, and strongly consistent metadata. Use them for small, critical control-plane state, not for every high-throughput data-plane operation.
-
Idempotency is mandatory for retries. A client-supplied
`Idempotency-Key`, dedupe table, or operation ID lets the server convert “maybe succeeded” into safe retry behavior. Store idempotency records with request hash, response, status, and TTL. -
At-least-once delivery means duplicates are expected; exactly-once semantics usually means “effectively once” through idempotent consumers plus atomic state transitions. A queue like
`Kafka`or`Pub/Sub`can redeliver after consumer crashes, offset races, or ack timeouts. -
Rate limiting algorithms have different consistency and burst properties. Token bucket allows bursts up to bucket size; leaky bucket smooths traffic; sliding window log is accurate but memory-heavy; sliding window counter approximates with lower memory. Distributed limiters often accept bounded error.
-
Partitioning controls scalability and hot spots. Consistent hashing reduces remapping when nodes change, but does not eliminate hot keys; use virtual nodes, key salting, local aggregation, or special-case sharding for tenants with extreme QPS.
-
Replication strategy shapes availability and latency. Leader-follower replication simplifies writes but can bottleneck and create stale reads; multi-leader improves regional writes but introduces conflicts; leaderless replication uses quorums and repair but complicates conflict resolution.
-
Failure handling should be explicit: timeouts, retries with exponential backoff and jitter, circuit breakers, dead-letter queues, and graceful degradation. Avoid retry storms by bounding retry budgets and propagating backpressure instead of letting every layer independently retry.
-
Observability requires three complementary signals: metrics for aggregate health, logs for discrete events, and distributed traces for request paths. Useful metrics include
`availability`,`error_rate`,`p50/p95/p99_latency`, queue depth, retry rate, dedupe hit rate, and replica lag. -
SLOs turn reliability into engineering decisions. If a service targets
`99.9%`monthly availability, the error budget is about 43.2 minutes/month. Tie design choices to SLOs: synchronous quorum writes may protect correctness but consume latency budget.
Worked example
For Design at-least-once notification delivery, a strong candidate starts by clarifying notification types, scale, ordering needs, acceptable duplicate rate, retry horizon, and whether “delivered” means accepted by an external provider or actually seen by a user. They might declare assumptions: `100M` notifications/day, multi-channel delivery, user-level idempotency required, and no strict global ordering. The answer should be organized around four pillars: ingestion API, durable queueing, delivery workers, and idempotent state tracking.
The ingestion API accepts a `notification_id` or generates one, validates tenant/user/channel, persists a send request, and enqueues work only after durable storage succeeds. Workers consume from `Kafka`, `Pub/Sub`, or a sharded internal queue, call providers with timeouts, and ack only after recording the outcome. Retries use exponential backoff with jitter and move poison messages to a dead-letter queue after a bounded policy such as 24 hours or 10 attempts. The key design decision is that the platform guarantees at-least-once processing, while downstream side effects are made effectively once using provider idempotency keys, a dedupe table keyed by `(tenant_id, notification_id, channel)`, and atomic status transitions.
A strong candidate explicitly calls out that ordering is expensive: per-user FIFO partitions reduce concurrency and can be broken by retries, so they would only guarantee ordering where the product requires it. They would close by adding observability: dashboards for send latency, provider error rate, retry backlog, dead-letter volume, duplicate suppression rate, and traces from API request to provider call. If they had more time, they would discuss regional failover, tenant isolation, abuse controls, and replay tooling.
A second angle
For Design a distributed rate limiter, the same consistency and reliability ideas apply, but the correctness envelope is different. Instead of ensuring every message is eventually processed, the system must make fast admission decisions under concurrent requests from many frontend servers. A strongly consistent centralized counter gives accurate limits but may add latency and become a hot spot; local token buckets or sharded counters are faster but can over-admit during synchronization windows. A good answer quantifies the tolerated error, for example allowing up to 1–2% overshoot for availability, while using `Redis`, in-memory cells, or regional aggregators depending on latency and failure requirements.
Common pitfalls
Pitfall: Treating consistency as binary.
A tempting answer is “use strong consistency everywhere” or “eventual consistency is fine.” Better answers tie guarantees to specific operations: rate-limit checks may tolerate bounded staleness, idempotency records need strong uniqueness, and key-value metadata may need quorum reads while analytics counters can be eventually consistent.
Pitfall: Drawing components without failure semantics.
Many candidates sketch `API -> queue -> workers -> database` and stop there. Interviewers want to hear what happens when the API times out after a successful write, when the queue redelivers, when a worker crashes after calling an external provider, or when a replica is stale.
Pitfall: Mentioning observability as an afterthought.
“Add logs and monitoring” is too shallow. Name concrete signals and alerts: `p99` admission latency for a rate limiter, replication lag for a key-value store, dead-letter queue growth for notifications, remediation success rate for monitoring automation, and trace sampling for slow cross-service calls.
Connections
Interviewers may pivot into storage engine internals such as LSM trees versus B-trees, load balancing and hot-key mitigation, schema/API design for idempotent operations, or incident response using SLOs and error budgets. They may also ask for a deeper comparison of `Redis`, `Spanner`, `Bigtable`, `DynamoDB`, `Kafka`, or `Pub/Sub` depending on whether the design is latency-critical, storage-heavy, or queue-centric.
Further reading
-
Dynamo: Amazon’s Highly Available Key-value Store — foundational paper on consistent hashing, quorums, vector clocks, and availability-first storage.
-
The Raft Consensus Algorithm — approachable consensus paper for replicated logs, leader election, and strong metadata consistency.
-
Google SRE Book — practical treatment of SLOs, error budgets, monitoring, alerting, and reliable production operations.
Featured in interview prep guides
Practice questions
- Design distributed transactions protocolGoogle · Software Engineer · Technical Screen · hard
- Explain SLI/SLO/SLA and design monitoringGoogle · Software Engineer · Onsite · hard
- Design distributed log storage serviceGoogle · Software Engineer · Technical Screen · hard
- Design viewing history and resume serviceGoogle · Software Engineer · Technical Screen · hard
- Design at-least-once notification deliveryGoogle · Software Engineer · Technical Screen · medium
- Troubleshoot CPU, latency, and DNS issuesGoogle · Software Engineer · Onsite · Medium
- Design a distributed rate limiterGoogle · Software Engineer · Onsite · hard
- Diagnose distributed database inconsistencyGoogle · Software Engineer · Onsite · hard
- Design a key-value storeGoogle · Software Engineer · Technical Screen · hard
- Design autonomous cloud monitoring and remediationGoogle · Software Engineer · Technical Screen · hard
- Design quota enforcement for high concurrencyGoogle · Software Engineer · Technical Screen · hard
- Design school-to-guardian messaging with acknowledgmentsGoogle · Software Engineer · Technical Screen · hard
Related concepts
- Distributed Systems Consistency And Low-Latency DesignSystem Design
- Distributed Storage, Replication, and ConsistencySystem Design
- Distributed Systems Correctness And IdempotencySystem Design
- Fault Tolerance, Idempotency, And Concurrency ControlSystem Design
- Scalable Distributed System ArchitectureSystem Design
- Distributed System Design For Ledgers And CountersSystem Design