How do I approach System Design interview questions?

System Design questions require understanding of core concepts and practice. PracHub provides solutions with explanations to help you master system design interviews.

What difficulty level is this interview question?

This is a medium difficulty System Design question, commonly asked during Onsite rounds at Amazon.

What role is this question designed for?

This question is commonly asked for Software Engineer candidates at Amazon during technical interviews.

High Availability and Failover for a Primary-Replica Database

Q: High Availability and Failover for a Primary-Replica Database

This question evaluates a candidate's understanding of database high-availability design, including replication semantics, failure detection, and failover orchestration. It tests practical application of concepts like RTO, RPO, and split-brain avoidance in a primary-replica architecture, commonly asked to assess system design and reliability engineering skills.

High Availability and Failover for a Primary–Replica Database

Your team owns the relational database platform that backs a high-traffic transactional service. The deployment is a single primary that accepts all writes, plus several read replicas that serve read traffic and stay up to date via replication from the primary. Other teams build on top of this platform and expect it to "just stay up."

Design how this platform survives a failure of the primary. Concretely, cover how the system detects that the primary is unhealthy, how it fails over to a replica with minimal downtime and bounded data loss, and how it recovers the failed node and returns to a healthy steady state.

This is a reliability / high-availability design, not a greenfield product. Spend your time on replication semantics, detection, failover orchestration, and recovery — not on application features.

Constraints & Assumptions

Read-heavy transactional workload: roughly 50k reads/s and 5k writes/s at peak; a single primary can hold the write load, but losing it stops all writes.
Target availability is "a few nines": aim for an RTO (recovery time objective) on the order of tens of seconds for an unplanned primary loss, and a small, explicitly stated RPO (recovery point objective).
Replicas are deployed across multiple availability zones in one region. Treat cross-region/disaster recovery as out of scope unless raised as a follow-up.
The database engine supports physical/log-based streaming replication (e.g. write-ahead-log streaming with a monotonic log position such as an LSN, or GTID-based replication). You may pick a concrete engine but keep the reasoning general.
Clients connect through a layer you control (a proxy, service-discovery endpoint, or virtual IP) rather than hard-coding the primary's address.

Clarifying Questions to Ask

What RPO is acceptable — is zero data loss required, or is losing the last few unreplicated transactions tolerable in exchange for lower write latency?
What write-latency budget do we have? This decides whether replication can be synchronous/semi-synchronous or must be asynchronous.
Is fully automated failover desired, or is a human approval step acceptable (or required) for promotion?
How are clients expected to reconnect after a failover — can we mandate a proxy/endpoint, or must we support clients that cache a primary IP?
How many replicas, and across how many availability zones? Are any replicas reserved purely for HA (not serving reads)?
Are there downstream consumers of the replication stream (CDC, analytics, search indexers) whose ordering or position we must preserve across a failover?

Part 1 — Replication topology, routing, and durability knobs

Lay out the steady-state architecture: where writes go, how replicas stay current, and how reads and writes are routed to the right node. Then explain the durability/latency trade-off you would choose for replication (synchronous vs. semi-synchronous vs. asynchronous) and why, tying it back to the RPO you assumed.

Clarifying Questions for this Part

Does the read path require read-your-writes or any bounded-staleness guarantee, or are replicas allowed to lag freely?

What This Part Should Cover Premium

Part 2 — Detecting that the primary is down

Design the failure-detection mechanism. The hard part is not noticing an obviously dead node — it is deciding the primary is really down without being fooled by a transient blip, a slow GC pause, or a network partition that would lead to two primaries (split-brain).

What This Part Should Cover Premium

Part 3 — Failover and recovery

Once the primary is declared dead, walk through promoting a replica and returning the system to health: choosing which replica to promote, preventing the old primary from corrupting state, redirecting writes and re-pointing the remaining replicas, bounding/accounting for data loss, and eventually rejoining the recovered node.

What This Part Should Cover Premium

What a Strong Answer Covers Premium

Follow-up Questions

How would you extend this to multi-region disaster recovery, and what new RPO/RTO and consistency trade-offs appear when replication crosses a WAN?
Would you make failover fully automatic or keep a human in the loop, and how does that choice change as your false-positive rate and blast radius change?
A replica is promoted but was lagging by 8 seconds at the moment of failure. What exactly is lost, how do you detect it, and what can you tell affected clients?
How do you perform an online schema migration or a planned primary switchover (maintenance) reusing the same machinery, with zero or near-zero downtime?

High Availability and Failover for a Primary–Replica Database

This is a reliability / high-availability design, not a greenfield product. Spend your time on replication semantics, detection, failover orchestration, and recovery — not on application features.

Constraints & Assumptions

Read-heavy transactional workload: roughly 50k reads/s and 5k writes/s at peak; a single primary can hold the write load, but losing it stops all writes.
Target availability is "a few nines": aim for an RTO (recovery time objective) on the order of tens of seconds for an unplanned primary loss, and a small, explicitly stated RPO (recovery point objective).
Replicas are deployed across multiple availability zones in one region. Treat cross-region/disaster recovery as out of scope unless raised as a follow-up.
The database engine supports physical/log-based streaming replication (e.g. write-ahead-log streaming with a monotonic log position such as an LSN, or GTID-based replication). You may pick a concrete engine but keep the reasoning general.
Clients connect through a layer you control (a proxy, service-discovery endpoint, or virtual IP) rather than hard-coding the primary's address.

Clarifying Questions to Ask

What RPO is acceptable — is zero data loss required, or is losing the last few unreplicated transactions tolerable in exchange for lower write latency?
What write-latency budget do we have? This decides whether replication can be synchronous/semi-synchronous or must be asynchronous.
Is fully automated failover desired, or is a human approval step acceptable (or required) for promotion?
How are clients expected to reconnect after a failover — can we mandate a proxy/endpoint, or must we support clients that cache a primary IP?
How many replicas, and across how many availability zones? Are any replicas reserved purely for HA (not serving reads)?
Are there downstream consumers of the replication stream (CDC, analytics, search indexers) whose ordering or position we must preserve across a failover?

Part 1 — Replication topology, routing, and durability knobs

Clarifying Questions for this Part

Does the read path require read-your-writes or any bounded-staleness guarantee, or are replicas allowed to lag freely?

What This Part Should Cover Premium

Part 2 — Detecting that the primary is down

What This Part Should Cover Premium

Part 3 — Failover and recovery

What This Part Should Cover Premium

What a Strong Answer Covers Premium

Follow-up Questions

How would you extend this to multi-region disaster recovery, and what new RPO/RTO and consistency trade-offs appear when replication crosses a WAN?
Would you make failover fully automatic or keep a human in the loop, and how does that choice change as your false-positive rate and blast radius change?
A replica is promoted but was lagging by 8 seconds at the moment of failure. What exactly is lost, how do you detect it, and what can you tell affected clients?
How do you perform an online schema migration or a planned primary switchover (maintenance) reusing the same machinery, with zero or near-zero downtime?

High Availability and Failover for a Primary-Replica Database

Quick Overview

High Availability and Failover for a Primary-Replica Database

High Availability and Failover for a Primary–Replica Database

Constraints & Assumptions

Clarifying Questions to Ask

Part 1 — Replication topology, routing, and durability knobs

Clarifying Questions for this Part

What This Part Should Cover Premium

Part 2 — Detecting that the primary is down

What This Part Should Cover Premium

Part 3 — Failover and recovery

What This Part Should Cover Premium

What a Strong Answer Covers Premium

Follow-up Questions

Submit Your Answer to Earn 20XP

High Availability and Failover for a Primary-Replica Database

Quick Overview

High Availability and Failover for a Primary-Replica Database

High Availability and Failover for a Primary–Replica Database

Constraints & Assumptions

Clarifying Questions to Ask

Part 1 — Replication topology, routing, and durability knobs

Clarifying Questions for this Part

What This Part Should Cover Premium

Part 2 — Detecting that the primary is down

What This Part Should Cover Premium

Part 3 — Failover and recovery

What This Part Should Cover Premium

What a Strong Answer Covers Premium

Follow-up Questions

Submit Your Answer to Earn 20XP