High Availability and Failover for a Primary-Replica Database
Company: Amazon
Role: Software Engineer
Category: System Design
Difficulty: medium
Interview Round: Onsite
# High Availability and Failover for a Primary–Replica Database
Your team owns the relational database platform that backs a high-traffic transactional service. The deployment is a **single primary** that accepts all writes, plus several **read replicas** that serve read traffic and stay up to date via replication from the primary. Other teams build on top of this platform and expect it to "just stay up."
Design how this platform survives a failure of the primary. Concretely, cover how the system **detects** that the primary is unhealthy, how it **fails over** to a replica with minimal downtime and bounded data loss, and how it **recovers** the failed node and returns to a healthy steady state.
This is a reliability / high-availability design, not a greenfield product. Spend your time on replication semantics, detection, failover orchestration, and recovery — not on application features.
### Constraints & Assumptions
- Read-heavy transactional workload: roughly 50k reads/s and 5k writes/s at peak; a single primary can hold the write load, but losing it stops all writes.
- Target availability is "a few nines": aim for an **RTO** (recovery time objective) on the order of tens of seconds for an unplanned primary loss, and a small, explicitly stated **RPO** (recovery point objective).
- Replicas are deployed across multiple availability zones in one region. Treat cross-region/disaster recovery as out of scope unless raised as a follow-up.
- The database engine supports physical/log-based streaming replication (e.g. write-ahead-log streaming with a monotonic log position such as an LSN, or GTID-based replication). You may pick a concrete engine but keep the reasoning general.
- Clients connect through a layer you control (a proxy, service-discovery endpoint, or virtual IP) rather than hard-coding the primary's address.
### Clarifying Questions to Ask
- What RPO is acceptable — is **zero data loss** required, or is losing the last few unreplicated transactions tolerable in exchange for lower write latency?
- What write-latency budget do we have? This decides whether replication can be synchronous/semi-synchronous or must be asynchronous.
- Is fully automated failover desired, or is a human approval step acceptable (or required) for promotion?
- How are clients expected to reconnect after a failover — can we mandate a proxy/endpoint, or must we support clients that cache a primary IP?
- How many replicas, and across how many availability zones? Are any replicas reserved purely for HA (not serving reads)?
- Are there downstream consumers of the replication stream (CDC, analytics, search indexers) whose ordering or position we must preserve across a failover?
### Part 1 — Replication topology, routing, and durability knobs
Lay out the steady-state architecture: where writes go, how replicas stay current, and how reads and writes are routed to the right node. Then explain the durability/latency trade-off you would choose for replication (synchronous vs. semi-synchronous vs. asynchronous) and why, tying it back to the RPO you assumed.
```hint Where to start
Separate the **data path** (who replicates from whom, and how committed transactions reach replicas) from the **control/routing path** (how a client discovers the current primary). They fail and evolve independently.
```
```hint Durability lever
The single biggest RPO lever is replication mode. Semi-synchronous replication — the primary waits for at least one replica to acknowledge each commit before returning success — bounds data loss to "transactions in flight to that one replica" while keeping the fan-out asynchronous for the rest.
```
#### Clarifying Questions for this Part
- Does the read path require read-your-writes or any bounded-staleness guarantee, or are replicas allowed to lag freely?
#### What This Part Should Cover
```premium-lock What This Part Should Cover
```
### Part 2 — Detecting that the primary is down
Design the failure-detection mechanism. The hard part is not noticing an obviously dead node — it is deciding the primary is *really* down without being fooled by a transient blip, a slow GC pause, or a network partition that would lead to two primaries (split-brain).
```hint Don't trust a single observer
A failover decision driven by one watcher is one network glitch away from a false failover. Use multiple independent observers and require agreement — a quorum/witness model — before declaring the primary dead.
```
```hint The partition trap
Ask: if the network splits, can the old primary keep accepting writes on the minority side while a replica is promoted on the majority side? Tie "who is allowed to be primary" to holding a quorum/lease so only one side can win.
```
#### What This Part Should Cover
```premium-lock What This Part Should Cover
```
### Part 3 — Failover and recovery
Once the primary is declared dead, walk through promoting a replica and returning the system to health: choosing which replica to promote, preventing the old primary from corrupting state, redirecting writes and re-pointing the remaining replicas, bounding/accounting for data loss, and eventually rejoining the recovered node.
```hint Pick the right replica
Among healthy replicas, promote the one that has applied the most of the primary's log — highest LSN / most recent GTID — to minimize lost transactions. Promoting an arbitrary replica needlessly widens the RPO.
```
```hint Stop the old primary first
Before or as you promote, **fence** the old primary (revoke its lease, block its network path, or STONITH) so it cannot accept writes after a replica takes over. Skipping fencing is the classic split-brain / data-divergence bug.
```
#### What This Part Should Cover
```premium-lock What This Part Should Cover
```
### What a Strong Answer Covers
```premium-lock What a Strong Answer Covers
```
### Follow-up Questions
- How would you extend this to **multi-region** disaster recovery, and what new RPO/RTO and consistency trade-offs appear when replication crosses a WAN?
- Would you make failover **fully automatic** or keep a human in the loop, and how does that choice change as your false-positive rate and blast radius change?
- A replica is promoted but was lagging by 8 seconds at the moment of failure. What exactly is lost, how do you detect it, and what can you tell affected clients?
- How do you perform an online **schema migration** or a planned primary switchover (maintenance) reusing the same machinery, with zero or near-zero downtime?
Quick Answer: This question evaluates a candidate's understanding of database high-availability design, including replication semantics, failure detection, and failover orchestration. It tests practical application of concepts like RTO, RPO, and split-brain avoidance in a primary-replica architecture, commonly asked to assess system design and reliability engineering skills.