Design a cloud database write path and recovery
Company: Amazon
Role: Software Engineer
Category: System Design
Difficulty: hard
Interview Round: Technical Screen
## System Design (Engine-level): Write Path + Crash Recovery
Design a core subsystem for a cloud-native relational database (Aurora-like) where **compute is separated from durable distributed storage**.
### Goal
Support transactional writes with:
- high throughput
- low commit latency
- crash recovery
- strong durability guarantees (state exactly which guarantees you provide)
### Requirements / prompts
1. **Write path**: Describe how an UPDATE/INSERT flows from compute to durable storage. Where do you place the log (WAL)?
2. **Commit protocol**: At what point is a transaction commit considered successful? Which acknowledgements are required before the client is told so?
3. **Replication & consistency**: How many replicas, what quorum rules, and how do you handle network partitions?
4. **Crash recovery**: If the compute node crashes, how does a new node recover state and resume service? What data structures/checkpoints exist?
5. **Write amplification**: Identify sources (WAL, page rewrites, compaction) and propose reductions.
6. **Scalability**: How do you scale storage and compute independently? Discuss sharding, rebalancing, and hotspot handling.
7. **Observability**: What metrics and logs would you add to detect replication lag, redo backlog, and tail latency?
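
As one illustration of prompts 2 and 3, the sketch below shows how a compute node might decide when a commit can be acknowledged under a write quorum. It is a minimal sketch under stated assumptions, not a reference answer: the class `QuorumCommitTracker`, its method names, and the 6-replica / 4-ack configuration are hypothetical choices made for the example, and it assumes each replica acknowledgement is cumulative (an ack for LSN `n` means every log record up to `n` is persisted on that replica).

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class QuorumCommitTracker:
    """Tracks per-replica acknowledged LSNs and derives the durable LSN.

    Hypothetical helper for illustration; assumes cumulative acks.
    """

    replica_ids: list[str]
    write_quorum: int
    acked: dict[str, int] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if not 1 <= self.write_quorum <= len(self.replica_ids):
            raise ValueError("write_quorum must be between 1 and the replica count")
        # Every replica starts with nothing acknowledged (LSN 0).
        self.acked = {replica_id: 0 for replica_id in self.replica_ids}

    def record_ack(self, replica_id: str, lsn: int) -> None:
        """Record that a replica has persisted all log records up to lsn."""
        # Acks may arrive out of order; never move a replica backwards.
        self.acked[replica_id] = max(self.acked[replica_id], lsn)

    def durable_lsn(self) -> int:
        """Return the highest LSN persisted by at least write_quorum replicas."""
        ordered = sorted(self.acked.values(), reverse=True)
        return ordered[self.write_quorum - 1]

    def can_ack_commit(self, commit_lsn: int) -> bool:
        """A commit may be acknowledged once its LSN is quorum-durable."""
        return commit_lsn <= self.durable_lsn()


if __name__ == "__main__":
    # Assumed configuration: 6 replicas, 4 acks required per write.
    tracker = QuorumCommitTracker([f"r{i}" for i in range(6)], write_quorum=4)
    for replica_id, lsn in [("r0", 120), ("r1", 120), ("r2", 100), ("r3", 90)]:
        tracker.record_ack(replica_id, lsn)
    print(tracker.durable_lsn(), tracker.can_ack_commit(100))
    tracker.record_ack("r3", 110)
    print(tracker.durable_lsn(), tracker.can_ack_commit(100))
```

Deriving a single durable LSN from replica acknowledgements lets the compute node acknowledge every transaction whose commit record is at or below that point, without a separate round trip per transaction. The same quantity is also a natural metric to export for prompt 7, since the gap between the newest generated LSN and the durable LSN approximates replication lag.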
Quick Answer: This question tests your ability to design the transactional write path and crash recovery of a cloud-native relational database that separates compute from durable distributed storage. It covers durability guarantees, commit protocols, replication and quorum rules, crash recovery mechanisms, write amplification, scalability, and observability.