Explain Aurora-style internals: WAL, MVCC, replication, recovery
Company: Amazon
Role: Software Engineer
Category: Software Engineering Fundamentals
Difficulty: hard
Interview Round: Technical Screen
## Database Internals Deep Dive
You are interviewing for a cloud database engine team (Aurora-like). Answer conceptually and with trade-offs.
### Questions
1. **Architecture**: How does a cloud-native database architecture (separating compute from distributed storage) differ from a traditional single-node database using local disks? What are the benefits and costs?
2. **Replication model**: Compare replication approaches:
- physical vs logical replication
- statement-based vs row-based vs log-based
- synchronous vs asynchronous
- quorum-based durability in distributed storage
When would you choose each, and what failure modes do they address?
3. **WAL / redo log**: What is the role of WAL (write-ahead logging) in durability and crash recovery? How do redo/undo relate to WAL? What must be forced to durable storage on commit?
4. **Crash recovery**: Walk through a typical recovery procedure after a crash (phases, checkpoints, replay). What changes when storage is distributed?
5. **MVCC**: Explain MVCC fundamentals:
- how reads avoid blocking writes
- snapshot visibility rules (high level)
- what metadata is stored per row/version
6. **High-concurrency writes**: Under heavy write load, how do lock conflicts happen and how can an engine reduce contention (e.g., latches vs locks, optimistic CC, partitioning/hotspot mitigation)?
7. **Trade-offs**: Discuss consistency vs latency vs throughput trade-offs in replication and commit paths (e.g., quorum write, 1-RTT vs 2-RTT commit, read-your-writes, failover behavior).
Quick Answer: This question evaluates knowledge of database storage engine and distributed database internals, covering write-ahead logging, MVCC, replication models, crash recovery, concurrency control, and trade-offs among consistency, latency, and throughput.