System Design (Engine-level): Write Path + Crash Recovery
Design a core subsystem for a cloud-native relational database (Aurora-like) where compute is separated from durable distributed storage.
Goal
Support transactional writes with:
- high throughput
- low commit latency
- crash recovery
- strong durability guarantees (clearly specify what guarantees)
Requirements / prompts
- Write path: Describe how an UPDATE/INSERT flows from compute to durable storage. Where do you place the log (WAL)?
- Commit protocol: When does a transaction commit succeed? What acknowledgements are required?
- Replication & consistency: How many replicas, what quorum rules, and how do you handle network partitions?
- Crash recovery: If the compute node crashes, how does a new node recover state and resume service? What data structures/checkpoints exist?
- Write amplification: Identify sources (WAL, page rewrites, compaction) and propose reductions.
- Scalability: How do you scale storage and compute independently? Discuss sharding, rebalancing, and hotspot handling.
- Observability: What metrics and logs would you add to detect replication lag, redo backlog, and tail latency?
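
To make the commit-protocol and quorum prompts concrete, here is a minimal sketch of one possible answer, not a prescribed design. It assumes each storage replica acknowledges a contiguous prefix of the log up to some LSN, and that a commit is acknowledged once a write quorum has persisted the transaction's commit record. The names `QuorumConfig`, `durable_lsn`, and `can_ack_commit` are hypothetical, and the 6/4/3 replica and quorum values are illustrative only.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class QuorumConfig:
    replicas: int      # N: copies of each storage segment (assumed value)
    write_quorum: int  # W: acks required before a log record counts as durable
    read_quorum: int   # R: replicas consulted when recovering state

    def is_safe(self) -> bool:
        # Any read set overlaps any write set when W + R > N;
        # any two write sets overlap when 2W > N.
        n, w, r = self.replicas, self.write_quorum, self.read_quorum
        return w + r > n and 2 * w > n


def durable_lsn(acked_lsn: Mapping[str, int], write_quorum: int) -> int:
    """Highest LSN persisted by at least `write_quorum` replicas (0 if none).

    Assumes each replica's value means "all log records up to this LSN are
    persisted", i.e. acknowledgements cover a gap-free prefix of the log.
    """
    acked = sorted(acked_lsn.values(), reverse=True)
    if len(acked) < write_quorum:
        return 0
    # The W-th largest ack is the point that W replicas have all reached.
    return acked[write_quorum - 1]


def can_ack_commit(commit_lsn: int, acked_lsn: Mapping[str, int],
                   cfg: QuorumConfig) -> bool:
    """True once the commit record's LSN is covered by a write quorum."""
    return cfg.is_safe() and durable_lsn(acked_lsn, cfg.write_quorum) >= commit_lsn


if __name__ == "__main__":
    cfg = QuorumConfig(replicas=6, write_quorum=4, read_quorum=3)
    acks = {"a": 120, "b": 118, "c": 120, "d": 95, "e": 119, "f": 0}
    print(durable_lsn(acks, cfg.write_quorum))  # LSN covered by 4 replicas
    print(can_ack_commit(118, acks, cfg))
    print(can_ack_commit(119, acks, cfg))
```

In this sketch, slow or partitioned replicas (such as "d" and "f") do not block commits as long as a write quorum keeps up, and the gap between the newest acknowledged LSN and `durable_lsn` is one candidate signal for the replication-lag metric raised under Observability.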