Fault-Tolerant Backend System Design
Asked of: Software Engineer
Last updated
What's being tested
Interviewers are probing your ability to design fault-tolerant, durable backend systems that maintain correctness under crashes, concurrency, and partial failures while meeting performance constraints. Expect to justify on-disk layouts, recovery paths, concurrency control, and tradeoffs between latency, throughput, and durability. Databricks values engineers who can make pragmatic choices (e.g., between simpler, auditable designs and complex but faster designs) and clearly communicate failure modes and mitigations.
Core knowledge
-
Durability: durable means committed state survives process and machine crashes; the common primitive is the write-ahead log (WAL) plus periodic checkpoints;
fsyncsemantics and mount/OS orderings matter for guarantees. -
WAL vs LSM-tree vs B-tree: WAL provides durable append-only logging; LSM-tree (used by
LevelDB/RocksDB) provides high write throughput via memtables and SSTables; B-tree (used bySQLite,Postgres) favors lower read amplification and in-place updates. -
Compaction & write amplification: compaction in LSMs trades CPU and I/O for read performance; write amplification ≈ total bytes written / user bytes written; typical LSM amplification can be 2–10× depending on levels and compaction strategy.
-
Crash recovery: replay the WAL up to last committed entry, then apply a checkpoint or snapshot; ensure idempotent replays or use sequence numbers to avoid double-application.
-
Atomicity for single-node systems: use atomic file rename or two-phase write with checksum + atomic swap for single-key atomic replacement; for multi-key transactions, make explicit isolation guarantees (e.g., snapshot isolation via MVCC).
-
Concurrency control: for in-process systems, low-latency designs use lock-free reads with copy-on-write memtables; otherwise use fine-grained locks, reader–writer locks, or MVCC to reduce read–write contention.
-
Range reads and caching: support range scans by organizing SSTables or block indexes to map key ranges to file offsets; use block-level caching plus request coalescing to deduplicate concurrent range fetches.
-
Failure modes at boundaries: durable persistence is only as strong as your metadata; store manifest/manifest checksums atomically and validate on startup to detect partial compactions or corrupted SSTables.
-
Eviction, persistence tradeoff: a single-node persistent in-memory cache should checkpoint periodically to disk; tune checkpoint interval against throughput and recovery time objective (RTO).
-
External integration patterns: for aggregators calling external price APIs, use idempotency keys, bounded retries, exponential backoff, and request coalescing to avoid thundering herds and inconsistent caches.
-
Observability & testing: surface
p99, throughput, recovery time, and corruption counts; test using crash-failure injection (kill during compaction, power-failfsyncomitted) and validate invariants with checksums. -
Sizing rules of thumb: memtable default sizes ~64–512MB; SSTable target file sizes ~64–512MB; these work up to hundreds of GB per node; beyond that, favor sharding/partitioning.
Worked example — Design a durable key-value store
First 30s: ask workload (read-heavy vs write-heavy), expected dataset size, durability level (sync-per-write vs async), single-process vs distributed, and whether range scans are required. Skeleton: (1) on-write path: append to WAL then apply to in-memory memtable; (2) background compaction: flush memtable → immutable SSTable and merge levels; (3) read path: check memtable(s) then SSTable indexes with block cache; (4) recovery: replay WALs and load manifest, validate checksums. Key tradeoff: choose sync-per-write fsync for highest durability but high latency, or group commits (batching) for throughput at cost of small data-loss window. Explicit decision: use LSM design (RocksDB-style) if writes dominate; choose B-tree-like approach for small dataset and range latency. Close: state how you'd iterate — add background checksumming, tune compaction strategy, expose metrics, and design a simple replication protocol if multi-node durability is required.
A second angle — Design concurrent range-aware file caching client
Same primitives apply but constraints differ: this client must deduplicate in-flight range requests, persist cached chunks, and validate freshness. Use a chunked on-disk layout and an in-memory index mapping ranges to chunk metadata, protect concurrent access with per-file range locks or lock-free interval trees for low contention. Implement request coalescing so simultaneous requests for overlapping ranges wait on one upstream fetch; use ETags or checksums for validation and conditional GETs to avoid stale data. Tune prefetch and eviction by access patterns: sequential reads favor larger prefetch windows, random reads favor small chunks and LRU. The durability question narrows to atomic writes of cached chunks (use temp-file + atomic rename) and a compact manifest to recover cache state after crashes.
Common pitfalls
Pitfall: Assuming
fsyncis instantaneous — many candidates omit OS and storage latencies; state realistic latencies and batch writes or expose sync-mode configuration to users.
Pitfall: Designing compaction without accounting for write amplification — proposing aggressive merges without quantifying I/O cost will lead to unacceptable backend load.
Pitfall: Overgeneralizing distributed solutions — avoid proposing full consensus (e.g.,
Raft) when the prompt is single-node; instead state why you'd add replication later and how you'd keep the single-node design simple first.
Connections
Interviewers may pivot to distributed consensus (e.g., Raft) when durability needs span nodes, or to storage engine internals like bloom filters and block indexes for read amplification optimizations. They may also ask about observability (SLOs, p99, recovery time) or efficient on-disk formats for analytics workloads.
Further reading
-
The Log-Structured Merge-Tree (LSM-tree) paper — canonical design explaining memtable/SSTable/compaction tradeoffs.
-
Raft: In Search of an Understandable Consensus Algorithm — why and how to add replication/consensus if you extend to multi-node durability.
-
RocksDBdesign notes — pragmatic engineering choices and tuning knobs for single-node high-throughput stores.
Practice questions
- Design a Book Price AggregatorDatabricks · Software Engineer · Technical Screen · medium
- Design a single-node persistent in-memory cacheDatabricks · Software Engineer · Technical Screen · hard
- Design a durable key-value storeDatabricks · Software Engineer · Onsite · hard
- Design concurrent range-aware file caching clientDatabricks · Software Engineer · HR Screen · hard
Related concepts
- Cluster Job Scheduling And Resource Isolation
- Production System Design TradeoffsSystem Design
- Scalable Distributed System ArchitectureSystem Design
- Fault Tolerance, Idempotency, And Concurrency ControlSystem Design
- Distributed Systems Consistency And Low-Latency DesignSystem Design
- Scalable Backend Architecture And Data ModelingSystem Design