Design a Thread-Safe Buffered Writer with a Background Flush Thread
Company: Databricks
Role: Software Engineer
Category: System Design
Difficulty: medium
Interview Round: Onsite
## Design a Thread-Safe Buffered Writer with a Background Flush Thread
Design a reusable in-process component — call it `BufferedDiskWriter` — that
many application threads can write records to concurrently, while a single
dedicated **master (flusher) thread** is responsible for actually persisting the
buffered data to disk. The goal is to keep the producing threads fast (they
should not each block on disk I/O), batch many small writes into larger
sequential disk writes, and still give callers a clear story about when their
data is durable.
Think of this as the engine behind an application log, a write-ahead log, or a
metrics/event sink: hundreds of producer threads call `append(record)` at high
frequency, and you do not want every call to issue its own `write()`/`fsync()`
syscall.
This is a **design discussion**, not a coding exercise. Walk through the API,
the concurrency model, the data structures, the flush policy, and the failure
behavior. Justify your trade-offs out loud; pseudo-code for the hot paths is
welcome but a fully compiling implementation is not expected.
### Constraints & Assumptions
- Single process, single machine, multiple threads (tens to low hundreds of
producer threads). Exactly **one** master thread owns the file handle and
performs disk writes.
- Records are small (tens to a few hundred bytes) opaque byte blobs; throughput
can reach `10^5`–`10^6` `append` calls/second in bursts.
- The backing store is a local append-only file on a single disk/SSD.
- Producers strongly prefer low, predictable latency on `append`; a small bound
on how long data may sit unflushed (e.g. tens of milliseconds) is acceptable
for the default mode.
- Memory is bounded — you cannot buffer an unbounded backlog if the disk falls
behind.
### Clarifying Questions to Ask
- **Durability contract:** Must `append` return only after the record is durable
on disk, or is it fire-and-forget with durability happening asynchronously?
Is there a separate `flush()`/`sync()` the caller can await?
- **Ordering:** Do we need a total order across all producers, or only
per-producer (per-thread) ordering of that thread's own records?
- **Loss tolerance on crash:** On process crash or power loss, is it acceptable
to lose the last few milliseconds of un-fsync'd records, or is zero loss
required for acknowledged writes?
- **Backpressure policy:** When producers outrun the disk, should `append`
block, drop records, or fail fast? Which records are droppable?
- **Record framing & corruption:** Do we own the on-disk format (length prefix,
checksums, record boundaries), and must a reader be able to recover after a
torn final write?
### Part 1 — Core concurrency model and API
Define the public API of `BufferedDiskWriter` and the mechanism by which many
producer threads hand records to the single master thread without serializing
on a global lock for every byte written. Describe the in-memory data
structure(s), how producers enqueue, and how the master thread drains and writes
in batches.
```hint Where to start
Separate the *producer* fast path (enqueue into memory) from the *consumer*
slow path (the master thread doing `write`/`fsync`). The only contention is on
the handoff structure, not on disk.
```
```hint Data structure
Consider double buffering (an "active" buffer producers fill and a "flushing"
buffer the master drains), or a bounded multi-producer/single-consumer queue.
A lock held only for the O(1) enqueue/swap — not for the disk write — keeps the
critical section tiny.
```
#### What This Part Should Cover
```premium-lock What This Part Should Cover
```
### Part 2 — Flush policy and backpressure
Specify when the master thread flushes, and what happens when producers generate
records faster than the disk can absorb them. The component must not consume
unbounded memory.
```hint Triggers
Flush on any of: a size threshold (buffer reached N bytes), a time threshold
(at most T milliseconds of latency), or an explicit `flush()` call. "Group
commit" amortizes one `fsync` across many records.
```
```hint Backpressure
A bounded buffer forces a decision when full: block the producer (apply
backpressure), drop/sample records (acceptable for some logs/metrics), or
reject with an error. State which one and why for the target use case.
```
#### Clarifying Questions for this Part
- Is this primarily a metrics/log sink where dropping the least-important records
under overload is acceptable, or a write-ahead log where every acknowledged
record must survive?
#### What This Part Should Cover
```premium-lock What This Part Should Cover
```
### Part 3 — Durability, failure handling, and shutdown
Describe the durability guarantee you actually provide and how you handle
failures: a slow or full disk, an I/O error from `write`/`fsync`, a crash midway
through a write, and a clean `close()`.
```hint Durability primitive
A successful `write()` only reaches the OS page cache; durability requires
`fsync`/`fdatasync`. Decide where `fsync` sits relative to acknowledging a
`flush()` so you do not over-promise.
```
```hint Crash safety
Frame each record (length prefix + checksum) so a reader can detect and discard
a torn trailing record after a crash, and so corruption does not desync the
whole file.
```
#### What This Part Should Cover
```premium-lock What This Part Should Cover
```
### What a Strong Answer Covers
```premium-lock What a Strong Answer Covers
```
### Follow-up Questions
- How would you provide a `flush()` that a producer can **await** until its
already-appended records are durable, without blocking other producers? (Hint:
sequence numbers + a "durable up to N" watermark the producer can wait on.)
- Suppose a single master thread + `fsync` becomes the bottleneck. How would you
scale write throughput — multiple files/shards, multiple flusher threads,
direct I/O, or group-commit tuning — and what ordering guarantees would you
lose?
- If you needed total ordering across all producers (not just per-thread), how
would you assign order, and what does that cost on the hot path?
- How would you make the component observable in production (queue depth, flush
latency, drop counts, fsync errors), and which metric would page you first?
Quick Answer: This question assesses a candidate's ability to design a concurrent, thread-safe component that decouples fast in-memory writes from slower disk I/O using a dedicated background thread. It probes practical understanding of concurrency control, buffering and batching strategies, backpressure, and durability guarantees in a systems design context. It is commonly used to test applied, real-world concurrent-systems reasoning rather than pure algorithmic recall.