Design a Hierarchical File System
Company: Databricks
Role: Software Engineer
Category: System Design
Difficulty: hard
Interview Round: Onsite
Design a **hierarchical file system service from scratch**, without relying on an existing blob store or distributed object storage (you must store the raw file bytes yourself, on your own storage nodes). The service exposes a POSIX-like tree of directories and files and must support the following operations:
- **Create** a file or a directory at a given path.
- **Read** and **write** the contents of a file.
- **List** the immediate children of a directory.
- **Delete a directory recursively** (the directory, all of its descendants, and all of their file data).
The interviewer cares much more about depth than breadth. Spend most of your time on the two operations that are genuinely hard at scale: **directory listing** and **recursive directory deletion**. For those, be concrete about the **metadata layout**, **consistency model**, **locking / leasing**, **where and how raw file bytes are stored**, and **recovery after a crash mid-operation**.
```hint Where to start
**Separate metadata from data.** A path lookup, a `list`, and a permission check only ever touch *metadata* (inodes + directory entries); the actual file bytes live elsewhere. Conflating the two is the most common way this design goes wrong.
```
```hint Metadata model
Think in terms of **inodes** and **directory entries** as two separate concepts, the way classic filesystems do. An inode describes *an object* (type, size, owner, block list); a directory entry maps *a name within a parent* to a child inode. This indirection is what makes hard links, rename, and large-directory indexing tractable.
```
```hint Large directories
A directory with tens of millions of children cannot be stored as one row or one file. What data structure lets you both *look up one child by name* cheaply and *list children in order, a page at a time* without re-reading the whole directory? Think about what key you'd index on so a single directory's children are contiguous.
```
```hint Recursive delete
Doing the whole subtree synchronously will touch millions of objects and is fragile if it crashes halfway. Decouple the two halves: make the **namespace removal** appear atomic/immediate, and push the **physical reclamation** of bytes to an idempotent background process. The keyword is *tombstone + async garbage collection*.
```
```hint Consistency & races
The dangerous interleavings are `list` vs `delete` and `create` vs `delete` on the same directory. Consider a per-subtree **lock or lease**, plus a "directory is being deleted" state that blocks new children, and version/CAS checks so a stale `list` can't resurrect a deleted entry.
```
```hint Crash recovery
Write the intent **before** you mutate (a write-ahead log / journal), make every step idempotent, and persist enough state ("this subtree is mid-delete") that a fresh process can *resume* an interrupted operation rather than leave the tree half-deleted.
```
### Constraints & Assumptions
State your assumptions explicitly, but a reasonable default scale to design for is:
- Billions of files and directories total; a single directory may hold anywhere from a few entries up to **tens of millions** of children (e.g. a logging or dataset directory).
- File sizes range from a few bytes to multiple TB, so large files must be split across storage.
- Reads are far more frequent than writes (assume roughly 100:1), and `list` is one of the hottest operations.
- Metadata operations (`create`, `delete`, `rename`, `list`) must survive process and node crashes with no lost or partially-applied trees.
- You do **not** have S3 / GCS / an existing blob store. You own the disks.
- Multiple clients may operate on the same subtree concurrently.
### Clarifying Questions to Ask
- Is this a single-machine filesystem or a distributed one across many nodes? Roughly how many nodes and how much total capacity?
- What consistency do we owe `list` and `read` — strictly linearizable, read-your-writes, or is eventual consistency acceptable for listing?
- Are renames/moves and hard links in scope, or only create/read/write/list/delete?
- For recursive delete, must the directory *disappear* atomically from the user's view, or is a brief "deleting…" state acceptable as long as the bytes are eventually reclaimed?
- What are the latency targets for `list` on a small directory vs. a 10M-entry directory?
- Do we need POSIX permissions/ownership and atomic-rename semantics, or simpler ACLs?
### What a Strong Answer Covers
- A clean **metadata-vs-data split**, with named components (namespace/path resolution, metadata store, storage nodes, a lock/lease manager, a GC/cleanup worker).
- A concrete **metadata schema**: an inode table and a separate directory-entry index, and how a file inode points to its data blocks.
- **Path resolution** walking root → target, and how hot directory metadata is cached without serving stale authoritative state.
- **`list` at scale**: ordered index keyed by `(parent_inode_id, name)`, **pagination/cursors**, and stable ordering under concurrent mutation.
- **Storing file bytes yourself**: chunking large files into fixed-size blocks, replication across storage nodes, and the inode→block-list mapping.
- **Recursive delete** as *logical delete now, physical delete later*: subtree lock, mark-deleting, transactional namespace removal, tombstones, async GC of blocks.
- **Consistency & isolation**: which metadata ops need strong consistency, and how `list`/`create`/`delete` races are prevented (transactions, leases, version checks).
- **Failure handling & recovery**: WAL/journal before apply, idempotent workers, resuming interrupted deletes, replication of both metadata and data.
- **Scaling & hotspots**: partitioning metadata (by inode range or subtree), and the danger of a single hot directory becoming a bottleneck.
### Follow-up Questions
- A recursive delete of a 50M-file directory crashes after the namespace removal commits but before any blocks are reclaimed. Walk through exactly how the system recovers and guarantees those blocks are eventually freed without freeing anything still referenced.
- A client calls `list` on a directory while another client is recursively deleting it. What does the first client see, and how do you prevent it from observing a torn or resurrected view of the tree?
- How would you support an **atomic `rename`** of a directory (including moving a large subtree to a new parent) without physically rewriting every descendant entry?
- The metadata store for one subtree becomes a hotspot under heavy `create`/`list` traffic. How do you detect this and rebalance, and what breaks if you naively split a single hot directory across partitions?
Quick Answer: This question evaluates understanding of scalable distributed storage and namespace architecture, covering metadata versus data separation, inode-like metadata models, data placement, consistency and concurrency control, and crash recovery.