Implement a Merkle-tree repo diff
Company: Cursor
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: hard
Interview Round: Technical Screen
Quick Answer: This question evaluates a candidate's understanding of Merkle trees, filesystem-aware hashing, directory traversal, and diff algorithms for detecting file-level changes.
Constraints
- 0 <= len(snapshot_a), len(snapshot_b) <= 2000
- The total size of all file contents across both snapshots is at most 2 * 10^6 bytes
- Within a single snapshot, all relative paths are unique
- Within a single snapshot, no file path is a prefix of another file path
- Paths are relative, use '/' as the separator, and refer only to regular files and directories implied by those paths
Examples
Input: ([('README.md', 'repo'), ('src/main.py', 'main\n'), ('src/util.py', 'add\n')], [('README.md', 'repo'), ('src/main.py', 'main\n'), ('src/util.py', 'add\n')])
Expected Output: []
Explanation: Both snapshots contain the same files with the same contents, so no file differs.
Input: ([('README.md', 'repo'), ('src/main.py', 'v1\n'), ('src/utils/helpers.py', 'helper\n'), ('docs/old.md', 'old\n')], [('README.md', 'repo'), ('src/main.py', 'v2\n'), ('src/utils/helpers.py', 'helper\n'), ('src/utils/new.py', 'new\n'), ('tests/test_main.py', 'ok\n')])
Expected Output: [('docs/old.md', 'REMOVED'), ('src/main.py', 'MODIFIED'), ('src/utils/new.py', 'ADDED'), ('tests/test_main.py', 'ADDED')]
Explanation: One file was removed, one existing file changed contents, and two new files were added.
Input: ([('config', 'port=8080\n'), ('keep.txt', 'same')], [('config/app.yml', 'port: 8080\n'), ('keep.txt', 'same')])
Expected Output: [('config', 'REMOVED'), ('config/app.yml', 'ADDED')]
Explanation: The path `config` changed from a file to a directory containing a file. Treat that as removing the old file and adding the new file path.
Input: ([], [('README.md', ''), ('src/app.py', 'run\n')])
Expected Output: [('README.md', 'ADDED'), ('src/app.py', 'ADDED')]
Explanation: The first snapshot is empty, so every file in the second snapshot is added.
Input: ([('a/b/c.txt', '1'), ('a/d.txt', '2'), ('x.txt', 'same')], [('x.txt', 'same')])
Expected Output: [('a/b/c.txt', 'REMOVED'), ('a/d.txt', 'REMOVED')]
Explanation: The entire `a` subtree exists only in the first snapshot, so all files inside it are reported as removed.
Input: ([('bin/data.bin', b'\x00\x01'), ('same.txt', 'ok')], [('bin/data.bin', b'\x00\x02'), ('same.txt', 'ok')])
Expected Output: [('bin/data.bin', 'MODIFIED')]
Explanation: The binary file contents differ, so that file is modified while the other file is unchanged.
Hints
- Hash each file from its contents, and hash each directory from its sorted children as tuples like (name, kind, child_hash). Deterministic ordering is essential.
- When diffing, if two directory hashes are equal, you can skip the whole subtree. Only recurse when hashes differ or when a child exists on just one side.