PracHub
QuestionsPremiumCoachesLearningGuidesInterview Prep

Quick Overview

This question evaluates a candidate's understanding of Merkle trees, filesystem-aware hashing, directory traversal, and diff algorithms for detecting file-level changes.

  • hard
  • Cursor
  • Coding & Algorithms
  • Software Engineer

Implement a Merkle-tree repo diff

Company: Cursor

Role: Software Engineer

Category: Coding & Algorithms

Difficulty: hard

Interview Round: Technical Screen

You are given two snapshots of a source-code repository on disk, each represented by a root directory path. Implement a Merkle tree over the **directory structure** (folders are internal nodes, files are leaves), and use it to compute which files changed between the two snapshots. Use the following API (Python-like): ```python from enum import Enum, auto from typing import Self class ChangeType(Enum): ADDED = auto() MODIFIED = auto() REMOVED = auto() class MerkleTree: def __init__(self, root_directory: str): """Build the Merkle tree from the filesystem under root_directory.""" ... def diff(self, other: Self) -> list[tuple[str, ChangeType]]: """Return a list of (relative_path, ChangeType) for all files that differ.""" ... ``` Requirements / clarifications: - The tree structure must follow the **actual folder hierarchy** in the repo (not a binary tree). - A file node’s hash should depend on its **contents** (and optionally its name); a directory node’s hash should depend on its children (names + hashes) in a deterministic order. - `diff()` should report **files only** as: - `ADDED`: path exists only in `other`. - `REMOVED`: path exists only in `self`. - `MODIFIED`: path exists in both but file contents differ. - Paths in the output should be **relative to the root** (e.g., `"src/main.py"`). - If a path changes type (file ↔ directory), you may treat it as removed + added for the affected file paths. - Use only the standard library to read directories/files. Design and implement the data structures and logic so it can run against real repositories on disk.

Quick Answer: This question evaluates a candidate's understanding of Merkle trees, filesystem-aware hashing, directory traversal, and diff algorithms for detecting file-level changes.

Implement a repository diff using a Merkle tree that mirrors the real directory hierarchy. Files are leaves, and directories are internal nodes with any number of children. A file node's hash must depend on its contents. A directory node's hash must depend on its children's names, types, and hashes in a deterministic sorted order. If two directory hashes are equal, you should be able to skip that entire subtree during diffing. Original interview API: - MerkleTree(root_directory) - diff(other) -> list[(relative_path, ChangeType)] To keep the judge self-contained, implement `solution(snapshot_a, snapshot_b)` instead. Each snapshot is a list of `(relative_path, contents)` pairs describing the files in that repository snapshot. Paths use `/` as the separator, and directories are implied by the file paths. Your implementation should still follow the real on-disk logic from the interview question. The reference solution materializes the snapshots into temporary directories, builds an on-disk Merkle tree using only the standard library, and diffs the two trees. Return a lexicographically sorted list of `(relative_path, change_type)` for files only, where `change_type` is one of `'ADDED'`, `'MODIFIED'`, or `'REMOVED'`. - `ADDED`: path exists only in `snapshot_b` - `REMOVED`: path exists only in `snapshot_a` - `MODIFIED`: path exists in both snapshots but file contents differ If a path changes type between snapshots (file <-> directory), treat it as removed plus added for the affected file paths.

Constraints

  • 0 <= len(snapshot_a), len(snapshot_b) <= 2000
  • The total size of all file contents across both snapshots is at most 2 * 10^6 bytes
  • Within a single snapshot, all relative paths are unique
  • Within a single snapshot, no file path is a prefix of another file path
  • Paths are relative, use '/' as the separator, and refer only to regular files and directories implied by those paths

Examples

Input: ([('README.md', 'repo'), ('src/main.py', 'main\n'), ('src/util.py', 'add\n')], [('README.md', 'repo'), ('src/main.py', 'main\n'), ('src/util.py', 'add\n')])

Expected Output: []

Explanation: Both snapshots contain the same files with the same contents, so no file differs.

Input: ([('README.md', 'repo'), ('src/main.py', 'v1\n'), ('src/utils/helpers.py', 'helper\n'), ('docs/old.md', 'old\n')], [('README.md', 'repo'), ('src/main.py', 'v2\n'), ('src/utils/helpers.py', 'helper\n'), ('src/utils/new.py', 'new\n'), ('tests/test_main.py', 'ok\n')])

Expected Output: [('docs/old.md', 'REMOVED'), ('src/main.py', 'MODIFIED'), ('src/utils/new.py', 'ADDED'), ('tests/test_main.py', 'ADDED')]

Explanation: One file was removed, one existing file changed contents, and two new files were added.

Input: ([('config', 'port=8080\n'), ('keep.txt', 'same')], [('config/app.yml', 'port: 8080\n'), ('keep.txt', 'same')])

Expected Output: [('config', 'REMOVED'), ('config/app.yml', 'ADDED')]

Explanation: The path `config` changed from a file to a directory containing a file. Treat that as removing the old file and adding the new file path.

Input: ([], [('README.md', ''), ('src/app.py', 'run\n')])

Expected Output: [('README.md', 'ADDED'), ('src/app.py', 'ADDED')]

Explanation: The first snapshot is empty, so every file in the second snapshot is added.

Input: ([('a/b/c.txt', '1'), ('a/d.txt', '2'), ('x.txt', 'same')], [('x.txt', 'same')])

Expected Output: [('a/b/c.txt', 'REMOVED'), ('a/d.txt', 'REMOVED')]

Explanation: The entire `a` subtree exists only in the first snapshot, so all files inside it are reported as removed.

Input: ([('bin/data.bin', b'\x00\x01'), ('same.txt', 'ok')], [('bin/data.bin', b'\x00\x02'), ('same.txt', 'ok')])

Expected Output: [('bin/data.bin', 'MODIFIED')]

Explanation: The binary file contents differ, so that file is modified while the other file is unchanged.

Hints

  1. Hash each file from its contents, and hash each directory from its sorted children as tuples like (name, kind, child_hash). Deterministic ordering is essential.
  2. When diffing, if two directory hashes are equal, you can skip the whole subtree. Only recurse when hashes differ or when a child exists on just one side.
Last updated: Apr 19, 2026

Related Coding Questions

  • Implement Notification Rate Limiter - Cursor (medium)

Loading coding console...

PracHub

Master your tech interviews with 8,000+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.