Implement file deduplication at scale
Company: Anthropic
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: Medium
Interview Round: Onsite
Quick Answer: This question evaluates a candidate's ability to design and implement a scalable file-deduplication tool. It covers file-system and OS APIs, efficient hashing and large-file handling, I/O and memory trade-offs, and filesystem semantics such as symlinks, permissions, and error cases.
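The problem below models files as in-memory records, but the same size-first filter translates directly to a live filesystem. As a rough sketch of that translation (the function name, root argument, and error policy are illustrative assumptions, not part of the problem statement):

import os
import stat
from collections import defaultdict

def group_regular_files_by_size(root):
    # Walk root and bucket readable regular files by size. Buckets with
    # fewer than two paths can never hold duplicates, so the expensive
    # later stages only need to look at multi-file buckets.
    by_size = defaultdict(list)
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)              # lstat: never follow symlinks
            except OSError:
                continue                         # vanished or permission denied
            if not stat.S_ISREG(st.st_mode):     # skip symlinks, FIFOs, devices
                continue
            if not os.access(path, os.R_OK):     # mirrors the 'readable' flag
                continue
            by_size[st.st_size].append(path)
    return by_size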
Constraints
- 0 <= len(entries) <= 100000
- 0 <= prefix_bytes <= 4096
- 1 <= chunk_size <= 4096
- The total number of bytes across readable regular files is at most 10^7
- For valid readable file records, 'content' is a bytes object and 'path' values are unique
Examples
Input: ([{'path': '/a/report.txt', 'type': 'file', 'readable': True, 'content': b'hello world'}, {'path': '/b/copy.txt', 'type': 'file', 'readable': True, 'content': b'hello world'}, {'path': '/c/other.txt', 'type': 'file', 'readable': True, 'content': b'HELLO WORLD'}, {'path': '/d/link', 'type': 'symlink', 'readable': True, 'content': b'hello world'}], 4, 3)
Expected Output: [['/a/report.txt', '/b/copy.txt']]
Explanation: The two readable regular files with identical content are duplicates. The symlink is ignored, and '/c/other.txt' differs by content.
Input: ([{'path': '/x/a.bin', 'type': 'file', 'readable': True, 'content': b'abcdef12'}, {'path': '/x/b.bin', 'type': 'file', 'readable': True, 'content': b'abcdef34'}, {'path': '/x/c.bin', 'type': 'file', 'readable': True, 'content': b'abcdef56'}, {'path': '/x/private.bin', 'type': 'file', 'readable': False, 'content': b'abcdef12'}], 6, 2)
Expected Output: []
Explanation: All readable files have the same size and first 6 bytes, but their full contents differ, so no duplicate group should be reported.
Input: ([{'path': '/empty/one', 'type': 'file', 'readable': True, 'content': b''}, {'path': '/empty/two', 'type': 'file', 'readable': True, 'content': b''}, {'path': '/empty/skip', 'type': 'file', 'readable': False, 'content': b''}, {'path': '/empty/dir', 'type': 'dir', 'readable': True, 'content': b''}], 8, 4)
Expected Output: [['/empty/one', '/empty/two']]
Explanation: Empty readable regular files are duplicates of each other. The unreadable file and directory are ignored.
Input: ([{'path': '/z/f1.txt', 'type': 'file', 'readable': True, 'content': b'same1'}, {'path': '/a/f2.txt', 'type': 'file', 'readable': True, 'content': b'same1'}, {'path': '/m/f3.txt', 'type': 'file', 'readable': True, 'content': b'data'}, {'path': '/b/f4.txt', 'type': 'file', 'readable': True, 'content': b'data'}, {'path': '/c/f5.txt', 'type': 'file', 'readable': True, 'content': b'unique'}], 2, 2)
Expected Output: [['/a/f2.txt', '/z/f1.txt'], ['/b/f4.txt', '/m/f3.txt']]
Explanation: There are two duplicate groups: one for content 'same1' and one for content 'data'. Paths inside groups and the groups themselves are sorted lexicographically.
Input: ([], 8, 4)
Expected Output: []
Explanation: With no entries, there are no duplicates.
Solution
def solution(entries, prefix_bytes, chunk_size):
    from collections import defaultdict
    import hashlib
    if prefix_bytes < 0 or chunk_size <= 0:
        raise ValueError('prefix_bytes must be >= 0 and chunk_size must be > 0')

    def digest_bytes(data):
        # Hash in chunk_size pieces; memoryview slices avoid copying bytes.
        hasher = hashlib.sha256()
        view = memoryview(data)
        for i in range(0, len(data), chunk_size):
            hasher.update(view[i:i + chunk_size])
        return hasher.hexdigest()

    # Stage 1: keep only valid, readable regular-file records, bucketed by size.
    size_groups = defaultdict(list)
    for entry in entries:
        if not isinstance(entry, dict):
            continue
        if entry.get('type') != 'file' or not entry.get('readable', False):
            continue
        path = entry.get('path')
        content = entry.get('content')
        if path is None or not isinstance(content, (bytes, bytearray)):
            continue
        if isinstance(content, bytearray):
            content = bytes(content)
        size_groups[len(content)].append((path, content))

    duplicates = []
    for files_same_size in size_groups.values():
        if len(files_same_size) < 2:
            continue
        # Stage 2: split each size bucket by the first prefix_bytes bytes.
        prefix_groups = defaultdict(list)
        for path, content in files_same_size:
            prefix_groups[content[:prefix_bytes]].append((path, content))
        for files_same_prefix in prefix_groups.values():
            if len(files_same_prefix) < 2:
                continue
            # Stage 3: full SHA-256 digest over the surviving candidates.
            hash_groups = defaultdict(list)
            for path, content in files_same_prefix:
                digest = digest_bytes(content)
                hash_groups[digest].append((path, content))
            for files_same_hash in hash_groups.values():
                if len(files_same_hash) < 2:
                    continue
                # Stage 4: byte-for-byte grouping guards against hash collisions.
                exact_groups = defaultdict(list)
                for path, content in files_same_hash:
                    exact_groups[content].append(path)
                for paths in exact_groups.values():
                    if len(paths) > 1:
                        paths.sort()
                        duplicates.append(paths)
    duplicates.sort(key=lambda group: group[0])
    return duplicates

Time complexity: O(B + n log n) worst-case, where n is the number of entries and B is the total number of bytes across candidate files; hashing, prefix slicing, and the final exact comparison are all O(B), and sorting paths and groups contributes the n log n term. Space complexity: O(n) grouping structures plus the prefix-key slices, which total at most min(B, n * prefix_bytes) bytes.
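A quick sanity check of solution against the first and last examples above:

if __name__ == '__main__':
    entries = [
        {'path': '/a/report.txt', 'type': 'file', 'readable': True, 'content': b'hello world'},
        {'path': '/b/copy.txt', 'type': 'file', 'readable': True, 'content': b'hello world'},
        {'path': '/c/other.txt', 'type': 'file', 'readable': True, 'content': b'HELLO WORLD'},
        {'path': '/d/link', 'type': 'symlink', 'readable': True, 'content': b'hello world'},
    ]
    # Only the two identical readable regular files form a group.
    assert solution(entries, 4, 3) == [['/a/report.txt', '/b/copy.txt']]
    # No entries means no duplicates.
    assert solution([], 8, 4) == []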
Hints
- Files with different sizes can never be duplicates, so size is a very strong first filter.
- Inside a candidate bucket, use a hash map from digest to paths so you avoid O(k^2) pairwise comparisons; a streaming digest helper is sketched after this list.
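On a real filesystem, the digest in the second hint would be computed by streaming each file rather than loading it whole, keeping memory bounded regardless of file size. A minimal sketch (the helper name and default chunk size are illustrative):

import hashlib

def stream_digest(path, chunk_size=1 << 16):
    # Read and hash the file in fixed-size chunks so peak memory
    # stays O(chunk_size) even for multi-gigabyte files.
    hasher = hashlib.sha256()
    with open(path, 'rb') as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            hasher.update(chunk)
    return hasher.hexdigest()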