How do I practice coding and algorithm questions?

Use PracHub's coding console to write, test, and debug your solutions in Python or JavaScript. View hints, test against sample inputs, and compare with official solutions.

What difficulty level is this coding question?

This is a Medium difficulty Coding & Algorithms question, commonly asked during Onsite rounds at Anthropic.

What role is this question designed for?

This question is commonly asked for Software Engineer candidates at Anthropic during technical interviews.

Find and remove duplicate files | Anthropic Coding Question

Quick Overview

This question evaluates the ability to design scalable file deduplication algorithms, specifically testing knowledge of hashing strategies, collision handling, memory and I/O optimization, handling very large files, incremental/resumable operation, and complexity and trade-off analysis.

Find and remove duplicate files

Company: Anthropic

Role: Software Engineer

Category: Coding & Algorithms

Difficulty: Medium

Interview Round: Onsite

Given a directory tree that may not fit in memory, detect and optionally remove duplicate files. Define the algorithm, including how you handle very large files, hashing strategy (e.g., size grouping, partial hash, full hash or chunked rolling hash), collision handling, memory and I/O optimization, and how you would make it incremental and resumable. Provide complexity analysis and discuss trade-offs.

Quick Answer: This question evaluates the ability to design scalable file deduplication algorithms, specifically testing knowledge of hashing strategies, collision handling, memory and I/O optimization, handling very large files, incremental/resumable operation, and complexity and trade-off analysis.

You are given a snapshot of a directory tree as a list of (path, content) pairs. Two files are duplicates if their contents are exactly equal, even if their paths are different. Write a function solution(files, remove) that simulates a scalable deduplication pipeline instead of doing O(n^2) pairwise comparisons: first group files by size, then by a cheap partial fingerprint, then by a full hash, and finally verify exact equality so hash collisions never cause wrong removals. For this coding task the file contents are already provided as strings, but the same staged design is what you would use for very large files by hashing them chunk-by-chunk and persisting metadata for incremental, resumable scans. For each duplicate group, keep the lexicographically smallest path as the canonical copy. If remove is True, return the other paths as removable. Do not mutate the input.

Constraints

0 <= len(files) <= 200000
Each path is unique
0 <= len(content) <= 100000 for each file
The sum of all content lengths is at most 2000000 in this coding version

Examples

Input: ([('root/a.txt', 'abc'), ('root/b.txt', 'xyz'), ('root/c.txt', 'abc'), ('root/d.txt', 'xyz'), ('root/e.txt', 'p')], True)

Expected Output: ([['root/a.txt', 'root/c.txt'], ['root/b.txt', 'root/d.txt']], ['root/c.txt', 'root/d.txt'])

Explanation: Files root/a.txt and root/c.txt match exactly, and root/b.txt and root/d.txt match exactly. root/e.txt is unique.

Input: ([('a', 'cat'), ('b', 'cat'), ('c', 'dog')], False)

Expected Output: ([['a', 'b']], [])

Explanation: The duplicate group is still reported, but remove is False so no path is marked for deletion.

Input: ([('a', ''), ('b', ''), ('c', 'x')], True)

Expected Output: ([['a', 'b']], ['b'])

Explanation: Empty files are valid content, so a and b are duplicates and b is removed because a is lexicographically smaller.

Input: ([('p1', 'abcd1234wxyz'), ('p2', 'abcdZZZZwxyz'), ('p3', 'abcd1234wxyz')], True)

Expected Output: ([['p1', 'p3']], ['p3'])

Explanation: All three files have the same size and the same prefix/suffix, so a partial fingerprint alone is not enough. Only p1 and p3 are exact matches after full hash and exact verification.

Input: ([('z', 'm'), ('a', 'm'), ('b', 'm'), ('x', 'n')], True)

Expected Output: ([['a', 'b', 'z']], ['b', 'z'])

Explanation: Three files have the same content. The lexicographically smallest path a is kept, and b and z are removable.

Input: ([('only', 'data')], True)

Expected Output: ([], [])

Explanation: A single file cannot form a duplicate group.

Solution

def solution(files, remove):
    from collections import defaultdict
    import hashlib

    CHUNK = 4
    size_groups = defaultdict(list)

    for path, content in files:
        size_groups[len(content)].append((path, content))

    duplicate_groups = []

    for candidates in size_groups.values():
        if len(candidates) < 2:
            continue

        partial_groups = defaultdict(list)
        for path, content in candidates:
            if len(content) <= 2 * CHUNK:
                partial = (content, "")
            else:
                partial = (content[:CHUNK], content[-CHUNK:])
            partial_groups[partial].append((path, content))

        for partial_bucket in partial_groups.values():
            if len(partial_bucket) < 2:
                continue

            full_hash_groups = defaultdict(list)
            for path, content in partial_bucket:
                digest = hashlib.sha256(content.encode("utf-8")).digest()
                full_hash_groups[digest].append((path, content))

            for hash_bucket in full_hash_groups.values():
                if len(hash_bucket) < 2:
                    continue

                exact_groups = defaultdict(list)
                for path, content in hash_bucket:
                    exact_groups[content].append(path)

                for paths in exact_groups.values():
                    if len(paths) >= 2:
                        paths.sort()
                        duplicate_groups.append(paths)

    duplicate_groups.sort(key=lambda group: (group[0], len(group), group))

    removed_paths = []
    if remove:
        for group in duplicate_groups:
            removed_paths.extend(group[1:])
        removed_paths.sort()

    return duplicate_groups, removed_paths

Time complexity: O(T + n log n), where T is the total number of characters across all file contents; the log factor comes from sorting duplicate groups and removable paths.. Space complexity: O(n) extra space, excluding the input contents..

Hints

Start by grouping files by length; files with different sizes can never be duplicates.
Use a multi-stage fingerprint: cheap prefix/suffix key first, full hash next, and exact content equality last to protect against collisions.

Quick Overview