PracHub
QuestionsPremiumCoachesLearningGuidesInterview Prep

Quick Overview

This question evaluates the ability to design scalable file deduplication algorithms, specifically testing knowledge of hashing strategies, collision handling, memory and I/O optimization, handling very large files, incremental/resumable operation, and complexity and trade-off analysis.

  • Medium
  • Anthropic
  • Coding & Algorithms
  • Software Engineer

Find and remove duplicate files

Company: Anthropic

Role: Software Engineer

Category: Coding & Algorithms

Difficulty: Medium

Interview Round: Onsite

Given a directory tree that may not fit in memory, detect and optionally remove duplicate files. Define the algorithm, including how you handle very large files, hashing strategy (e.g., size grouping, partial hash, full hash or chunked rolling hash), collision handling, memory and I/O optimization, and how you would make it incremental and resumable. Provide complexity analysis and discuss trade-offs.

Quick Answer: This question evaluates the ability to design scalable file deduplication algorithms, specifically testing knowledge of hashing strategies, collision handling, memory and I/O optimization, handling very large files, incremental/resumable operation, and complexity and trade-off analysis.

You are given a snapshot of a directory tree as a list of (path, content) pairs. Two files are duplicates if their contents are exactly equal, even if their paths are different. Write a function solution(files, remove) that simulates a scalable deduplication pipeline instead of doing O(n^2) pairwise comparisons: first group files by size, then by a cheap partial fingerprint, then by a full hash, and finally verify exact equality so hash collisions never cause wrong removals. For this coding task the file contents are already provided as strings, but the same staged design is what you would use for very large files by hashing them chunk-by-chunk and persisting metadata for incremental, resumable scans. For each duplicate group, keep the lexicographically smallest path as the canonical copy. If remove is True, return the other paths as removable. Do not mutate the input.

Constraints

  • 0 <= len(files) <= 200000
  • Each path is unique
  • 0 <= len(content) <= 100000 for each file
  • The sum of all content lengths is at most 2000000 in this coding version

Examples

Input: ([('root/a.txt', 'abc'), ('root/b.txt', 'xyz'), ('root/c.txt', 'abc'), ('root/d.txt', 'xyz'), ('root/e.txt', 'p')], True)

Expected Output: ([['root/a.txt', 'root/c.txt'], ['root/b.txt', 'root/d.txt']], ['root/c.txt', 'root/d.txt'])

Explanation: Files root/a.txt and root/c.txt match exactly, and root/b.txt and root/d.txt match exactly. root/e.txt is unique.

Input: ([('a', 'cat'), ('b', 'cat'), ('c', 'dog')], False)

Expected Output: ([['a', 'b']], [])

Explanation: The duplicate group is still reported, but remove is False so no path is marked for deletion.

Input: ([('a', ''), ('b', ''), ('c', 'x')], True)

Expected Output: ([['a', 'b']], ['b'])

Explanation: Empty files are valid content, so a and b are duplicates and b is removed because a is lexicographically smaller.

Input: ([('p1', 'abcd1234wxyz'), ('p2', 'abcdZZZZwxyz'), ('p3', 'abcd1234wxyz')], True)

Expected Output: ([['p1', 'p3']], ['p3'])

Explanation: All three files have the same size and the same prefix/suffix, so a partial fingerprint alone is not enough. Only p1 and p3 are exact matches after full hash and exact verification.

Input: ([('z', 'm'), ('a', 'm'), ('b', 'm'), ('x', 'n')], True)

Expected Output: ([['a', 'b', 'z']], ['b', 'z'])

Explanation: Three files have the same content. The lexicographically smallest path a is kept, and b and z are removable.

Input: ([('only', 'data')], True)

Expected Output: ([], [])

Explanation: A single file cannot form a duplicate group.

Solution

def solution(files, remove):
    from collections import defaultdict
    import hashlib

    CHUNK = 4
    size_groups = defaultdict(list)

    for path, content in files:
        size_groups[len(content)].append((path, content))

    duplicate_groups = []

    for candidates in size_groups.values():
        if len(candidates) < 2:
            continue

        partial_groups = defaultdict(list)
        for path, content in candidates:
            if len(content) <= 2 * CHUNK:
                partial = (content, "")
            else:
                partial = (content[:CHUNK], content[-CHUNK:])
            partial_groups[partial].append((path, content))

        for partial_bucket in partial_groups.values():
            if len(partial_bucket) < 2:
                continue

            full_hash_groups = defaultdict(list)
            for path, content in partial_bucket:
                digest = hashlib.sha256(content.encode("utf-8")).digest()
                full_hash_groups[digest].append((path, content))

            for hash_bucket in full_hash_groups.values():
                if len(hash_bucket) < 2:
                    continue

                exact_groups = defaultdict(list)
                for path, content in hash_bucket:
                    exact_groups[content].append(path)

                for paths in exact_groups.values():
                    if len(paths) >= 2:
                        paths.sort()
                        duplicate_groups.append(paths)

    duplicate_groups.sort(key=lambda group: (group[0], len(group), group))

    removed_paths = []
    if remove:
        for group in duplicate_groups:
            removed_paths.extend(group[1:])
        removed_paths.sort()

    return duplicate_groups, removed_paths

Time complexity: O(T + n log n), where T is the total number of characters across all file contents; the log factor comes from sorting duplicate groups and removable paths.. Space complexity: O(n) extra space, excluding the input contents..

Hints

  1. Start by grouping files by length; files with different sizes can never be duplicates.
  2. Use a multi-stage fingerprint: cheap prefix/suffix key first, full hash next, and exact content equality last to protect against collisions.
Last updated: Jun 6, 2026

Loading coding console...

PracHub

Master your tech interviews with 8,500+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.

Related Coding Questions

  • Implement Persistent Memoization LRU Cache - Anthropic (hard)
  • Fix a Corrupted Bootloader Instruction - Anthropic (medium)
  • Implement a Time-Aware Task Manager - Anthropic (hard)
  • Implement a Simplified DNS Resolver - Anthropic (hard)
  • Implement Task Management and Duplicate Detection - Anthropic (medium)