PracHub
QuestionsPremiumCoachesLearningGuidesInterview Prep

Quick Overview

This question evaluates filesystem traversal, efficient I/O and memory use, hashing strategies for content comparison, duplicate-detection algorithms, and time/space complexity analysis.

  • Medium
  • Anthropic
  • Coding & Algorithms
  • Software Engineer

Detect duplicate files by content

Company: Anthropic

Role: Software Engineer

Category: Coding & Algorithms

Difficulty: Medium

Interview Round: Onsite

Detect duplicate files by content in a filesystem. Given access to a directory tree, return groups of file paths that have identical byte content. Minimize disk I/O by first grouping by file size, then using hashing (e.g., fast hash followed by cryptographic hash on collisions) and final byte-by-byte verification. Handle very large files via streaming reads, permission errors, and symbolic links. Provide working code and analyze time/space complexity.

Quick Answer: This question evaluates filesystem traversal, efficient I/O and memory use, hashing strategies for content comparison, duplicate-detection algorithms, and time/space complexity analysis.

Given a simulated directory-tree scan, find groups of files that have identical byte content. In a real filesystem, you would minimize disk I/O by grouping files by size first, then hashing candidate files in streaming chunks, and finally verifying matches byte-by-byte so hash collisions cannot produce false duplicates. For this coding problem, the directory walk is represented by parallel arrays. Files that cannot be read because of permission errors must be skipped, and symbolic links must not be followed.

Constraints

  • 0 <= len(paths) == len(contents) == len(readable) == len(is_symlink) <= 10000
  • Each paths[i] is unique.
  • contents[i] is a string representing raw bytes; every character has code point 0 through 255.
  • 0 <= len(contents[i]) <= 1000000
  • The total content length in the test data is at most 5000000.
  • Files with readable[i] == False or is_symlink[i] == True must be ignored.

Examples

Input: (['/root/a.txt', '/root/b.txt', '/root/c.txt', '/root/sub/d.txt', '/root/e.bin'], ['hello', 'world', 'hello', 'hello', 'hello!'], [True, True, True, True, True], [False, False, False, False, False])

Expected Output: [['/root/a.txt', '/root/c.txt', '/root/sub/d.txt']]

Explanation: Three readable regular files contain exactly 'hello'. The file containing 'world' differs, and 'hello!' has a different size.

Input: (['/a/empty1', '/b/empty2', '/a/img1', '/b/img2', '/c/text'], ['', '', 'ABC', 'ABC', 'ABD'], [True, True, True, True, True], [False, False, False, False, False])

Expected Output: [['/a/empty1', '/b/empty2'], ['/a/img1', '/b/img2']]

Explanation: The two empty files are duplicates, and the two files containing 'ABC' are duplicates. 'ABD' has the same size as 'ABC' but different content.

Input: (['/readable/original', '/secret/copy', '/link/copy', '/readable/other'], ['data', 'data', 'data', 'other'], [True, False, True, True], [False, False, True, False])

Expected Output: []

Explanation: The unreadable file and symbolic link are ignored, leaving no duplicate group among readable regular files.

Input: (['/x/one', '/x/two', '/x/three', '/x/four'], ['ab', 'ba', 'cd', 'dc'], [True, True, True, True], [False, False, False, False])

Expected Output: []

Explanation: All files have the same size, but none have identical content.

Input: ([], [], [], [])

Expected Output: []

Explanation: An empty directory scan contains no duplicate files.

Solution

def solution(paths, contents, readable, is_symlink):
    import hashlib
    import zlib

    CHUNK_SIZE = 4096

    def iter_encoded_chunks(data):
        for start in range(0, len(data), CHUNK_SIZE):
            yield data[start:start + CHUNK_SIZE].encode('latin1')

    def fast_hash(data):
        value = 0
        for chunk in iter_encoded_chunks(data):
            value = zlib.crc32(chunk, value)
        return value

    def crypto_hash(data):
        h = hashlib.sha256()
        for chunk in iter_encoded_chunks(data):
            h.update(chunk)
        return h.digest()

    def same_content(a, b):
        if len(a) != len(b):
            return False
        for start in range(0, len(a), CHUNK_SIZE):
            if a[start:start + CHUNK_SIZE] != b[start:start + CHUNK_SIZE]:
                return False
        return True

    by_size = {}
    for i in range(len(paths)):
        if not readable[i] or is_symlink[i]:
            continue
        size = len(contents[i])
        by_size.setdefault(size, []).append((paths[i], contents[i]))

    answer = []

    for size_group in by_size.values():
        if len(size_group) < 2:
            continue

        by_fast_hash = {}
        for path, data in size_group:
            key = fast_hash(data)
            by_fast_hash.setdefault(key, []).append((path, data))

        for fast_group in by_fast_hash.values():
            if len(fast_group) < 2:
                continue

            by_crypto_hash = {}
            for path, data in fast_group:
                key = crypto_hash(data)
                by_crypto_hash.setdefault(key, []).append((path, data))

            for hash_group in by_crypto_hash.values():
                if len(hash_group) < 2:
                    continue

                exact_groups = []
                for path, data in hash_group:
                    placed = False
                    for group in exact_groups:
                        representative, group_paths = group
                        if same_content(data, representative):
                            group_paths.append(path)
                            placed = True
                            break
                    if not placed:
                        exact_groups.append([data, [path]])

                for representative, group_paths in exact_groups:
                    if len(group_paths) >= 2:
                        answer.append(sorted(group_paths))

    answer.sort(key=lambda group: group[0])
    return answer

Time complexity: O(n + B) expected, where n is the number of entries and B is the total number of bytes streamed from readable non-symlink files that survive size filtering. Candidate files may be streamed a constant number of times for fast hashing, cryptographic hashing, and final verification.. Space complexity: O(n + R) auxiliary space, where n is the number of entries and R is the size of the returned duplicate groups. This excludes the input contents..

Hints

  1. Files with different sizes cannot have identical byte content, so use file size as the first grouping key.
  2. Hashes are useful for reducing comparisons, but do not trust hashes alone; verify candidate duplicates byte-by-byte against a representative.
Last updated: Jun 22, 2026

Loading coding console...

PracHub

Master your tech interviews with 8,000+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.

Related Coding Questions

  • Implement a Banking System - Anthropic (medium)
  • Implement Persistent Memoization LRU Cache - Anthropic (hard)
  • Fix a Corrupted Bootloader Instruction - Anthropic (medium)
  • Implement a Time-Aware Task Manager - Anthropic (hard)
  • Implement a Simplified DNS Resolver - Anthropic (hard)