PracHub
QuestionsPremiumLearningGuidesCheatsheetNEW

Quick Overview

This question evaluates understanding of scalable file-deduplication and system-design concepts, including content hashing, I/O minimization, hash-collision handling, memory constraints, parallelization, incremental updates, and cross-machine deduplication.

  • Medium
  • Anthropic
  • Coding & Algorithms
  • Software Engineer

Detect duplicate files efficiently

Company: Anthropic

Role: Software Engineer

Category: Coding & Algorithms

Difficulty: Medium

Interview Round: Onsite

You are given access to a very large file system containing file paths and read access to file contents. Design an algorithm to identify groups of files that are byte-identical (true duplicates) across all directories. Return groups of file paths where each group contains duplicates. Discuss an approach that filters by size first, then uses partial hashing, then full hashing to minimize I/O; explain how to handle hash collisions, memory constraints, and parallelization. Extend the design to support incremental updates as files are added/modified, cross-machine deduplication, and safe replacement with hard links or content-addressed storage.

Quick Answer: This question evaluates understanding of scalable file-deduplication and system-design concepts, including content hashing, I/O minimization, hash-collision handling, memory constraints, parallelization, incremental updates, and cross-machine deduplication.

You are building the core index for a duplicate-file detector. In a real file system, an efficient pipeline avoids unnecessary I/O by filtering candidate duplicates by file size, then a partial signature from the beginning/end of the file, then a full hash, and finally exact byte comparison to eliminate hash collisions. For this coding problem, file contents are already provided as strings, and you must maintain duplicate groups incrementally as files are added, modified, deleted, and queried. Process the operations in order. A file path is unique at any moment. 'ADD' and 'MODIFY' both set the current content of a path; if the path already exists, its previous content is replaced. 'DELETE' removes a path if it exists. 'QUERY' asks for the current groups of true duplicates. Two files are true duplicates if their contents are exactly equal. For each 'QUERY', return all duplicate groups that contain at least 2 paths. Each group must be sorted lexicographically, and the list of groups must be sorted by the first path in each group. This models the same core index that could later be sharded across machines or used before safe deduplication with hard links or content-addressed storage.

Constraints

  • 1 <= len(operations) <= 2 * 10^5
  • 0 <= len(content) <= 10^5 for a single ADD/MODIFY operation
  • The sum of len(content) over all ADD/MODIFY operations is <= 10^6
  • 1 <= len(path) <= 200; ADD/MODIFY replace existing content at that path, and DELETE on a missing path has no effect

Examples

Input: ([('ADD', '/a.txt', 'hello'), ('ADD', '/b.txt', 'world'), ('ADD', '/c.txt', 'hello'), ('QUERY',), ('MODIFY', '/b.txt', 'hello'), ('QUERY',)],)

Expected Output: [[['/a.txt', '/c.txt']], [['/a.txt', '/b.txt', '/c.txt']]]

Explanation: The first QUERY sees '/a.txt' and '/c.txt' with identical contents. After modifying '/b.txt' to 'hello', all three files are duplicates.

Input: ([('ADD', '/x/1', 'aa'), ('ADD', '/x/2', 'bb'), ('ADD', '/y/3', 'aa'), ('ADD', '/z/4', 'bb'), ('QUERY',), ('DELETE', '/x/2'), ('QUERY',)],)

Expected Output: [[['/x/1', '/y/3'], ['/x/2', '/z/4']], [['/x/1', '/y/3']]]

Explanation: Initially there are two duplicate groups: the 'aa' files and the 'bb' files. After deleting '/x/2', only the 'aa' group remains.

Input: ([('QUERY',), ('ADD', '/empty', ''), ('ADD', '/also_empty', ''), ('ADD', '/unique', 'x'), ('QUERY',), ('MODIFY', '/also_empty', 'x'), ('QUERY',)],)

Expected Output: [[], [['/also_empty', '/empty']], [['/also_empty', '/unique']]]

Explanation: The first QUERY runs on an empty index. Then two empty files become duplicates. After '/also_empty' is changed to 'x', it becomes a duplicate of '/unique' instead.

Input: ([('ADD', '/a', 'abc'), ('MODIFY', '/b', 'abc'), ('ADD', '/a', 'def'), ('DELETE', '/missing'), ('QUERY',)],)

Expected Output: [[]]

Explanation: MODIFY acts like an upsert, so '/b' is created with 'abc'. Then '/a' is replaced with 'def'. Deleting a missing path does nothing, so no duplicate group exists at QUERY time.

Input: ([('ADD', '/p1', 'abcdefgh12345678X87654321'), ('ADD', '/p2', 'abcdefghABCDEFGHX87654321'), ('ADD', '/p3', 'abcdefgh12345678X87654321'), ('QUERY',)],)

Expected Output: [[['/p1', '/p3']]]

Explanation: Files '/p1' and '/p2' have the same length and the same beginning and ending segments, so a partial-signature filter would keep them as candidates. Exact content still shows only '/p1' and '/p3' are true duplicates.

Solution

def solution(operations):
    import hashlib

    def make_candidate(content):
        size = len(content)
        head = content[:8]
        tail = content[-8:]
        full_hash = hashlib.sha256(content.encode('utf-8')).hexdigest()
        return (size, head, tail, full_hash)

    path_meta = {}
    candidate_buckets = {}
    duplicate_ids = set()

    def remove(path):
        if path not in path_meta:
            return
        cand, content = path_meta.pop(path)
        exact_map = candidate_buckets[cand]
        group = exact_map[content]
        group.remove(path)
        exact_id = (cand, content)

        if len(group) >= 2:
            duplicate_ids.add(exact_id)
        else:
            duplicate_ids.discard(exact_id)

        if not group:
            del exact_map[content]
            duplicate_ids.discard(exact_id)
            if not exact_map:
                del candidate_buckets[cand]

    def add_or_replace(path, content):
        remove(path)
        cand = make_candidate(content)
        exact_map = candidate_buckets.setdefault(cand, {})
        group = exact_map.setdefault(content, set())
        group.add(path)
        exact_id = (cand, content)
        if len(group) >= 2:
            duplicate_ids.add(exact_id)
        path_meta[path] = (cand, content)

    answers = []

    for op in operations:
        kind = op[0]
        if kind == 'ADD' or kind == 'MODIFY':
            _, path, content = op
            add_or_replace(path, content)
        elif kind == 'DELETE':
            _, path = op
            remove(path)
        elif kind == 'QUERY':
            groups = []
            for cand, content in duplicate_ids:
                groups.append(sorted(candidate_buckets[cand][content]))
            groups.sort(key=lambda group: (group[0], group))
            answers.append(groups)
        else:
            raise ValueError(f'Unknown operation: {kind}')

    return answers

Time complexity: ADD/MODIFY: O(L) average, DELETE: O(1) average, QUERY: O(G log G + P log P), where L is content length, G is the number of duplicate groups, and P is the total number of paths returned in that query.. Space complexity: O(F + C), where F is the number of indexed files and C is the total stored content length..

Hints

  1. Keep a forward map from path to its current bucket so an update can remove the old version in O(1) average time before reinserting the new one.
  2. Use a reverse index keyed by size, partial signature, full hash, and exact content. Also track which exact-content buckets currently have at least two paths so QUERY does not need to rescan every stored file.
Last updated: Apr 20, 2026

Loading coding console...

PracHub

Master your tech interviews with 7,500+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.

Related Coding Questions

  • Convert Samples into Event Intervals - Anthropic (medium)
  • Convert State Stream to Events - Anthropic (medium)
  • Build a concurrent web crawler - Anthropic (medium)
  • Implement a Parallel Image Processor - Anthropic (medium)
  • Implement a Batch Image Processor - Anthropic (medium)