PracHub
QuestionsPremiumCoachesLearningGuidesInterview Prep

Quick Overview

This question evaluates understanding of scalable file-deduplication and system-design concepts, including content hashing, I/O minimization, hash-collision handling, memory constraints, parallelization, incremental updates, and cross-machine deduplication.

  • Medium
  • Anthropic
  • Coding & Algorithms
  • Software Engineer

Detect duplicate files efficiently

Company: Anthropic

Role: Software Engineer

Category: Coding & Algorithms

Difficulty: Medium

Interview Round: Onsite

You are given access to a very large file system containing file paths and read access to file contents. Design an algorithm to identify groups of files that are byte-identical (true duplicates) across all directories. Return groups of file paths where each group contains duplicates. Discuss an approach that filters by size first, then uses partial hashing, then full hashing to minimize I/O; explain how to handle hash collisions, memory constraints, and parallelization. Extend the design to support incremental updates as files are added/modified, cross-machine deduplication, and safe replacement with hard links or content-addressed storage.

Quick Answer: This question evaluates understanding of scalable file-deduplication and system-design concepts, including content hashing, I/O minimization, hash-collision handling, memory constraints, parallelization, incremental updates, and cross-machine deduplication.

Maintain an **incremental index of duplicate files** as files are added, modified, deleted, and queried, and report the current groups of byte-identical files on demand. This models the core index of a real duplicate-file detector. In practice such a pipeline filters candidates by file size, then by a partial signature from the start/end of the file, then by a full hash, and finally by exact byte comparison to rule out hash collisions. Here, file contents are already given as strings, so two files count as duplicates only when their contents are **exactly equal**. ## Function Implement: ```python def solution(operations): ``` `operations` is a list of operations to process **in order**. Each operation is a list/tuple whose first element is the operation kind. ## Operations At any moment, each **path** maps to at most one current content (paths are unique). - **`('ADD', path, content)`** — set the current content at `path` to `content`. If `path` already exists, its previous content is **replaced**. - **`('MODIFY', path, content)`** — identical behavior to `ADD`: set the current content at `path` to `content`, replacing any previous content. If `path` does not already exist, it is created. - **`('DELETE', path)`** — remove `path` and its content if it exists. If `path` does not exist, the operation has **no effect**. - **`('QUERY',)`** — report the current duplicate groups (see below). ## What a QUERY returns A **duplicate group** is a set of paths whose current contents are all exactly equal. For each `QUERY`, collect every duplicate group that contains **at least 2 paths**, and: - Sort the paths **within each group** lexicographically. - Sort the **list of groups** by each group's first (lexicographically smallest) path. Paths whose content is unique (no other path currently shares it) are **not** included. If there are no duplicate groups, the `QUERY` result is an empty list `[]`. ## Return value Return a list containing one result **per `QUERY` operation**, in the order the queries occur. Each result is itself a list of groups, and each group is a list of path strings. ## Notes - Content may be the **empty string** `""`; two empty-content files are duplicates of each other. - Equality is exact: files whose contents only partially match (e.g. share a prefix and suffix) are **not** duplicates. ## Constraints - `1 <= len(operations) <= 2 * 10^5` - `0 <= len(content) <= 10^5` for a single `ADD`/`MODIFY` operation - The sum of `len(content)` over all `ADD`/`MODIFY` operations is `<= 10^6` - `1 <= len(path) <= 200`

Constraints

  • 1 <= len(operations) <= 2 * 10^5
  • 0 <= len(content) <= 10^5 for a single ADD/MODIFY operation
  • The sum of len(content) over all ADD/MODIFY operations is <= 10^6
  • 1 <= len(path) <= 200; ADD/MODIFY replace existing content at that path, and DELETE on a missing path has no effect

Examples

Input: ([('ADD', '/a.txt', 'hello'), ('ADD', '/b.txt', 'world'), ('ADD', '/c.txt', 'hello'), ('QUERY',), ('MODIFY', '/b.txt', 'hello'), ('QUERY',)],)

Expected Output: [[['/a.txt', '/c.txt']], [['/a.txt', '/b.txt', '/c.txt']]]

Explanation: The first QUERY sees '/a.txt' and '/c.txt' with identical contents. After modifying '/b.txt' to 'hello', all three files are duplicates.

Input: ([('ADD', '/x/1', 'aa'), ('ADD', '/x/2', 'bb'), ('ADD', '/y/3', 'aa'), ('ADD', '/z/4', 'bb'), ('QUERY',), ('DELETE', '/x/2'), ('QUERY',)],)

Expected Output: [[['/x/1', '/y/3'], ['/x/2', '/z/4']], [['/x/1', '/y/3']]]

Explanation: Initially there are two duplicate groups: the 'aa' files and the 'bb' files. After deleting '/x/2', only the 'aa' group remains.

Input: ([('QUERY',), ('ADD', '/empty', ''), ('ADD', '/also_empty', ''), ('ADD', '/unique', 'x'), ('QUERY',), ('MODIFY', '/also_empty', 'x'), ('QUERY',)],)

Expected Output: [[], [['/also_empty', '/empty']], [['/also_empty', '/unique']]]

Explanation: The first QUERY runs on an empty index. Then two empty files become duplicates. After '/also_empty' is changed to 'x', it becomes a duplicate of '/unique' instead.

Input: ([('ADD', '/a', 'abc'), ('MODIFY', '/b', 'abc'), ('ADD', '/a', 'def'), ('DELETE', '/missing'), ('QUERY',)],)

Expected Output: [[]]

Explanation: MODIFY acts like an upsert, so '/b' is created with 'abc'. Then '/a' is replaced with 'def'. Deleting a missing path does nothing, so no duplicate group exists at QUERY time.

Input: ([('ADD', '/p1', 'abcdefgh12345678X87654321'), ('ADD', '/p2', 'abcdefghABCDEFGHX87654321'), ('ADD', '/p3', 'abcdefgh12345678X87654321'), ('QUERY',)],)

Expected Output: [[['/p1', '/p3']]]

Explanation: Files '/p1' and '/p2' have the same length and the same beginning and ending segments, so a partial-signature filter would keep them as candidates. Exact content still shows only '/p1' and '/p3' are true duplicates.

Solution

def solution(operations):
    import hashlib

    def make_candidate(content):
        size = len(content)
        head = content[:8]
        tail = content[-8:]
        full_hash = hashlib.sha256(content.encode('utf-8')).hexdigest()
        return (size, head, tail, full_hash)

    path_meta = {}
    candidate_buckets = {}
    duplicate_ids = set()

    def remove(path):
        if path not in path_meta:
            return
        cand, content = path_meta.pop(path)
        exact_map = candidate_buckets[cand]
        group = exact_map[content]
        group.remove(path)
        exact_id = (cand, content)

        if len(group) >= 2:
            duplicate_ids.add(exact_id)
        else:
            duplicate_ids.discard(exact_id)

        if not group:
            del exact_map[content]
            duplicate_ids.discard(exact_id)
            if not exact_map:
                del candidate_buckets[cand]

    def add_or_replace(path, content):
        remove(path)
        cand = make_candidate(content)
        exact_map = candidate_buckets.setdefault(cand, {})
        group = exact_map.setdefault(content, set())
        group.add(path)
        exact_id = (cand, content)
        if len(group) >= 2:
            duplicate_ids.add(exact_id)
        path_meta[path] = (cand, content)

    answers = []

    for op in operations:
        kind = op[0]
        if kind == 'ADD' or kind == 'MODIFY':
            _, path, content = op
            add_or_replace(path, content)
        elif kind == 'DELETE':
            _, path = op
            remove(path)
        elif kind == 'QUERY':
            groups = []
            for cand, content in duplicate_ids:
                groups.append(sorted(candidate_buckets[cand][content]))
            groups.sort(key=lambda group: (group[0], group))
            answers.append(groups)
        else:
            raise ValueError(f'Unknown operation: {kind}')

    return answers
Explanation
The solution maintains an **incremental duplicate index** that stays correct across `ADD`/`MODIFY`/`DELETE` and answers each `QUERY` in time proportional to what it returns. **Cascade key.** `make_candidate(content)` builds a tuple `(size, head, tail, full_hash)` — exactly the real-world filter chain (file size → partial signature from the first/last 8 chars → SHA-256 of the whole content). This is a cheap "probably-equal" fingerprint. **State.** - `path_meta`: `path → (candidate, content)`, so a path can be found and removed in O(1). - `candidate_buckets`: `candidate → {content: set(paths)}`. The inner dict keys on the **full content**, so two strings that share a candidate fingerprint but differ are kept in separate groups — this is what eliminates hash collisions and guarantees groups are *true* (byte-exact) duplicates. - `duplicate_ids`: the set of `(candidate, content)` keys whose path-set currently has **≥2 members**. **Mutations.** `add_or_replace` first `remove`s any prior content at the path (so MODIFY = delete + insert), then inserts the path into the right inner group and adds the key to `duplicate_ids` once the group reaches size 2. `remove` pulls the path out and re-checks: if the group dropped below 2 (or emptied), the key is discarded from `duplicate_ids` and empty containers are pruned. **Query.** Because `duplicate_ids` already holds exactly the groups with ≥2 paths, QUERY just iterates those keys, sorts each group lexicographically, and sorts the groups by their first path (the `(group[0], group)` key) — no scan over non-duplicate files. Correctness follows from `duplicate_ids` being kept in lock-step with every mutation.

Time complexity: ADD/MODIFY: O(L) average (L = content length, dominated by hashing + dict lookups keyed on content); DELETE: O(L) average for the content-keyed lookup, O(1) for the rest; QUERY: O(P log P + G log G) where P is the total paths returned and G the number of duplicate groups in that query (sorting each group plus sorting the groups).. Space complexity: O(F + C), where F is the number of currently indexed files (entries in path_meta and the path sets) and C is the total stored content length (each path's content is retained as a dict key)..

Hints

  1. Keep a forward map from path to its current bucket so an update can remove the old version in O(1) average time before reinserting the new one.
  2. Use a reverse index keyed by size, partial signature, full hash, and exact content. Also track which exact-content buckets currently have at least two paths so QUERY does not need to rescan every stored file.
Last updated: Apr 20, 2026

Loading coding console...

PracHub

Master your tech interviews with 8,000+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.

Related Coding Questions

  • Implement a Banking System - Anthropic (medium)
  • Implement Persistent Memoization LRU Cache - Anthropic (hard)
  • Fix a Corrupted Bootloader Instruction - Anthropic (medium)
  • Implement a Time-Aware Task Manager - Anthropic (hard)
  • Implement a Simplified DNS Resolver - Anthropic (hard)