PracHub
QuestionsPremiumCoachesLearningGuidesInterview Prep

Quick Overview

This question evaluates a candidate's competency in file system traversal, robust I/O handling including symbolic links and cycles, content-based deduplication, and algorithmic efficiency within the Coding & Algorithms domain.

  • Medium
  • Anthropic
  • Coding & Algorithms
  • Software Engineer

Design file deduplication across nested directories

Company: Anthropic

Role: Software Engineer

Category: Coding & Algorithms

Difficulty: Medium

Interview Round: Technical Screen

Design and implement a file deduplication tool that, given a root directory, identifies groups of duplicate files. Requirements: traverse nested directories safely; handle symbolic links and potential cycles; minimize I/O with a multi-stage strategy (e.g., compare file sizes, then partial hashes, then full hashes); support very large files; and output duplicate groups with file paths. Explain your data structures, step-by-step algorithm, time and space complexity, and key test cases.

Quick Answer: This question evaluates a candidate's competency in file system traversal, robust I/O handling including symbolic links and cycles, content-based deduplication, and algorithmic efficiency within the Coding & Algorithms domain.

You are given a simulated file system graph and a root path. Implement a file deduplication tool that traverses everything reachable from the root, follows symbolic links safely, avoids infinite loops caused by cycles, and returns groups of duplicate files. The file system is represented by a dictionary `fs` where each key is a path string and each value is one of these node types: - Directory: `{'type': 'dir', 'children': [child_path1, child_path2, ...]}` - File: `{'type': 'file', 'content': '...'}` - Symbolic link: `{'type': 'symlink', 'target': some_path}` Rules: 1. Start traversal from `root`. 2. Follow symlinks, but safely handle cycles such as a symlink pointing back to an ancestor. 3. Ignore broken symlinks whose target does not exist. 4. Process each real directory and real file at most once after resolving symlinks. 5. Two files are duplicates if their full contents are exactly equal. 6. To mimic a real deduplication tool, use a multi-stage strategy: - compare file sizes first, - then compare a small partial signature, - then compare full hashes and exact content. 7. Return duplicate groups as sorted lists of canonical file paths (the resolved non-symlink file paths). Each group must contain at least 2 paths. 8. Sort the final list of groups by the first path in each group. A strong interview answer should also be able to explain the data structures used, the traversal strategy, why cycle handling is necessary, and the time/space complexity.

Constraints

  • 1 <= len(fs) <= 10^5
  • The file system graph may contain cycles due to symbolic links or repeated child references
  • File contents are strings
  • The total length of all reachable file contents is at most 10^6 characters
  • Broken symlinks may appear and should be ignored

Examples

Input: ({'/': {'type': 'dir', 'children': ['/docs', '/img', '/readme.txt']}, '/docs': {'type': 'dir', 'children': ['/docs/a.txt', '/docs/b.txt']}, '/img': {'type': 'dir', 'children': ['/img/pic.bin']}, '/readme.txt': {'type': 'file', 'content': 'hello'}, '/docs/a.txt': {'type': 'file', 'content': 'alpha'}, '/docs/b.txt': {'type': 'file', 'content': 'hello'}, '/img/pic.bin': {'type': 'file', 'content': 'xyz'}}, '/')

Expected Output: [['/docs/b.txt', '/readme.txt']]

Explanation: Only '/docs/b.txt' and '/readme.txt' have identical content.

Input: ({'/': {'type': 'dir', 'children': ['/A', '/B']}, '/A': {'type': 'dir', 'children': ['/A/f1.txt', '/A/to_root']}, '/B': {'type': 'dir', 'children': ['/B/f2.txt', '/B/link_to_A']}, '/A/f1.txt': {'type': 'file', 'content': 'same-data'}, '/B/f2.txt': {'type': 'file', 'content': 'same-data'}, '/A/to_root': {'type': 'symlink', 'target': '/'}, '/B/link_to_A': {'type': 'symlink', 'target': '/A'}}, '/')

Expected Output: [['/A/f1.txt', '/B/f2.txt']]

Explanation: The symlinks create cycles back to already visited directories, but the traversal remains safe. The two real files are duplicates.

Input: ({'/': {'type': 'dir', 'children': ['/x.txt', '/alias', '/sub']}, '/x.txt': {'type': 'file', 'content': 'qq'}, '/alias': {'type': 'symlink', 'target': '/x.txt'}, '/sub': {'type': 'dir', 'children': ['/sub/y.txt']}, '/sub/y.txt': {'type': 'file', 'content': 'qq'}}, '/')

Expected Output: [['/sub/y.txt', '/x.txt']]

Explanation: The symlink '/alias' points to the same real file as '/x.txt', so that file is processed once. '/sub/y.txt' is a true duplicate of '/x.txt'.

Input: ({'/': {'type': 'dir', 'children': ['/p', '/q', '/r']}, '/p': {'type': 'file', 'content': 'ABCDEFGHxxIJKLMNOP'}, '/q': {'type': 'file', 'content': 'ABCDEFGHyyIJKLMNOP'}, '/r': {'type': 'file', 'content': 'ABCDEFGHxxIJKLMNOP'}}, '/')

Expected Output: [['/p', '/r']]

Explanation: All three files have the same size and the same prefix/suffix, so the partial check is not enough. The full comparison shows that only '/p' and '/r' are duplicates.

Input: ({'/': {'type': 'dir', 'children': ['/broken']}, '/broken': {'type': 'symlink', 'target': '/missing'}}, '/')

Expected Output: []

Explanation: The only child is a broken symlink, so no files are reachable.

Input: ({'/': {'type': 'dir', 'children': ['/a.txt', '/b.txt', '/d', '/e.txt']}, '/a.txt': {'type': 'file', 'content': '1'}, '/b.txt': {'type': 'file', 'content': '1'}, '/d': {'type': 'dir', 'children': ['/d/c.txt', '/d/f.txt']}, '/d/c.txt': {'type': 'file', 'content': '2'}, '/d/f.txt': {'type': 'file', 'content': '2'}, '/e.txt': {'type': 'file', 'content': '3'}}, '/')

Expected Output: [['/a.txt', '/b.txt'], ['/d/c.txt', '/d/f.txt']]

Explanation: There are two separate duplicate groups, and the final list is sorted by each group's first path.

Solution

def solution(fs, root):
    from collections import defaultdict
    import hashlib

    PARTIAL_CHUNK = 8
    HASH_CHUNK = 4096

    def resolve_path(path):
        seen = set()
        cur = path
        while cur in fs and fs[cur].get('type') == 'symlink':
            if cur in seen:
                return None
            seen.add(cur)
            cur = fs[cur].get('target')
        if cur not in fs:
            return None
        return cur

    visited_dirs = set()
    visited_files = set()
    stack = [root]
    files = []

    while stack:
        path = stack.pop()
        real = resolve_path(path)
        if real is None:
            continue

        node = fs.get(real)
        node_type = node.get('type')

        if node_type == 'dir':
            if real in visited_dirs:
                continue
            visited_dirs.add(real)
            for child in reversed(node.get('children', [])):
                stack.append(child)
        elif node_type == 'file':
            if real in visited_files:
                continue
            visited_files.add(real)
            files.append((real, node.get('content', '')))

    size_groups = defaultdict(list)
    for path, content in files:
        size_groups[len(content)].append((path, content))

    def partial_key(content):
        if len(content) <= 2 * PARTIAL_CHUNK:
            return (content, '')
        return (content[:PARTIAL_CHUNK], content[-PARTIAL_CHUNK:])

    def full_hash(content):
        h = hashlib.sha256()
        for i in range(0, len(content), HASH_CHUNK):
            h.update(content[i:i + HASH_CHUNK].encode('utf-8'))
        return h.hexdigest()

    result = []

    for group in size_groups.values():
        if len(group) < 2:
            continue

        partial_groups = defaultdict(list)
        for path, content in group:
            partial_groups[partial_key(content)].append((path, content))

        for candidates in partial_groups.values():
            if len(candidates) < 2:
                continue

            hash_groups = defaultdict(list)
            for path, content in candidates:
                hash_groups[full_hash(content)].append((path, content))

            for same_hash in hash_groups.values():
                if len(same_hash) < 2:
                    continue

                exact_groups = defaultdict(list)
                for path, content in same_hash:
                    exact_groups[content].append(path)

                for paths in exact_groups.values():
                    if len(paths) > 1:
                        result.append(sorted(paths))

    result.sort(key=lambda group: (group[0], len(group), group))
    return result

Time complexity: O(V + E + C + R log R), where V is the number of reachable nodes, E is the number of traversed references, C is the total length of reachable file contents in the worst case, and R is the total number of paths returned in duplicate groups.. Space complexity: O(V + F), where V is the number of reachable graph nodes tracked in visited sets and F is the number of reachable real files grouped during deduplication..

Hints

  1. Treat the structure as a graph, not a tree. Keep visited sets for resolved directories and resolved files so symlink cycles do not cause infinite traversal.
  2. Do not fully compare every file against every other file. First group by size, then by a cheap prefix/suffix signature, and only then compute a full hash for the remaining candidates.
Last updated: May 18, 2026

Loading coding console...

PracHub

Master your tech interviews with 8,500+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.

Related Coding Questions

  • Implement Persistent Memoization LRU Cache - Anthropic (hard)
  • Fix a Corrupted Bootloader Instruction - Anthropic (medium)
  • Implement a Time-Aware Task Manager - Anthropic (hard)
  • Implement a Simplified DNS Resolver - Anthropic (hard)
  • Implement Task Management and Duplicate Detection - Anthropic (medium)