PracHub
QuestionsPremiumLearningGuidesCheatsheetNEWCoaches

Quick Overview

This question evaluates a candidate's understanding of filesystem metadata, scalable traversal algorithms, handling symlink and hard link edge cases, and memory/I/O trade-offs when grouping files by attributes such as size.

  • Medium
  • Applied Intuition
  • Coding & Algorithms
  • Software Engineer

Design duplicate-file detection using size

Company: Applied Intuition

Role: Software Engineer

Category: Coding & Algorithms

Difficulty: Medium

Interview Round: Technical Screen

You are given a large POSIX-like filesystem containing millions of files across nested directories. Design an algorithm to find groups of duplicate files when you are not allowed to read file contents; you may only use metadata such as file size, timestamps, permissions, and inode information. Describe how you will traverse the directory tree, avoid infinite loops with symlinks, account for hard links, and handle permission errors. Explain your data structures (e.g., mapping sizes to file lists), memory and I/O considerations for very large datasets, and how you would stream or partition the work. Define what 'duplicate' means under the no-content-read constraint, discuss potential false positives and their impact, and provide time and space complexity analyses.

Quick Answer: This question evaluates a candidate's understanding of filesystem metadata, scalable traversal algorithms, handling symlink and hard link edge cases, and memory/I/O trade-offs when grouping files by attributes such as size.

You are given a simulated POSIX-like filesystem and a starting directory `root`. Implement the core of a duplicate-file detector when file contents cannot be read. Each entry in `filesystem` is stored by absolute path and is one of: - `('dir', [child_name1, child_name2, ...], readable)` - `('file', size, inode, readable)` - `('symlink', target_path)` Traverse the tree starting from `root` and find duplicate files using only metadata. Rules: 1. Never follow symlinks. This avoids infinite loops. 2. If a directory is unreadable, count one permission error and do not traverse inside it. 3. If a file is unreadable, count one permission error and skip it. 4. Hard links are represented by multiple file paths with the same inode. They refer to the same physical file, so they must be counted only once. Use the lexicographically smallest readable reachable path as that inode's representative. 5. Under the no-content-read constraint, two files are considered duplicates if they have different inodes but the same size. Return a tuple `(groups, permission_errors)` where: - `groups` is a list of duplicate groups. - Each duplicate group is a lexicographically sorted list of representative file paths. - Only include sizes that have at least 2 distinct representative files. - Order `groups` by increasing file size. This definition can produce false positives in real systems, but for this problem it is the required behavior.

Constraints

  • 1 <= len(filesystem) <= 200000
  • 0 <= file size <= 10^12
  • All paths are unique absolute paths, and `root` exists in `filesystem`
  • If two readable reachable file nodes share the same inode, they represent the same physical file and have the same size

Examples

Input: ({'/': ('dir', ['docs', 'pics', 'shortcut', 'secret'], True), '/docs': ('dir', ['a.txt', 'b.txt', 'same.txt'], True), '/docs/a.txt': ('file', 100, 1, True), '/docs/b.txt': ('file', 100, 2, True), '/docs/same.txt': ('file', 100, 1, True), '/pics': ('dir', ['img1.jpg', 'img2.jpg'], True), '/pics/img1.jpg': ('file', 200, 3, True), '/pics/img2.jpg': ('file', 200, 4, False), '/shortcut': ('symlink', '/'), '/secret': ('dir', ['hidden.txt'], False), '/secret/hidden.txt': ('file', 100, 5, True)}, '/')

Expected Output: ([['/docs/a.txt', '/docs/b.txt']], 2)

Explanation: `/shortcut` is ignored because it is a symlink. `/secret` is unreadable, so it adds one permission error and is not traversed. `/pics/img2.jpg` is an unreadable file, adding another error. `/docs/a.txt` and `/docs/same.txt` share inode 1, so only `/docs/a.txt` is kept as the representative hard link. The only duplicate-by-size group is size 100 with two distinct inodes.

Input: ({'/': ('dir', ['a'], False), '/a': ('file', 10, 1, True)}, '/')

Expected Output: ([], 1)

Explanation: The root directory itself is unreadable, so traversal stops immediately with one permission error.

Input: ({'/': ('dir', ['x', 'y', 'link'], True), '/x': ('dir', ['f1', 'f2'], True), '/x/f1': ('file', 0, 10, True), '/x/f2': ('file', 0, 11, True), '/y': ('dir', ['g1', 'g2', 'g3'], True), '/y/g1': ('file', 0, 10, True), '/y/g2': ('file', 5, 12, True), '/y/g3': ('file', 5, 13, True), '/link': ('symlink', '/x')}, '/')

Expected Output: ([['/x/f1', '/x/f2'], ['/y/g2', '/y/g3']], 0)

Explanation: The symlink is ignored. `/x/f1` and `/y/g1` are hard links to the same inode, so only `/x/f1` is kept. Size 0 has two distinct inodes (`/x/f1`, `/x/f2`), and size 5 has two distinct inodes (`/y/g2`, `/y/g3`).

Input: ({'/': ('dir', ['a', 'b'], True), '/a': ('file', 7, 1, True), '/b': ('file', 7, 1, True)}, '/')

Expected Output: ([], 0)

Explanation: Both paths point to the same inode, so there is only one physical file. That means there is no duplicate group even though the sizes match.

Solution

def solution(filesystem, root):
    from collections import defaultdict

    def child_path(parent, name):
        return '/' + name if parent == '/' else parent + '/' + name

    if root not in filesystem:
        return ([], 0)

    permission_errors = 0
    stack = [root]
    inode_to_path = {}
    inode_to_size = {}

    while stack:
        path = stack.pop()
        node = filesystem.get(path)
        if node is None:
            continue

        kind = node[0]

        if kind == 'symlink':
            continue

        if kind == 'dir':
            entries, readable = node[1], node[2]
            if not readable:
                permission_errors += 1
                continue
            for name in entries:
                stack.append(child_path(path, name))

        elif kind == 'file':
            size, inode, readable = node[1], node[2], node[3]
            if not readable:
                permission_errors += 1
                continue

            best = inode_to_path.get(inode)
            if best is None or path < best:
                inode_to_path[inode] = path
                inode_to_size[inode] = size

    size_to_paths = defaultdict(list)
    for inode, path in inode_to_path.items():
        size_to_paths[inode_to_size[inode]].append(path)

    groups = []
    for size in sorted(size_to_paths):
        paths = sorted(size_to_paths[size])
        if len(paths) >= 2:
            groups.append(paths)

    return (groups, permission_errors)

Time complexity: O(n + u log u), where n is the number of reachable nodes visited and u is the number of unique readable file inodes. Space complexity: O(u + s), where u is the number of unique readable file inodes and s is the maximum stack size during traversal.

Hints

  1. Use an iterative DFS or BFS with a stack or queue so very deep directory trees do not cause recursion-depth problems.
  2. First deduplicate by inode, then group the surviving representative paths by file size.
Last updated: May 9, 2026

Loading coding console...

PracHub

Master your tech interviews with 7,500+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.

Related Coding Questions

  • Design a nested transaction store - Applied Intuition (Medium)
  • Design a coupon pricing engine - Applied Intuition (Medium)
  • Implement transactional key–value store - Applied Intuition (Medium)
  • Find grid cell minimizing sum distances - Applied Intuition (Medium)
  • Design a transactional in-memory key–value store - Applied Intuition (Medium)