Design duplicate-file detection using size
Company: Applied Intuition
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: Medium
Interview Round: Technical Screen
Quick Answer: Iteratively traverse the tree from `root`, ignoring symlinks and counting each unreadable directory or file as one permission error; deduplicate readable files by inode, keeping the lexicographically smallest path per inode, then group the surviving representatives by size and report every size shared by at least two distinct inodes. The question probes filesystem metadata, scalable traversal, symlink and hard-link edge cases, and memory/I/O trade-offs when grouping files by attributes such as size.
Constraints
- 1 <= len(filesystem) <= 200000
- 0 <= file size <= 10^12
- Each node in `filesystem` is a tuple: directories are `('dir', entries, readable)`, files are `('file', size, inode, readable)`, and symlinks are `('symlink', target)`
- All paths are unique absolute paths, and `root` exists in `filesystem`
- If two readable reachable file nodes share the same inode, they represent the same physical file and have the same size
Examples
Input: ({'/': ('dir', ['docs', 'pics', 'shortcut', 'secret'], True), '/docs': ('dir', ['a.txt', 'b.txt', 'same.txt'], True), '/docs/a.txt': ('file', 100, 1, True), '/docs/b.txt': ('file', 100, 2, True), '/docs/same.txt': ('file', 100, 1, True), '/pics': ('dir', ['img1.jpg', 'img2.jpg'], True), '/pics/img1.jpg': ('file', 200, 3, True), '/pics/img2.jpg': ('file', 200, 4, False), '/shortcut': ('symlink', '/'), '/secret': ('dir', ['hidden.txt'], False), '/secret/hidden.txt': ('file', 100, 5, True)}, '/')
Expected Output: ([['/docs/a.txt', '/docs/b.txt']], 2)
Explanation: `/shortcut` is ignored because it is a symlink. `/secret` is unreadable, so it adds one permission error and is not traversed. `/pics/img2.jpg` is an unreadable file, adding another error. `/docs/a.txt` and `/docs/same.txt` share inode 1, so only `/docs/a.txt` is kept as the representative hard link. The only duplicate-by-size group is size 100 with two distinct inodes.
Input: ({'/': ('dir', ['a'], False), '/a': ('file', 10, 1, True)}, '/')
Expected Output: ([], 1)
Explanation: The root directory itself is unreadable, so traversal stops immediately with one permission error.
Input: ({'/': ('dir', ['x', 'y', 'link'], True), '/x': ('dir', ['f1', 'f2'], True), '/x/f1': ('file', 0, 10, True), '/x/f2': ('file', 0, 11, True), '/y': ('dir', ['g1', 'g2', 'g3'], True), '/y/g1': ('file', 0, 10, True), '/y/g2': ('file', 5, 12, True), '/y/g3': ('file', 5, 13, True), '/link': ('symlink', '/x')}, '/')
Expected Output: ([['/x/f1', '/x/f2'], ['/y/g2', '/y/g3']], 0)
Explanation: The symlink is ignored. `/x/f1` and `/y/g1` are hard links to the same inode, so only `/x/f1` is kept. Size 0 has two distinct inodes (`/x/f1`, `/x/f2`), and size 5 has two distinct inodes (`/y/g2`, `/y/g3`).
Input: ({'/': ('dir', ['a', 'b'], True), '/a': ('file', 7, 1, True), '/b': ('file', 7, 1, True)}, '/')
Expected Output: ([], 0)
Explanation: Both paths point to the same inode, so there is only one physical file. That means there is no duplicate group even though the sizes match.
Solution
def solution(filesystem, root):
    from collections import defaultdict

    def child_path(parent, name):
        # Join a directory path and an entry name without doubling slashes.
        return '/' + name if parent == '/' else parent + '/' + name

    if root not in filesystem:
        return ([], 0)

    permission_errors = 0
    stack = [root]        # iterative DFS avoids recursion-depth limits on deep trees
    inode_to_path = {}    # inode -> lexicographically smallest path (hard-link representative)
    inode_to_size = {}    # inode -> file size

    while stack:
        path = stack.pop()
        node = filesystem.get(path)
        if node is None:
            continue
        kind = node[0]
        if kind == 'symlink':
            continue                     # symlinks are ignored, which also prevents cycles
        if kind == 'dir':
            entries, readable = node[1], node[2]
            if not readable:
                permission_errors += 1   # unreadable directory: count once, do not descend
                continue
            for name in entries:
                stack.append(child_path(path, name))
        elif kind == 'file':
            size, inode, readable = node[1], node[2], node[3]
            if not readable:
                permission_errors += 1   # unreadable file: count it, skip its size
                continue
            best = inode_to_path.get(inode)
            if best is None or path < best:
                inode_to_path[inode] = path   # keep one representative path per inode
                inode_to_size[inode] = size

    # Group the per-inode representatives by size; only sizes shared by
    # two or more distinct inodes form a duplicate group.
    size_to_paths = defaultdict(list)
    for inode, path in inode_to_path.items():
        size_to_paths[inode_to_size[inode]].append(path)

    groups = []
    for size in sorted(size_to_paths):
        paths = sorted(size_to_paths[size])
        if len(paths) >= 2:
            groups.append(paths)
    return (groups, permission_errors)

Time complexity: O(n + u log u), where n is the number of reachable nodes visited and u is the number of unique readable file inodes. Space complexity: O(u + s), where u is the number of unique readable file inodes and s is the maximum stack size during traversal.
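As a sanity check, the solution can be run directly on the documented examples; the asserts below reuse the inputs and expected outputs from the Examples section verbatim.

example1 = {
    '/': ('dir', ['docs', 'pics', 'shortcut', 'secret'], True),
    '/docs': ('dir', ['a.txt', 'b.txt', 'same.txt'], True),
    '/docs/a.txt': ('file', 100, 1, True),
    '/docs/b.txt': ('file', 100, 2, True),
    '/docs/same.txt': ('file', 100, 1, True),
    '/pics': ('dir', ['img1.jpg', 'img2.jpg'], True),
    '/pics/img1.jpg': ('file', 200, 3, True),
    '/pics/img2.jpg': ('file', 200, 4, False),
    '/shortcut': ('symlink', '/'),
    '/secret': ('dir', ['hidden.txt'], False),
    '/secret/hidden.txt': ('file', 100, 5, True),
}
assert solution(example1, '/') == ([['/docs/a.txt', '/docs/b.txt']], 2)

# Unreadable root: traversal stops immediately with one permission error.
assert solution({'/': ('dir', ['a'], False), '/a': ('file', 10, 1, True)}, '/') == ([], 1)

# Two paths, one inode: a single physical file is never a duplicate of itself.
assert solution({'/': ('dir', ['a', 'b'], True), '/a': ('file', 7, 1, True), '/b': ('file', 7, 1, True)}, '/') == ([], 0)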
Hints
- Use an iterative DFS or BFS with a stack or queue so very deep directory trees do not cause recursion-depth problems.
- First deduplicate by inode, then group the surviving representative paths by file size.
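The same idea carries over to a real POSIX tree. Below is a minimal sketch, assuming a Python environment: it uses os.scandir with follow_symlinks=False and keys physical files by (st_dev, st_ino), since inode numbers are only unique within one device; PermissionError stands in for the puzzle's readable flag, and the function name duplicate_sizes is illustrative, not part of the question.

import os
from collections import defaultdict

def duplicate_sizes(root):
    # Sketch: group files under root by size after deduplicating hard links.
    permission_errors = 0
    inode_to_path = {}   # (st_dev, st_ino) -> representative path
    inode_to_size = {}
    stack = [root]
    while stack:
        path = stack.pop()
        try:
            entries = list(os.scandir(path))
        except PermissionError:
            permission_errors += 1   # unreadable directory: count, do not descend
            continue
        for entry in entries:
            if entry.is_symlink():
                continue             # ignore symlinks, as in the puzzle
            if entry.is_dir(follow_symlinks=False):
                stack.append(entry.path)
            elif entry.is_file(follow_symlinks=False):
                try:
                    st = entry.stat(follow_symlinks=False)
                except PermissionError:
                    permission_errors += 1
                    continue
                key = (st.st_dev, st.st_ino)   # identity of the physical file
                best = inode_to_path.get(key)
                if best is None or entry.path < best:
                    inode_to_path[key] = entry.path
                    inode_to_size[key] = st.st_size
    size_to_paths = defaultdict(list)
    for key, path in inode_to_path.items():
        size_to_paths[inode_to_size[key]].append(path)
    groups = [sorted(paths) for size, paths in sorted(size_to_paths.items())
              if len(paths) >= 2]
    return (groups, permission_errors)

In a real deduplication pipeline, matching sizes only identify candidate duplicates; the size groups returned here would typically be a cheap first filter before any more expensive comparison.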