Design duplicate-file detection using size
Company: Applied Intuition
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: Medium
Interview Round: Technical Screen
Quick Answer: Iteratively traverse the tree from `root`, ignoring symlinks and counting each unreadable directory or file as one permission error; deduplicate readable files by inode, keeping the lexicographically smallest path per inode, then group the surviving representatives by size and report every size shared by at least two distinct inodes. The question probes filesystem metadata, scalable traversal, symlink and hard-link edge cases, and memory/I/O trade-offs when grouping files by attributes such as size.
Constraints
- 1 <= len(filesystem) <= 200000
- 0 <= file size <= 10^12
- Each node in `filesystem` is a tuple: directories are `('dir', entries, readable)`, files are `('file', size, inode, readable)`, and symlinks are `('symlink', target)`
- All paths are unique absolute paths, and `root` exists in `filesystem`
- If two readable reachable file nodes share the same inode, they represent the same physical file and have the same size
Examples
Input: ({'/': ('dir', ['docs', 'pics', 'shortcut', 'secret'], True), '/docs': ('dir', ['a.txt', 'b.txt', 'same.txt'], True), '/docs/a.txt': ('file', 100, 1, True), '/docs/b.txt': ('file', 100, 2, True), '/docs/same.txt': ('file', 100, 1, True), '/pics': ('dir', ['img1.jpg', 'img2.jpg'], True), '/pics/img1.jpg': ('file', 200, 3, True), '/pics/img2.jpg': ('file', 200, 4, False), '/shortcut': ('symlink', '/'), '/secret': ('dir', ['hidden.txt'], False), '/secret/hidden.txt': ('file', 100, 5, True)}, '/')
Expected Output: ([['/docs/a.txt', '/docs/b.txt']], 2)
Explanation: `/shortcut` is ignored because it is a symlink. `/secret` is unreadable, so it adds one permission error and is not traversed. `/pics/img2.jpg` is an unreadable file, adding another error. `/docs/a.txt` and `/docs/same.txt` share inode 1, so only `/docs/a.txt` is kept as the representative hard link. The only duplicate-by-size group is size 100 with two distinct inodes.
Input: ({'/': ('dir', ['a'], False), '/a': ('file', 10, 1, True)}, '/')
Expected Output: ([], 1)
Explanation: The root directory itself is unreadable, so traversal stops immediately with one permission error.
Input: ({'/': ('dir', ['x', 'y', 'link'], True), '/x': ('dir', ['f1', 'f2'], True), '/x/f1': ('file', 0, 10, True), '/x/f2': ('file', 0, 11, True), '/y': ('dir', ['g1', 'g2', 'g3'], True), '/y/g1': ('file', 0, 10, True), '/y/g2': ('file', 5, 12, True), '/y/g3': ('file', 5, 13, True), '/link': ('symlink', '/x')}, '/')
Expected Output: ([['/x/f1', '/x/f2'], ['/y/g2', '/y/g3']], 0)
Explanation: The symlink is ignored. `/x/f1` and `/y/g1` are hard links to the same inode, so only `/x/f1` is kept. Size 0 has two distinct inodes (`/x/f1`, `/x/f2`), and size 5 has two distinct inodes (`/y/g2`, `/y/g3`).
Input: ({'/': ('dir', ['a', 'b'], True), '/a': ('file', 7, 1, True), '/b': ('file', 7, 1, True)}, '/')
Expected Output: ([], 0)
Explanation: Both paths point to the same inode, so there is only one physical file. That means there is no duplicate group even though the sizes match.
Solution
def solution(filesystem, root):
    from collections import defaultdict

    def child_path(parent, name):
        # Join a directory path and an entry name without doubling slashes.
        return '/' + name if parent == '/' else parent + '/' + name

    if root not in filesystem:
        return ([], 0)

    permission_errors = 0
    stack = [root]        # iterative DFS avoids recursion-depth limits on deep trees
    inode_to_path = {}    # inode -> lexicographically smallest path (hard-link representative)
    inode_to_size = {}    # inode -> file size

    while stack:
        path = stack.pop()
        node = filesystem.get(path)
        if node is None:
            continue
        kind = node[0]
        if kind == 'symlink':
            continue                     # symlinks are ignored, which also prevents cycles
        if kind == 'dir':
            entries, readable = node[1], node[2]
            if not readable:
                permission_errors += 1   # unreadable directory: count once, do not descend
                continue
            for name in entries:
                stack.append(child_path(path, name))
        elif kind == 'file':
            size, inode, readable = node[1], node[2], node[3]
            if not readable:
                permission_errors += 1   # unreadable file: count it, skip its size
                continue
            best = inode_to_path.get(inode)
            if best is None or path < best:
                inode_to_path[inode] = path   # keep one representative path per inode
                inode_to_size[inode] = size

    # Group the per-inode representatives by size; only sizes shared by
    # two or more distinct inodes form a duplicate group.
    size_to_paths = defaultdict(list)
    for inode, path in inode_to_path.items():
        size_to_paths[inode_to_size[inode]].append(path)

    groups = []
    for size in sorted(size_to_paths):
        paths = sorted(size_to_paths[size])
        if len(paths) >= 2:
            groups.append(paths)
    return (groups, permission_errors)

Time complexity: O(n + u log u), where n is the number of reachable nodes visited and u is the number of unique readable file inodes. Space complexity: O(u + s), where u is the number of unique readable file inodes and s is the maximum stack size during traversal.
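As a sanity check, the solution can be run directly on the documented examples; the asserts below reuse the inputs and expected outputs from the Examples section verbatim.

example1 = {
    '/': ('dir', ['docs', 'pics', 'shortcut', 'secret'], True),
    '/docs': ('dir', ['a.txt', 'b.txt', 'same.txt'], True),
    '/docs/a.txt': ('file', 100, 1, True),
    '/docs/b.txt': ('file', 100, 2, True),
    '/docs/same.txt': ('file', 100, 1, True),
    '/pics': ('dir', ['img1.jpg', 'img2.jpg'], True),
    '/pics/img1.jpg': ('file', 200, 3, True),
    '/pics/img2.jpg': ('file', 200, 4, False),
    '/shortcut': ('symlink', '/'),
    '/secret': ('dir', ['hidden.txt'], False),
    '/secret/hidden.txt': ('file', 100, 5, True),
}
assert solution(example1, '/') == ([['/docs/a.txt', '/docs/b.txt']], 2)

# Unreadable root: traversal stops immediately with one permission error.
assert solution({'/': ('dir', ['a'], False), '/a': ('file', 10, 1, True)}, '/') == ([], 1)

# Two paths, one inode: a single physical file is never a duplicate of itself.
assert solution({'/': ('dir', ['a', 'b'], True), '/a': ('file', 7, 1, True), '/b': ('file', 7, 1, True)}, '/') == ([], 0)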
Hints
- Use an iterative DFS or BFS with a stack or queue so very deep directory trees do not cause recursion-depth problems.
- First deduplicate by inode, then group the surviving representative paths by file size.
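The same idea carries over to a real POSIX tree. Below is a minimal sketch, assuming a Python environment: it uses os.scandir with follow_symlinks=False and keys physical files by (st_dev, st_ino), since inode numbers are only unique within one device; PermissionError stands in for the puzzle's readable flag, and the function name duplicate_sizes is illustrative, not part of the question.

import os
from collections import defaultdict

def duplicate_sizes(root):
    # Sketch: group files under root by size after deduplicating hard links.
    permission_errors = 0
    inode_to_path = {}   # (st_dev, st_ino) -> representative path
    inode_to_size = {}
    stack = [root]
    while stack:
        path = stack.pop()
        try:
            entries = list(os.scandir(path))
        except PermissionError:
            permission_errors += 1   # unreadable directory: count, do not descend
            continue
        for entry in entries:
            if entry.is_symlink():
                continue             # ignore symlinks, as in the puzzle
            if entry.is_dir(follow_symlinks=False):
                stack.append(entry.path)
            elif entry.is_file(follow_symlinks=False):
                try:
                    st = entry.stat(follow_symlinks=False)
                except PermissionError:
                    permission_errors += 1
                    continue
                key = (st.st_dev, st.st_ino)   # identity of the physical file
                best = inode_to_path.get(key)
                if best is None or entry.path < best:
                    inode_to_path[key] = entry.path
                    inode_to_size[key] = st.st_size
    size_to_paths = defaultdict(list)
    for key, path in inode_to_path.items():
        size_to_paths[inode_to_size[key]].append(path)
    groups = [sorted(paths) for size, paths in sorted(size_to_paths.items())
              if len(paths) >= 2]
    return (groups, permission_errors)

In a real deduplication pipeline, matching sizes only identify candidate duplicates; the size groups returned here would typically be a cheap first filter before any more expensive comparison.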