Design file deduplication across nested directories
Company: Anthropic
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: Medium
Interview Round: Technical Screen
Quick Answer: This question evaluates a candidate's competency in file system traversal, robust I/O handling including symbolic links and cycles, content-based deduplication, and algorithmic efficiency within the Coding & Algorithms domain.
Constraints
- 1 <= len(fs) <= 10^5
- The file system graph may contain cycles due to symbolic links or repeated child references
- File contents are strings
- The total length of all reachable file contents is at most 10^6 characters
- Broken symlinks may appear and should be ignored
Examples
Input: ({'/': {'type': 'dir', 'children': ['/docs', '/img', '/readme.txt']}, '/docs': {'type': 'dir', 'children': ['/docs/a.txt', '/docs/b.txt']}, '/img': {'type': 'dir', 'children': ['/img/pic.bin']}, '/readme.txt': {'type': 'file', 'content': 'hello'}, '/docs/a.txt': {'type': 'file', 'content': 'alpha'}, '/docs/b.txt': {'type': 'file', 'content': 'hello'}, '/img/pic.bin': {'type': 'file', 'content': 'xyz'}}, '/')
Expected Output: [['/docs/b.txt', '/readme.txt']]
Explanation: Only '/docs/b.txt' and '/readme.txt' have identical content.
Input: ({'/': {'type': 'dir', 'children': ['/A', '/B']}, '/A': {'type': 'dir', 'children': ['/A/f1.txt', '/A/to_root']}, '/B': {'type': 'dir', 'children': ['/B/f2.txt', '/B/link_to_A']}, '/A/f1.txt': {'type': 'file', 'content': 'same-data'}, '/B/f2.txt': {'type': 'file', 'content': 'same-data'}, '/A/to_root': {'type': 'symlink', 'target': '/'}, '/B/link_to_A': {'type': 'symlink', 'target': '/A'}}, '/')
Expected Output: [['/A/f1.txt', '/B/f2.txt']]
Explanation: The symlinks create cycles back to already visited directories, but the traversal remains safe. The two real files are duplicates.
Input: ({'/': {'type': 'dir', 'children': ['/x.txt', '/alias', '/sub']}, '/x.txt': {'type': 'file', 'content': 'qq'}, '/alias': {'type': 'symlink', 'target': '/x.txt'}, '/sub': {'type': 'dir', 'children': ['/sub/y.txt']}, '/sub/y.txt': {'type': 'file', 'content': 'qq'}}, '/')
Expected Output: [['/sub/y.txt', '/x.txt']]
Explanation: The symlink '/alias' points to the same real file as '/x.txt', so that file is processed once. '/sub/y.txt' is a true duplicate of '/x.txt'.
Input: ({'/': {'type': 'dir', 'children': ['/p', '/q', '/r']}, '/p': {'type': 'file', 'content': 'ABCDEFGHxxIJKLMNOP'}, '/q': {'type': 'file', 'content': 'ABCDEFGHyyIJKLMNOP'}, '/r': {'type': 'file', 'content': 'ABCDEFGHxxIJKLMNOP'}}, '/')
Expected Output: [['/p', '/r']]
Explanation: All three files have the same size and the same prefix/suffix, so the partial check is not enough. The full comparison shows that only '/p' and '/r' are duplicates.
Input: ({'/': {'type': 'dir', 'children': ['/broken']}, '/broken': {'type': 'symlink', 'target': '/missing'}}, '/')
Expected Output: []
Explanation: The only child is a broken symlink, so no files are reachable.
Input: ({'/': {'type': 'dir', 'children': ['/a.txt', '/b.txt', '/d', '/e.txt']}, '/a.txt': {'type': 'file', 'content': '1'}, '/b.txt': {'type': 'file', 'content': '1'}, '/d': {'type': 'dir', 'children': ['/d/c.txt', '/d/f.txt']}, '/d/c.txt': {'type': 'file', 'content': '2'}, '/d/f.txt': {'type': 'file', 'content': '2'}, '/e.txt': {'type': 'file', 'content': '3'}}, '/')
Expected Output: [['/a.txt', '/b.txt'], ['/d/c.txt', '/d/f.txt']]
Explanation: There are two separate duplicate groups, and the final list is sorted by each group's first path.
Solution
def solution(fs, root):
from collections import defaultdict
import hashlib
PARTIAL_CHUNK = 8
HASH_CHUNK = 4096
def resolve_path(path):
seen = set()
cur = path
while cur in fs and fs[cur].get('type') == 'symlink':
if cur in seen:
return None
seen.add(cur)
cur = fs[cur].get('target')
if cur not in fs:
return None
return cur
visited_dirs = set()
visited_files = set()
stack = [root]
files = []
while stack:
path = stack.pop()
real = resolve_path(path)
if real is None:
continue
node = fs.get(real)
node_type = node.get('type')
if node_type == 'dir':
if real in visited_dirs:
continue
visited_dirs.add(real)
for child in reversed(node.get('children', [])):
stack.append(child)
elif node_type == 'file':
if real in visited_files:
continue
visited_files.add(real)
files.append((real, node.get('content', '')))
size_groups = defaultdict(list)
for path, content in files:
size_groups[len(content)].append((path, content))
def partial_key(content):
if len(content) <= 2 * PARTIAL_CHUNK:
return (content, '')
return (content[:PARTIAL_CHUNK], content[-PARTIAL_CHUNK:])
def full_hash(content):
h = hashlib.sha256()
for i in range(0, len(content), HASH_CHUNK):
h.update(content[i:i + HASH_CHUNK].encode('utf-8'))
return h.hexdigest()
result = []
for group in size_groups.values():
if len(group) < 2:
continue
partial_groups = defaultdict(list)
for path, content in group:
partial_groups[partial_key(content)].append((path, content))
for candidates in partial_groups.values():
if len(candidates) < 2:
continue
hash_groups = defaultdict(list)
for path, content in candidates:
hash_groups[full_hash(content)].append((path, content))
for same_hash in hash_groups.values():
if len(same_hash) < 2:
continue
exact_groups = defaultdict(list)
for path, content in same_hash:
exact_groups[content].append(path)
for paths in exact_groups.values():
if len(paths) > 1:
result.append(sorted(paths))
result.sort(key=lambda group: (group[0], len(group), group))
return resultTime complexity: O(V + E + C + R log R), where V is the number of reachable nodes, E is the number of traversed references, C is the total length of reachable file contents in the worst case, and R is the total number of paths returned in duplicate groups.. Space complexity: O(V + F), where V is the number of reachable graph nodes tracked in visited sets and F is the number of reachable real files grouped during deduplication..
Hints
- Treat the structure as a graph, not a tree. Keep visited sets for resolved directories and resolved files so symlink cycles do not cause infinite traversal.
- Do not fully compare every file against every other file. First group by size, then by a cheap prefix/suffix signature, and only then compute a full hash for the remaining candidates.