Implement file deduplication at scale
Company: Anthropic
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: Medium
Interview Round: Onsite
Quick Answer: This question evaluates a candidate's ability to design and implement a scalable file-deduplication tool. It covers file-system and OS APIs, efficient hashing and large-file handling, I/O and memory trade-offs, and filesystem semantics such as symlinks, permissions, and error cases.
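The problem below models files as in-memory records, but the same size-first filter translates directly to a live filesystem. As a rough sketch of that translation (the function name, root argument, and error policy are illustrative assumptions, not part of the problem statement):

import os
import stat
from collections import defaultdict

def group_regular_files_by_size(root):
    # Walk root and bucket readable regular files by size. Buckets with
    # fewer than two paths can never hold duplicates, so the expensive
    # later stages only need to look at multi-file buckets.
    by_size = defaultdict(list)
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)              # lstat: never follow symlinks
            except OSError:
                continue                         # vanished or permission denied
            if not stat.S_ISREG(st.st_mode):     # skip symlinks, FIFOs, devices
                continue
            if not os.access(path, os.R_OK):     # mirrors the 'readable' flag
                continue
            by_size[st.st_size].append(path)
    return by_size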
Constraints
- 0 <= len(entries) <= 100000
- 0 <= prefix_bytes <= 4096
- 1 <= chunk_size <= 4096
- The total number of bytes across readable regular files is at most 10^7
- For valid readable file records, 'content' is a bytes object and 'path' values are unique
Examples
Input: ([{'path': '/a/report.txt', 'type': 'file', 'readable': True, 'content': b'hello world'}, {'path': '/b/copy.txt', 'type': 'file', 'readable': True, 'content': b'hello world'}, {'path': '/c/other.txt', 'type': 'file', 'readable': True, 'content': b'HELLO WORLD'}, {'path': '/d/link', 'type': 'symlink', 'readable': True, 'content': b'hello world'}], 4, 3)
Expected Output: [['/a/report.txt', '/b/copy.txt']]
Explanation: The two readable regular files with identical content are duplicates. The symlink is ignored, and '/c/other.txt' differs by content.
Input: ([{'path': '/x/a.bin', 'type': 'file', 'readable': True, 'content': b'abcdef12'}, {'path': '/x/b.bin', 'type': 'file', 'readable': True, 'content': b'abcdef34'}, {'path': '/x/c.bin', 'type': 'file', 'readable': True, 'content': b'abcdef56'}, {'path': '/x/private.bin', 'type': 'file', 'readable': False, 'content': b'abcdef12'}], 6, 2)
Expected Output: []
Explanation: All readable files have the same size and first 6 bytes, but their full contents differ, so no duplicate group should be reported.
Input: ([{'path': '/empty/one', 'type': 'file', 'readable': True, 'content': b''}, {'path': '/empty/two', 'type': 'file', 'readable': True, 'content': b''}, {'path': '/empty/skip', 'type': 'file', 'readable': False, 'content': b''}, {'path': '/empty/dir', 'type': 'dir', 'readable': True, 'content': b''}], 8, 4)
Expected Output: [['/empty/one', '/empty/two']]
Explanation: Empty readable regular files are duplicates of each other. The unreadable file and directory are ignored.
Input: ([{'path': '/z/f1.txt', 'type': 'file', 'readable': True, 'content': b'same1'}, {'path': '/a/f2.txt', 'type': 'file', 'readable': True, 'content': b'same1'}, {'path': '/m/f3.txt', 'type': 'file', 'readable': True, 'content': b'data'}, {'path': '/b/f4.txt', 'type': 'file', 'readable': True, 'content': b'data'}, {'path': '/c/f5.txt', 'type': 'file', 'readable': True, 'content': b'unique'}], 2, 2)
Expected Output: [['/a/f2.txt', '/z/f1.txt'], ['/b/f4.txt', '/m/f3.txt']]
Explanation: There are two duplicate groups: one for content 'same1' and one for content 'data'. Paths inside groups and the groups themselves are sorted lexicographically.
Input: ([], 8, 4)
Expected Output: []
Explanation: With no entries, there are no duplicates.
Solution
def solution(entries, prefix_bytes, chunk_size):
    from collections import defaultdict
    import hashlib
    if prefix_bytes < 0 or chunk_size <= 0:
        raise ValueError('prefix_bytes must be >= 0 and chunk_size must be > 0')

    def digest_bytes(data):
        # Hash in chunk_size pieces; memoryview slices avoid copying bytes.
        hasher = hashlib.sha256()
        view = memoryview(data)
        for i in range(0, len(data), chunk_size):
            hasher.update(view[i:i + chunk_size])
        return hasher.hexdigest()

    # Stage 1: keep only valid, readable regular-file records, bucketed by size.
    size_groups = defaultdict(list)
    for entry in entries:
        if not isinstance(entry, dict):
            continue
        if entry.get('type') != 'file' or not entry.get('readable', False):
            continue
        path = entry.get('path')
        content = entry.get('content')
        if path is None or not isinstance(content, (bytes, bytearray)):
            continue
        if isinstance(content, bytearray):
            content = bytes(content)
        size_groups[len(content)].append((path, content))

    duplicates = []
    for files_same_size in size_groups.values():
        if len(files_same_size) < 2:
            continue
        # Stage 2: split each size bucket by the first prefix_bytes bytes.
        prefix_groups = defaultdict(list)
        for path, content in files_same_size:
            prefix_groups[content[:prefix_bytes]].append((path, content))
        for files_same_prefix in prefix_groups.values():
            if len(files_same_prefix) < 2:
                continue
            # Stage 3: full SHA-256 digest over the surviving candidates.
            hash_groups = defaultdict(list)
            for path, content in files_same_prefix:
                digest = digest_bytes(content)
                hash_groups[digest].append((path, content))
            for files_same_hash in hash_groups.values():
                if len(files_same_hash) < 2:
                    continue
                # Stage 4: byte-for-byte grouping guards against hash collisions.
                exact_groups = defaultdict(list)
                for path, content in files_same_hash:
                    exact_groups[content].append(path)
                for paths in exact_groups.values():
                    if len(paths) > 1:
                        paths.sort()
                        duplicates.append(paths)
    duplicates.sort(key=lambda group: group[0])
    return duplicates

Time complexity: O(B + n log n) worst-case, where n is the number of entries and B is the total number of bytes across candidate files; hashing, prefix slicing, and the final exact comparison are all O(B), and sorting paths and groups contributes the n log n term. Space complexity: O(n) grouping structures plus the prefix-key slices, which total at most min(B, n * prefix_bytes) bytes.
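A quick sanity check of solution against the first and last examples above:

if __name__ == '__main__':
    entries = [
        {'path': '/a/report.txt', 'type': 'file', 'readable': True, 'content': b'hello world'},
        {'path': '/b/copy.txt', 'type': 'file', 'readable': True, 'content': b'hello world'},
        {'path': '/c/other.txt', 'type': 'file', 'readable': True, 'content': b'HELLO WORLD'},
        {'path': '/d/link', 'type': 'symlink', 'readable': True, 'content': b'hello world'},
    ]
    # Only the two identical readable regular files form a group.
    assert solution(entries, 4, 3) == [['/a/report.txt', '/b/copy.txt']]
    # No entries means no duplicates.
    assert solution([], 8, 4) == []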
Hints
- Files with different sizes can never be duplicates, so size is a very strong first filter.
- Inside a candidate bucket, use a hash map from digest to paths so you avoid O(k^2) pairwise comparisons; a streaming digest helper is sketched after this list.
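On a real filesystem, the digest in the second hint would be computed by streaming each file rather than loading it whole, keeping memory bounded regardless of file size. A minimal sketch (the helper name and default chunk size are illustrative):

import hashlib

def stream_digest(path, chunk_size=1 << 16):
    # Read and hash the file in fixed-size chunks so peak memory
    # stays O(chunk_size) even for multi-gigabyte files.
    hasher = hashlib.sha256()
    with open(path, 'rb') as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            hasher.update(chunk)
    return hasher.hexdigest()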