Detect duplicate files by content
Company: Anthropic
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: Medium
Interview Round: Onsite
Quick Answer: This question evaluates filesystem traversal, efficient I/O and memory use, hashing strategies for content comparison, duplicate-detection algorithms, and time/space complexity analysis.
Constraints
- 0 <= len(paths) == len(contents) == len(readable) == len(is_symlink) <= 10000
- Each paths[i] is unique.
- contents[i] is a string representing raw bytes; every character has code point 0 through 255.
- 0 <= len(contents[i]) <= 1000000
- The total content length in the test data is at most 5000000.
- Files with readable[i] == False or is_symlink[i] == True must be ignored.
Examples
Input: (['/root/a.txt', '/root/b.txt', '/root/c.txt', '/root/sub/d.txt', '/root/e.bin'], ['hello', 'world', 'hello', 'hello', 'hello!'], [True, True, True, True, True], [False, False, False, False, False])
Expected Output: [['/root/a.txt', '/root/c.txt', '/root/sub/d.txt']]
Explanation: Three readable regular files contain exactly 'hello'. The file containing 'world' differs, and 'hello!' has a different size.
Input: (['/a/empty1', '/b/empty2', '/a/img1', '/b/img2', '/c/text'], ['', '', 'ABC', 'ABC', 'ABD'], [True, True, True, True, True], [False, False, False, False, False])
Expected Output: [['/a/empty1', '/b/empty2'], ['/a/img1', '/b/img2']]
Explanation: The two empty files are duplicates, and the two files containing 'ABC' are duplicates. 'ABD' has the same size as 'ABC' but different content.
Input: (['/readable/original', '/secret/copy', '/link/copy', '/readable/other'], ['data', 'data', 'data', 'other'], [True, False, True, True], [False, False, True, False])
Expected Output: []
Explanation: The unreadable file and symbolic link are ignored, leaving no duplicate group among readable regular files.
Input: (['/x/one', '/x/two', '/x/three', '/x/four'], ['ab', 'ba', 'cd', 'dc'], [True, True, True, True], [False, False, False, False])
Expected Output: []
Explanation: All files have the same size, but none have identical content.
Input: ([], [], [], [])
Expected Output: []
Explanation: An empty directory scan contains no duplicate files.
Solution
def solution(paths, contents, readable, is_symlink):
import hashlib
import zlib
CHUNK_SIZE = 4096
def iter_encoded_chunks(data):
for start in range(0, len(data), CHUNK_SIZE):
yield data[start:start + CHUNK_SIZE].encode('latin1')
def fast_hash(data):
value = 0
for chunk in iter_encoded_chunks(data):
value = zlib.crc32(chunk, value)
return value
def crypto_hash(data):
h = hashlib.sha256()
for chunk in iter_encoded_chunks(data):
h.update(chunk)
return h.digest()
def same_content(a, b):
if len(a) != len(b):
return False
for start in range(0, len(a), CHUNK_SIZE):
if a[start:start + CHUNK_SIZE] != b[start:start + CHUNK_SIZE]:
return False
return True
by_size = {}
for i in range(len(paths)):
if not readable[i] or is_symlink[i]:
continue
size = len(contents[i])
by_size.setdefault(size, []).append((paths[i], contents[i]))
answer = []
for size_group in by_size.values():
if len(size_group) < 2:
continue
by_fast_hash = {}
for path, data in size_group:
key = fast_hash(data)
by_fast_hash.setdefault(key, []).append((path, data))
for fast_group in by_fast_hash.values():
if len(fast_group) < 2:
continue
by_crypto_hash = {}
for path, data in fast_group:
key = crypto_hash(data)
by_crypto_hash.setdefault(key, []).append((path, data))
for hash_group in by_crypto_hash.values():
if len(hash_group) < 2:
continue
exact_groups = []
for path, data in hash_group:
placed = False
for group in exact_groups:
representative, group_paths = group
if same_content(data, representative):
group_paths.append(path)
placed = True
break
if not placed:
exact_groups.append([data, [path]])
for representative, group_paths in exact_groups:
if len(group_paths) >= 2:
answer.append(sorted(group_paths))
answer.sort(key=lambda group: group[0])
return answerTime complexity: O(n + B) expected, where n is the number of entries and B is the total number of bytes streamed from readable non-symlink files that survive size filtering. Candidate files may be streamed a constant number of times for fast hashing, cryptographic hashing, and final verification.. Space complexity: O(n + R) auxiliary space, where n is the number of entries and R is the size of the returned duplicate groups. This excludes the input contents..
Hints
- Files with different sizes cannot have identical byte content, so use file size as the first grouping key.
- Hashes are useful for reducing comparisons, but do not trust hashes alone; verify candidate duplicates byte-by-byte against a representative.