Find and remove duplicate files
Company: Anthropic
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: Medium
Interview Round: Onsite
Quick Answer: This question evaluates the ability to design scalable file deduplication algorithms, specifically testing knowledge of hashing strategies, collision handling, memory and I/O optimization, handling very large files, incremental/resumable operation, and complexity and trade-off analysis.
Constraints
- 0 <= len(files) <= 200000
- Each path is unique
- 0 <= len(content) <= 100000 for each file
- The sum of all content lengths is at most 2000000 in this coding version
Examples
Input: ([('root/a.txt', 'abc'), ('root/b.txt', 'xyz'), ('root/c.txt', 'abc'), ('root/d.txt', 'xyz'), ('root/e.txt', 'p')], True)
Expected Output: ([['root/a.txt', 'root/c.txt'], ['root/b.txt', 'root/d.txt']], ['root/c.txt', 'root/d.txt'])
Explanation: Files root/a.txt and root/c.txt match exactly, and root/b.txt and root/d.txt match exactly. root/e.txt is unique.
Input: ([('a', 'cat'), ('b', 'cat'), ('c', 'dog')], False)
Expected Output: ([['a', 'b']], [])
Explanation: The duplicate group is still reported, but remove is False so no path is marked for deletion.
Input: ([('a', ''), ('b', ''), ('c', 'x')], True)
Expected Output: ([['a', 'b']], ['b'])
Explanation: Empty files are valid content, so a and b are duplicates and b is removed because a is lexicographically smaller.
Input: ([('p1', 'abcd1234wxyz'), ('p2', 'abcdZZZZwxyz'), ('p3', 'abcd1234wxyz')], True)
Expected Output: ([['p1', 'p3']], ['p3'])
Explanation: All three files have the same size and the same prefix/suffix, so a partial fingerprint alone is not enough. Only p1 and p3 are exact matches after full hash and exact verification.
Input: ([('z', 'm'), ('a', 'm'), ('b', 'm'), ('x', 'n')], True)
Expected Output: ([['a', 'b', 'z']], ['b', 'z'])
Explanation: Three files have the same content. The lexicographically smallest path a is kept, and b and z are removable.
Input: ([('only', 'data')], True)
Expected Output: ([], [])
Explanation: A single file cannot form a duplicate group.
Solution
def solution(files, remove):
from collections import defaultdict
import hashlib
CHUNK = 4
size_groups = defaultdict(list)
for path, content in files:
size_groups[len(content)].append((path, content))
duplicate_groups = []
for candidates in size_groups.values():
if len(candidates) < 2:
continue
partial_groups = defaultdict(list)
for path, content in candidates:
if len(content) <= 2 * CHUNK:
partial = (content, "")
else:
partial = (content[:CHUNK], content[-CHUNK:])
partial_groups[partial].append((path, content))
for partial_bucket in partial_groups.values():
if len(partial_bucket) < 2:
continue
full_hash_groups = defaultdict(list)
for path, content in partial_bucket:
digest = hashlib.sha256(content.encode("utf-8")).digest()
full_hash_groups[digest].append((path, content))
for hash_bucket in full_hash_groups.values():
if len(hash_bucket) < 2:
continue
exact_groups = defaultdict(list)
for path, content in hash_bucket:
exact_groups[content].append(path)
for paths in exact_groups.values():
if len(paths) >= 2:
paths.sort()
duplicate_groups.append(paths)
duplicate_groups.sort(key=lambda group: (group[0], len(group), group))
removed_paths = []
if remove:
for group in duplicate_groups:
removed_paths.extend(group[1:])
removed_paths.sort()
return duplicate_groups, removed_pathsTime complexity: O(T + n log n), where T is the total number of characters across all file contents; the log factor comes from sorting duplicate groups and removable paths.. Space complexity: O(n) extra space, excluding the input contents..
Hints
- Start by grouping files by length; files with different sizes can never be duplicates.
- Use a multi-stage fingerprint: cheap prefix/suffix key first, full hash next, and exact content equality last to protect against collisions.