Detect duplicate files efficiently
Company: Anthropic
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: Medium
Interview Round: Onsite
Quick Answer: This question evaluates understanding of scalable file-deduplication and system-design concepts, including content hashing, I/O minimization, hash-collision handling, memory constraints, parallelization, incremental updates, and cross-machine deduplication.
Constraints
- 1 <= len(operations) <= 2 * 10^5
- 0 <= len(content) <= 10^5 for a single ADD/MODIFY operation
- The sum of len(content) over all ADD/MODIFY operations is <= 10^6
- 1 <= len(path) <= 200; ADD/MODIFY replace existing content at that path, and DELETE on a missing path has no effect
Examples
Input: ([('ADD', '/a.txt', 'hello'), ('ADD', '/b.txt', 'world'), ('ADD', '/c.txt', 'hello'), ('QUERY',), ('MODIFY', '/b.txt', 'hello'), ('QUERY',)],)
Expected Output: [[['/a.txt', '/c.txt']], [['/a.txt', '/b.txt', '/c.txt']]]
Explanation: The first QUERY sees '/a.txt' and '/c.txt' with identical contents. After modifying '/b.txt' to 'hello', all three files are duplicates.
Input: ([('ADD', '/x/1', 'aa'), ('ADD', '/x/2', 'bb'), ('ADD', '/y/3', 'aa'), ('ADD', '/z/4', 'bb'), ('QUERY',), ('DELETE', '/x/2'), ('QUERY',)],)
Expected Output: [[['/x/1', '/y/3'], ['/x/2', '/z/4']], [['/x/1', '/y/3']]]
Explanation: Initially there are two duplicate groups: the 'aa' files and the 'bb' files. After deleting '/x/2', only the 'aa' group remains.
Input: ([('QUERY',), ('ADD', '/empty', ''), ('ADD', '/also_empty', ''), ('ADD', '/unique', 'x'), ('QUERY',), ('MODIFY', '/also_empty', 'x'), ('QUERY',)],)
Expected Output: [[], [['/also_empty', '/empty']], [['/also_empty', '/unique']]]
Explanation: The first QUERY runs on an empty index. Then two empty files become duplicates. After '/also_empty' is changed to 'x', it becomes a duplicate of '/unique' instead.
Input: ([('ADD', '/a', 'abc'), ('MODIFY', '/b', 'abc'), ('ADD', '/a', 'def'), ('DELETE', '/missing'), ('QUERY',)],)
Expected Output: [[]]
Explanation: MODIFY acts like an upsert, so '/b' is created with 'abc'. Then '/a' is replaced with 'def'. Deleting a missing path does nothing, so no duplicate group exists at QUERY time.
Input: ([('ADD', '/p1', 'abcdefgh12345678X87654321'), ('ADD', '/p2', 'abcdefghABCDEFGHX87654321'), ('ADD', '/p3', 'abcdefgh12345678X87654321'), ('QUERY',)],)
Expected Output: [[['/p1', '/p3']]]
Explanation: Files '/p1' and '/p2' have the same length and the same beginning and ending segments, so a partial-signature filter would keep them as candidates. Exact content still shows only '/p1' and '/p3' are true duplicates.
Solution
def solution(operations):
import hashlib
def make_candidate(content):
size = len(content)
head = content[:8]
tail = content[-8:]
full_hash = hashlib.sha256(content.encode('utf-8')).hexdigest()
return (size, head, tail, full_hash)
path_meta = {}
candidate_buckets = {}
duplicate_ids = set()
def remove(path):
if path not in path_meta:
return
cand, content = path_meta.pop(path)
exact_map = candidate_buckets[cand]
group = exact_map[content]
group.remove(path)
exact_id = (cand, content)
if len(group) >= 2:
duplicate_ids.add(exact_id)
else:
duplicate_ids.discard(exact_id)
if not group:
del exact_map[content]
duplicate_ids.discard(exact_id)
if not exact_map:
del candidate_buckets[cand]
def add_or_replace(path, content):
remove(path)
cand = make_candidate(content)
exact_map = candidate_buckets.setdefault(cand, {})
group = exact_map.setdefault(content, set())
group.add(path)
exact_id = (cand, content)
if len(group) >= 2:
duplicate_ids.add(exact_id)
path_meta[path] = (cand, content)
answers = []
for op in operations:
kind = op[0]
if kind == 'ADD' or kind == 'MODIFY':
_, path, content = op
add_or_replace(path, content)
elif kind == 'DELETE':
_, path = op
remove(path)
elif kind == 'QUERY':
groups = []
for cand, content in duplicate_ids:
groups.append(sorted(candidate_buckets[cand][content]))
groups.sort(key=lambda group: (group[0], group))
answers.append(groups)
else:
raise ValueError(f'Unknown operation: {kind}')
return answersTime complexity: ADD/MODIFY: O(L) average, DELETE: O(1) average, QUERY: O(G log G + P log P), where L is content length, G is the number of duplicate groups, and P is the total number of paths returned in that query.. Space complexity: O(F + C), where F is the number of indexed files and C is the total stored content length..
Hints
- Keep a forward map from path to its current bucket so an update can remove the old version in O(1) average time before reinserting the new one.
- Use a reverse index keyed by size, partial signature, full hash, and exact content. Also track which exact-content buckets currently have at least two paths so QUERY does not need to rescan every stored file.