Implement file deduplication at scale
Company: Anthropic
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: Medium
Interview Round: Onsite
Quick Answer: This question evaluates a candidate's ability to design and implement scalable file deduplication tools, covering competencies in file-system and OS APIs, efficient hashing and large-file handling, I/O and memory trade-offs, and handling filesystem semantics such as symlinks, permissions, and error cases.
Constraints
- 0 <= len(entries) <= 100000
- 0 <= prefix_bytes <= 4096
- 1 <= chunk_size <= 4096
- The total number of bytes across readable regular files is at most 10^7
- For valid readable file records, 'content' is a bytes object and 'path' values are unique
Examples
Input: ([{'path': '/a/report.txt', 'type': 'file', 'readable': True, 'content': b'hello world'}, {'path': '/b/copy.txt', 'type': 'file', 'readable': True, 'content': b'hello world'}, {'path': '/c/other.txt', 'type': 'file', 'readable': True, 'content': b'HELLO WORLD'}, {'path': '/d/link', 'type': 'symlink', 'readable': True, 'content': b'hello world'}], 4, 3)
Expected Output: [['/a/report.txt', '/b/copy.txt']]
Explanation: The two readable regular files with identical content are duplicates. The symlink is ignored, and '/c/other.txt' differs by content.
Input: ([{'path': '/x/a.bin', 'type': 'file', 'readable': True, 'content': b'abcdef12'}, {'path': '/x/b.bin', 'type': 'file', 'readable': True, 'content': b'abcdef34'}, {'path': '/x/c.bin', 'type': 'file', 'readable': True, 'content': b'abcdef56'}, {'path': '/x/private.bin', 'type': 'file', 'readable': False, 'content': b'abcdef12'}], 6, 2)
Expected Output: []
Explanation: All readable files have the same size and first 6 bytes, but their full contents differ, so no duplicate group should be reported.
Input: ([{'path': '/empty/one', 'type': 'file', 'readable': True, 'content': b''}, {'path': '/empty/two', 'type': 'file', 'readable': True, 'content': b''}, {'path': '/empty/skip', 'type': 'file', 'readable': False, 'content': b''}, {'path': '/empty/dir', 'type': 'dir', 'readable': True, 'content': b''}], 8, 4)
Expected Output: [['/empty/one', '/empty/two']]
Explanation: Empty readable regular files are duplicates of each other. The unreadable file and directory are ignored.
Input: ([{'path': '/z/f1.txt', 'type': 'file', 'readable': True, 'content': b'same1'}, {'path': '/a/f2.txt', 'type': 'file', 'readable': True, 'content': b'same1'}, {'path': '/m/f3.txt', 'type': 'file', 'readable': True, 'content': b'data'}, {'path': '/b/f4.txt', 'type': 'file', 'readable': True, 'content': b'data'}, {'path': '/c/f5.txt', 'type': 'file', 'readable': True, 'content': b'unique'}], 2, 2)
Expected Output: [['/a/f2.txt', '/z/f1.txt'], ['/b/f4.txt', '/m/f3.txt']]
Explanation: There are two duplicate groups: one for content 'same1' and one for content 'data'. Paths inside groups and the groups themselves are sorted lexicographically.
Input: ([], 8, 4)
Expected Output: []
Explanation: With no entries, there are no duplicates.
Hints
- Files with different sizes can never be duplicates, so size is a very strong first filter.
- Inside a candidate bucket, use a hash map from digest to paths so you avoid O(k^2) pairwise comparisons.