Find Duplicate Files
Company: Anthropic
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: medium
Interview Round: Onsite
Quick Answer: This question evaluates a candidate's skills in file-system traversal, content deduplication strategies, hashing, large-file I/O handling, concurrency, and robustness to errors and file changes during scans.
Constraints
- 0 <= total number of files and directories <= 10^5
- 0 <= total length of all file contents <= 10^6
- File and directory names do not contain '/'
- You may assume SHA-256 collisions do not occur
Examples
Input: {'docs': {'a.txt': ('file', 'hello'), 'b.txt': ('file', 'hello'), 'c.txt': ('file', 'world')}, 'images': {'img1.bin': ('file', '12345'), 'img2.bin': ('file', '12345'), 'img3.bin': ('file', '1234X')}}
Expected Output: [['docs/a.txt', 'docs/b.txt'], ['images/img1.bin', 'images/img2.bin']]
Explanation: There are two duplicate groups. 'world' has the same size as 'hello' but different content, and '1234X' has the same size as '12345' but different content.
Input: {'a.txt': ('file', 'abc'), 'b.txt': ('file', 'abc'), 'c.txt': ('error', 'permission denied'), 'd.txt': ('changed', 'abc', 'abx'), 'sub': {'e.txt': ('file', 'zzz'), 'f.txt': ('file', 'yyy')}}
Expected Output: [['a.txt', 'b.txt']]
Explanation: Unreadable and changed files are skipped. Only 'a.txt' and 'b.txt' remain as duplicates.
Input: {}
Expected Output: []
Explanation: An empty directory tree contains no duplicate files.
Input: {'one.txt': ('file', 'ab'), 'two.txt': ('file', 'cd'), 'sub': {'three.txt': ('file', 'ef')}}
Expected Output: []
Explanation: All files have the same size, but none share identical content.
Input: {'x': {'p.txt': ('file', 'mirror'), 'q.txt': ('file', 'mirror')}, 'y': {'r.txt': ('file', 'mirror'), 's.txt': ('file', 'planet')}}
Expected Output: [['x/p.txt', 'x/q.txt', 'y/r.txt']]
Explanation: Three files contain 'mirror', so they form one duplicate group. 'planet' has the same length but different content.
Input: {'a': ('file', ''), 'b': ('file', ''), 'c': {'d': ('file', 'x')}}
Expected Output: [['a', 'b']]
Explanation: Zero-length files are duplicates if their contents are both empty.
Hints
- A file with a unique size cannot have any duplicates, so you do not need to hash it.
- Use a recursive DFS to build relative paths, then hash only the files in size groups of at least 2.