Find Duplicate Files by Content
Company: Anthropic
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: Hard
Interview Round: Onsite
Quick Answer: This question tests whether you can parse directory description strings, group full file paths by identical content using a hash map, and return only the groups that contain two or more files, in sorted order.
Constraints
- 1 <= len(descriptions) <= 10^4
- The total number of files across all descriptions is at most 10^5
- Directory names and file names do not contain spaces
- File content does not contain parentheses
Examples
Input: (['/data/a 1.txt(abcd) 2.txt(efgh)', '/data/b 3.txt(abcd)', '/data/c 4.txt(efgh) 5.txt(xyz)'],)
Expected Output: [['/data/a/1.txt', '/data/b/3.txt'], ['/data/a/2.txt', '/data/c/4.txt']]
Explanation: Files with content 'abcd' are '/data/a/1.txt' and '/data/b/3.txt'. Files with content 'efgh' are '/data/a/2.txt' and '/data/c/4.txt'. 'xyz' appears only once, so it is excluded.
Input: (['/docs a.txt(hello) b.txt(world)'],)
Expected Output: []
Explanation: Both files have unique contents, so there are no duplicate groups.
Input: (['/root a.txt(z) b.txt(z) c.txt(y)', '/other d.txt(y) e.txt(z)'],)
Expected Output: [['/other/d.txt', '/root/c.txt'], ['/other/e.txt', '/root/a.txt', '/root/b.txt']]
Explanation: Content 'y' appears in '/root/c.txt' and '/other/d.txt'. Content 'z' appears in '/root/a.txt', '/root/b.txt', and '/other/e.txt'. Each group is sorted, then the groups are sorted by their first path.
Input: (['/a 1.txt(x)', '/b 2.txt(x)', '/c 3.txt(x)', '/d 4.txt(y)'],)
Expected Output: [['/a/1.txt', '/b/2.txt', '/c/3.txt']]
Explanation: Content 'x' appears in three files, so they form one duplicate group. Content 'y' appears only once and is ignored.
Hints
- A hash map from file content to a list of full file paths lets you collect duplicates efficiently.
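- For each file token like '1.txt(abcd)', find the first '(' to split the file name from its content, then drop the trailing ')' from the content. Because content never contains parentheses, that first '(' is the only one in the token.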
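Taken together, the hints suggest a single pass that fills a hash map, followed by filtering and sorting. The sketch below is one possible way to write that in Python, not a reference solution. The function name find_duplicates is hypothetical, and the parser assumes each description has the form "directory name(content) name(content) ..." as in the examples. It scans for parentheses instead of splitting on spaces, since the constraints do not rule out spaces inside file content.

```python
from collections import defaultdict
from typing import Dict, List


def find_duplicates(descriptions: List[str]) -> List[List[str]]:
    """Group full file paths by content; keep groups with 2+ files.

    Hypothetical helper: the name and signature are assumptions made
    for illustration, not part of the original problem statement.
    """
    paths_by_content: Dict[str, List[str]] = defaultdict(list)

    for description in descriptions:
        # Directory names contain no spaces, so the first space ends the directory.
        directory, _, files = description.partition(" ")
        pos = 0
        while True:
            open_idx = files.find("(", pos)
            if open_idx == -1:
                break  # no more file tokens
            close_idx = files.find(")", open_idx)
            if close_idx == -1:
                break  # malformed token; assumed not to occur in valid input
            name = files[pos:open_idx].strip()
            content = files[open_idx + 1:close_idx]
            paths_by_content[content].append(f"{directory}/{name}")
            pos = close_idx + 1

    # Keep only real duplicates, sort each group, then order groups by first path.
    groups = [sorted(paths) for paths in paths_by_content.values() if len(paths) > 1]
    groups.sort(key=lambda group: group[0])
    return groups


if __name__ == "__main__":
    print(find_duplicates(["/a 1.txt(x)", "/b 2.txt(x)", "/c 3.txt(x)", "/d 4.txt(y)"]))
```

Under these assumptions the map is built in time linear in the total input length, and the sorting step dominates at O(F log F) path comparisons for F files, which should be comfortable within the stated limit of 10^5 files.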