Implement file deduplication at scale
Company: Anthropic
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: Medium
Interview Round: Onsite
Quick Answer: This question evaluates a candidate's competence in designing and implementing scalable file deduplication systems, covering algorithmic hashing strategies, efficient disk I/O and batching, parallelization across cores or machines, large-file chunking, and safe filesystem operations like hard-linking.
Constraints
- 0 <= len(files) <= 200000
- 0 <= total number of characters across all chunks <= 1000000
- Each `path` is unique
- `mode` is either `'report'` or `'link'`
Examples
Input: ([('/a', 1, 100, ['hello', 'world']), ('/b', 1, 101, ['helloworld']), ('/c', 2, 200, ['hello', 'world']), ('/d', 1, 102, ['hello', 'there'])], 'link')
Expected Output: ([['/a', '/b', '/c']], [('/b', '/a')])
Explanation: Files /a, /b, and /c have identical content even though /a and /b use different chunk boundaries. Only /b can be hard-linked to /a because /c is on a different device.
Input: ([('/e', 1, 10, []), ('/f', 1, 11, []), ('/g', 1, 20, ['x']), ('/h', 1, 20, ['x']), ('/i', 1, 21, ['x'])], 'link')
Expected Output: ([['/e', '/f'], ['/g', '/h', '/i']], [('/f', '/e'), ('/i', '/g')])
Explanation: Empty files /e and /f are duplicates, so /f can link to /e. Files /g, /h, and /i are duplicates; /h already shares /g's inode, so only /i needs a hard-link operation.
Input: ([('/m1', 1, 1, ['PPPPPPPPPPPPPPPP', 'A', 'SSSSSSSSSSSSSSSS']), ('/m2', 1, 2, ['PPPPPPPPPPPPPPPPB', 'SSSSSSSSSSSSSSSS']), ('/m3', 1, 3, ['PPPPPPPPPPPPPPPPASSSSSSSSSSSSSSSS'])], 'report')
Expected Output: ([['/m1', '/m3']], [])
Explanation: All three files have the same size and the same first and last 16 characters, so a cheap partial fingerprint alone would collide. Full hashing and final verification show that only /m1 and /m3 are truly identical.
Input: ([], 'report')
Expected Output: ([], [])
Explanation: With no files, there are no duplicate groups and no link operations.
Input: ([('/z2', 2, 2, ['dup']), ('/z1', 1, 1, ['dup']), ('/z3', 1, 3, ['dup']), ('/z4', 2, 4, ['dup'])], 'link')
Expected Output: ([['/z1', '/z2', '/z3', '/z4']], [('/z3', '/z1'), ('/z4', '/z2')])
Explanation: All four files are duplicates, but hard links must be planned per device. On device 1, /z1 is the canonical target; on device 2, /z2 is the canonical target.
Hints
- Use increasingly expensive filters: size, then a small content-based prefix/suffix fingerprint, then a full streaming hash.
- A matching full hash is not enough for absolute correctness. Compare the actual streamed contents inside a full-hash bucket to handle collisions and different chunk boundaries.