Implement file deduplication at scale
Company: Anthropic
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: Medium
Interview Round: Onsite
Quick Answer: This question tests whether a candidate can design and implement file deduplication at scale. It covers hashing strategies, efficient disk I/O and batching, parallelization across cores or machines, chunking of large files, and safe filesystem operations such as hard-linking.
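Two of the topics named above, hashing a large file in chunks and hard-linking safely, can be sketched in Python as follows. This is an illustrative sketch rather than an expected answer: the function names, the 1 MiB read size, the choice of SHA-256, and the `.dedup-tmp` suffix are all assumptions made for the example.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Hash a file in fixed-size reads so memory use stays bounded."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk_size)
            if not block:
                break
            h.update(block)
    return h.hexdigest()


def replace_with_hardlink(duplicate, canonical):
    """Make `duplicate` a hard link to `canonical` without a window
    in which `duplicate` does not exist."""
    if os.path.samefile(duplicate, canonical):
        return  # already the same inode, nothing to do
    tmp = duplicate + ".dedup-tmp"  # hypothetical temp name; must not already exist
    os.link(canonical, tmp)         # raises OSError across devices
    os.replace(tmp, duplicate)      # atomic rename over the duplicate
```

Linking to a temporary name first and then renaming means a crash leaves either the original duplicate or the finished link in place, never a missing file.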
Constraints
- 0 <= len(files) <= 200000
- 0 <= total number of characters across all chunks <= 1000000
- Each `path` is unique
- `mode` is either `'report'` or `'link'`
Examples
Input: ([('/a', 1, 100, ['hello', 'world']), ('/b', 1, 101, ['helloworld']), ('/c', 2, 200, ['hello', 'world']), ('/d', 1, 102, ['hello', 'there'])], 'link')
Expected Output: ([['/a', '/b', '/c']], [('/b', '/a')])
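The example above can be reproduced with a minimal sketch like the one below. It is not a reference solution and rests on several assumptions that the page does not state: each tuple is read as `(path, device_id, inode, chunks)`; a file's content is the concatenation of its chunks; only groups with at least two paths are reported; in `'link'` mode each duplicate is paired as `(duplicate, canonical)` with the first path on the same device, because hard links cannot cross devices; and `'report'` mode returns no link pairs. The function name `dedupe` and the sorted output order are illustrative choices.

```python
import hashlib
from collections import defaultdict


def dedupe(files, mode):
    # Assumed field order: (path, device_id, inode, chunks).
    by_digest = defaultdict(list)
    for path, dev, inode, chunks in files:
        h = hashlib.sha256()
        for chunk in chunks:
            # Feeding chunks one by one hashes their concatenation,
            # so chunk boundaries do not affect the digest.
            h.update(chunk.encode("utf-8"))
        by_digest[h.digest()].append((path, dev, inode))

    groups, links = [], []
    for members in by_digest.values():
        if len(members) < 2:
            continue  # unique content, nothing to report
        members.sort()
        groups.append([path for path, _, _ in members])
        if mode != "link":
            continue
        canonical = {}  # device_id -> (path, inode) of the first file on that device
        for path, dev, inode in members:
            if dev not in canonical:
                canonical[dev] = (path, inode)
            elif canonical[dev][1] != inode:
                # Same device, different inode: a hard link would save space.
                links.append((path, canonical[dev][0]))
    groups.sort()
    links.sort()
    return groups, links
```

Under these assumptions, `/a`, `/b`, and `/c` all hash to the content `helloworld`, `/b` is linked to `/a` on device 1, and `/c` is left alone because it is the only copy on device 2. The sketch treats equal SHA-256 digests as equal content; a byte-by-byte comparison before linking would remove that assumption.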