
Write a program to deduplicate files in a very large directory tree, identifying groups of identical files without loading entire files into memory. Outline your approach to hashing (e.g., size filter, partial hash, full hash), chunked reads for files too big to hash in one pass, and handling hash collisions. Support a mode that replaces duplicates with hard links (when safe) and a dry-run report of duplicate sets. Explain the time and space complexity, how you batch disk I/O, and how you would parallelize the work across CPU cores or machines.
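
As a starting point, here is a minimal Python sketch of the staged filter: group by size, then by a hash of the first few KiB, then by a full streaming hash, so no file is ever read into memory whole. The SHA-256 choice, the 4 KiB partial-hash size, and helper names such as `walk_files` and `duplicate_sets` are illustrative assumptions, not requirements.

```python
import os
import hashlib
from collections import defaultdict

PARTIAL = 4096      # bytes hashed in the cheap second stage (assumed size)
CHUNK = 1 << 20     # 1 MiB read size for the full streaming hash

def walk_files(root):
    """Yield regular-file paths under root, skipping symlinks."""
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                yield path

def digest(path, limit=None):
    """SHA-256 of the first `limit` bytes (or the whole file), read in chunks."""
    h = hashlib.sha256()
    remaining = limit
    with open(path, "rb") as f:
        while True:
            n = CHUNK if remaining is None else min(CHUNK, remaining)
            block = f.read(n)
            if not block:
                break
            h.update(block)
            if remaining is not None:
                remaining -= len(block)
                if remaining == 0:
                    break
    return h.hexdigest()

def duplicate_sets(root):
    """Return lists of paths whose contents are (very likely) identical."""
    by_size = defaultdict(list)
    for path in walk_files(root):
        by_size[os.path.getsize(path)].append(path)

    results = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue                          # unique size => unique file
        by_partial = defaultdict(list)
        for p in paths:                       # stage 2: hash only a prefix
            by_partial[digest(p, PARTIAL)].append(p)
        for group in by_partial.values():
            if len(group) < 2:
                continue
            by_full = defaultdict(list)
            for p in group:                   # stage 3: full streaming hash
                by_full[digest(p)].append(p)
            results.extend(g for g in by_full.values() if len(g) > 1)
    return results
```

Each stage only runs on files that survived the previous one, so the vast majority of files are never fully read; space usage is dominated by the path-to-group maps, not file contents.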
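Because even a full-hash match is only probabilistic, one hedge against collisions is a final byte-by-byte confirmation within each hash-equal group. A short sketch using the standard library's `filecmp` (which compares both files in fixed-size chunks) might look like the following; the `confirm_group` name is hypothetical.

```python
import filecmp

def confirm_group(paths):
    """Split a hash-equal group into byte-identical subgroups."""
    confirmed = []
    for p in paths:
        for sub in confirmed:
            # shallow=False forces a chunked byte-by-byte content comparison
            if filecmp.cmp(sub[0], p, shallow=False):
                sub.append(p)
                break
        else:
            confirmed.append([p])
    return [s for s in confirmed if len(s) > 1]
```

In practice many tools skip this step and accept the negligible SHA-256 collision risk; the trade-off is worth stating explicitly either way.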
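For the hard-link mode, a cautious sketch might check that a duplicate lives on the same device as the kept file (hard links cannot cross filesystems), skip files that already share an inode, and stage the new link under a temporary name so the final swap is atomic. The `.dedup-tmp` suffix and the `link_duplicates` name are assumptions for illustration.

```python
import os

def link_duplicates(group, dry_run=True):
    """Replace duplicates with hard links to the first path in the group."""
    keep = group[0]
    keep_stat = os.stat(keep)
    for dup in group[1:]:
        st = os.stat(dup)
        if st.st_dev != keep_stat.st_dev:
            continue                      # hard links cannot cross filesystems
        if st.st_ino == keep_stat.st_ino:
            continue                      # already the same inode; nothing to do
        if dry_run:
            print(f"would link {dup} -> {keep}")
            continue
        tmp = dup + ".dedup-tmp"          # hypothetical suffix for the staging link
        os.link(keep, tmp)                # create the new link first...
        os.replace(tmp, dup)              # ...then atomically swap it into place
```

Note that hard-linked paths share one inode, so ownership, permissions, and timestamps merge after linking; a production tool would want to compare that metadata before deciding a replacement is "safe".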
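Hashing parallelizes naturally because files are independent. A sketch with a process pool (one worker per core by default) could reuse the `digest` helper from the first sketch, assuming it is defined at module top level so it can be pickled.

```python
from concurrent.futures import ProcessPoolExecutor

def parallel_digests(paths, workers=None):
    """Map path -> full-file digest, hashing files concurrently.

    Assumes digest() is the streaming SHA-256 helper sketched earlier,
    defined at module top level so worker processes can import it.
    """
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # chunksize batches task dispatch to cut inter-process overhead
        return dict(zip(paths, pool.map(digest, paths, chunksize=64)))
```

Since CPython's `hashlib` releases the GIL while digesting large buffers, a `ThreadPoolExecutor` is a reasonable alternative that avoids pickling and suits I/O-bound workloads; across machines, the same map shape works with paths sharded by size bucket so each worker sees whole candidate groups.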