Implement a file deduplication tool.
You are given a root directory containing many files. Return groups of duplicate files. Two files are duplicates if they have identical content.
Your solution should use file size and content hashing to avoid unnecessary full-file comparisons.
Requirements:
- Recursively scan the directory.
- Group candidate files by file size.
- For files with the same size, compute content hashes to identify duplicates.
- Return only groups containing at least two duplicate files.
- Handle large files without loading entire files into memory at once.
- Handle errors such as permission failures or files changing during the scan (a minimal sketch covering these points follows this list).
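One possible shape for a solution, sketched in Python (names such as `hash_file` and `find_duplicates` are illustrative, not part of the problem statement): walk the tree, bucket paths by size, and hash only the size collisions with chunked SHA-256 so memory use stays bounded; unreadable or vanished files are skipped with a warning rather than aborting the scan.

```python
import hashlib
import os
import sys
from collections import defaultdict

CHUNK_SIZE = 1024 * 1024  # read files in 1 MiB chunks to bound memory use


def hash_file(path, chunk_size=CHUNK_SIZE):
    """Stream a file through SHA-256 so large files never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Return lists of paths with identical content, grouping by size first."""
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError as exc:  # permission denied, broken symlink, file removed
                print(f"skipping {path}: {exc}", file=sys.stderr)
                continue
            by_size[size].append(path)

    duplicates = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have duplicates; skip hashing entirely
        by_hash = defaultdict(list)
        for path in paths:
            try:
                by_hash[hash_file(path)].append(path)
            except OSError as exc:  # file changed or became unreadable mid-scan
                print(f"skipping {path}: {exc}", file=sys.stderr)
        duplicates.extend(group for group in by_hash.values() if len(group) > 1)
    return duplicates


if __name__ == "__main__":
    for group in find_duplicates(sys.argv[1] if len(sys.argv) > 1 else "."):
        print(group)
```

Grouping by size first means most files are never read at all; only files whose size collides with another file's pay the cost of a full hash.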
Follow-up discussion:
- Is this workload I/O-bound or CPU-bound? How would you determine that?
- How would you process very large files?
- How would you scale to millions or billions of files?
- How would you parallelize the scan and hashing stages? (See the thread-pool sketch after this list.)
- How would you support near-realtime duplicate detection as files are created or modified? (See the watcher sketch after this list.)
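For the parallelization question, one hedged sketch, reusing `hash_file` from the sketch above: because the work is usually dominated by disk reads and CPython's `hashlib` releases the GIL while digesting large buffers, a thread pool is a reasonable default; swap in a process pool if profiling shows hashing is CPU-bound on the target hardware.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def _safe_hash(path):
    """Hash one file, returning None instead of raising if it is unreadable."""
    try:
        return hash_file(path)  # hash_file is defined in the sketch above
    except OSError:
        return None


def hash_group_parallel(paths, max_workers=8):
    """Hash one size-group of candidate files concurrently and bucket by digest."""
    by_hash = defaultdict(list)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for path, digest in zip(paths, pool.map(_safe_hash, paths)):
            if digest is not None:
                by_hash[digest].append(path)
    return [group for group in by_hash.values() if len(group) > 1]
```

Measuring CPU utilization while the tool runs (or comparing wall time against raw sequential-read throughput of the disk) is one way to decide which pool type actually helps.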
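For near-realtime detection, one possible approach, assuming the third-party `watchdog` package (`pip install watchdog`) and again reusing `hash_file`: subscribe to filesystem events, re-hash created or modified files, and check the digest against an in-memory index. A production system would persist the index and debounce rapid successive writes; this sketch only shows the event-driven shape.

```python
import time

from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer


class DedupHandler(FileSystemEventHandler):
    """Re-hash files as they appear or change and look up the digest in an index."""

    def __init__(self, index):
        self.index = index  # digest -> set of paths seen so far

    def on_created(self, event):
        self._update(event)

    def on_modified(self, event):
        self._update(event)

    def _update(self, event):
        if event.is_directory:
            return
        try:
            digest = hash_file(event.src_path)  # from the sketch above
        except OSError:
            return  # file vanished or is unreadable; ignore this event
        peers = self.index.setdefault(digest, set())
        if peers:
            print(f"duplicate detected: {event.src_path} matches {sorted(peers)}")
        peers.add(event.src_path)


def watch(root):
    observer = Observer()
    observer.schedule(DedupHandler({}), root, recursive=True)
    observer.start()
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        observer.stop()
    observer.join()
```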