Problem
You are given a list of files in a filesystem. Each file has:
- a full path (string)
- a file size in bytes (integer)
- file contents (conceptually; assume you can stream/read the file when needed)
Two files are duplicates if their contents are identical (byte-for-byte), regardless of file name or directory.
Task: Return groups of duplicate files. Each group should contain the full paths of files that have identical content, and groups of size 1 should be omitted.
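Before turning to the constraints, a naive baseline is useful as a reference point: hash every file's full contents and group paths by digest. A minimal sketch in Python, assuming `files` is an iterable of `(path, size)` tuples (the function name is illustrative):

```python
import hashlib
from collections import defaultdict

def find_duplicates_baseline(files):
    """files: iterable of (path, size) tuples.

    Hashes every file in full and groups paths by digest.
    Correct but expensive: every file is read, even unique ones.
    """
    groups = defaultdict(list)
    for path, _size in files:
        with open(path, "rb") as f:
            digest = hashlib.sha256(f.read()).digest()
        groups[digest].append(path)
    # Omit groups of size 1: unique files are not duplicates.
    return [paths for paths in groups.values() if len(paths) > 1]
```

The constraints and follow-ups below are about avoiding exactly this pattern of reading everything.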
Constraints / Expectations
- There can be millions of files.
- Reading file contents is expensive.
- Aim to minimize unnecessary content reads and hashing (see the size-grouping sketch after this list).
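One way to meet these constraints is a metadata-only prefilter: group candidates by file size before reading any contents, since a file with a unique size cannot have a duplicate. A sketch under the same assumptions as above:

```python
from collections import defaultdict

def group_by_size(files):
    """files: iterable of (path, size). Returns only the size groups
    with two or more members; every other file is provably unique
    and its contents are never read."""
    by_size = defaultdict(list)
    for path, size in files:
        by_size[size].append(path)
    return [paths for paths in by_size.values() if len(paths) > 1]
```

With millions of mostly-unique files, this single pass over metadata typically discards the bulk of candidates before any content I/O.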
Follow-ups
- Describe an approach that avoids hashing every file if most files are unique (a single sketch addressing all three follow-ups appears after this list).
- How would you handle extremely large files (e.g., >1 GB) without loading them fully into memory?
- If hash collisions are a concern, how do you confirm duplicates?
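A sketch that covers all three follow-ups at once, consuming the size groups produced by `group_by_size` in the earlier sketch: within each size group, files are bucketed by a hash of their first few KiB (cheap, and separates most non-duplicates), then by a streamed full-content hash that uses constant memory even for multi-GB files; an optional byte-by-byte comparison confirms matches when collisions are a concern. Chunk sizes and helper names are illustrative choices, not part of the problem statement.

```python
import hashlib
from collections import defaultdict

PARTIAL = 4096     # bytes hashed in the cheap first pass
CHUNK = 1 << 20    # 1 MiB read size for the streaming passes

def partial_hash(path):
    """Hash only the first PARTIAL bytes: a cheap discriminator."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read(PARTIAL)).digest()

def streamed_hash(path):
    """Hash full contents in CHUNK-sized pieces; memory stays
    O(CHUNK) no matter how large the file is."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(CHUNK), b""):
            h.update(block)
    return h.digest()

def same_bytes(a, b):
    """Byte-for-byte comparison, guarding against hash collisions."""
    with open(a, "rb") as fa, open(b, "rb") as fb:
        while True:
            ba, bb = fa.read(CHUNK), fb.read(CHUNK)
            if ba != bb:
                return False
            if not ba:          # both streams exhausted together
                return True

def refine(paths, key):
    """Re-bucket a candidate group by key(path); keep groups >= 2."""
    buckets = defaultdict(list)
    for p in paths:
        buckets[key(p)].append(p)
    return [g for g in buckets.values() if len(g) > 1]

def find_duplicates(size_groups, verify=False):
    """size_groups: the output of group_by_size from the earlier sketch."""
    result = []
    for size_group in size_groups:
        for pg in refine(size_group, partial_hash):
            for hg in refine(pg, streamed_hash):
                if verify:
                    # Confirm each member against the group's first file;
                    # colliding files are simply dropped in this sketch.
                    hg = [hg[0]] + [p for p in hg[1:] if same_bytes(hg[0], p)]
                    if len(hg) < 2:
                        continue
                result.append(hg)
    return result
```

Because most unique files are eliminated by size alone, and most remaining non-duplicates diverge within their first few KiB, the expensive full-content pass runs only on files that are already very likely duplicates.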
Example
Input files (path, size, content):
- /a/x.txt, 4, "ABCD"
- /b/y.txt, 4, "ABCD"
- /c/z.txt, 4, "WXYZ"
- /d/w.txt, 7, "1234567"
Output:
- ["/a/x.txt", "/b/y.txt"]

/a/x.txt and /b/y.txt both contain "ABCD", so they form one duplicate group; /c/z.txt and /d/w.txt are unique, so their singleton groups are omitted.