Group duplicate files by content
Company: Applied
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: easy
Interview Round: Onsite
## Problem
You are given a list of files in a filesystem. Each file has:
- a full path (string)
- a file size in bytes (integer)
- file contents (conceptually; assume you can stream/read the file when needed)
Two files are **duplicates** if their contents are identical (byte-for-byte), regardless of file name or directory.
**Task:** Return groups of duplicate files. Each group should contain the full paths of files that have identical content, and groups of size 1 should be omitted.
### Constraints / Expectations
- There can be millions of files.
- Reading file contents is expensive.
- Aim to minimize unnecessary content reads / hashing.
### Follow-ups
1. Describe an approach that avoids hashing every file if most files are unique.
2. How would you handle extremely large files (e.g., >GB) without loading them fully into memory?
3. If hash collisions are a concern, how do you confirm duplicates?
### Example
Input files (path, size, content):
- /a/x.txt, 4, "ABCD"
- /b/y.txt, 4, "ABCD"
- /c/z.txt, 4, "WXYZ"
- /d/w.txt, 7, "1234567"
Output:
- ["/a/x.txt", "/b/y.txt"]
Quick Answer: This question evaluates understanding of file deduplication, efficient I/O and hashing strategies, streaming large-file processing, and scalable algorithm design for identifying files with identical byte-for-byte contents.
Given file records [path, size, content], return duplicate path groups for identical contents, omitting singleton groups.
Constraints
- Inputs are Python literals matching the function signature.
- Return a deterministic exact-match value.
Examples
Input: ([("/a/x.txt",4,"ABCD"),("/b/y.txt",4,"ABCD"),("/c/z.txt",4,"WXYZ"),("/d/w.txt",7,"1234567")],)
Expected Output: [['/a/x.txt', '/b/y.txt']]
Explanation: Only the two ABCD files are duplicates.
Input: ([("a",1,"x"),("b",1,"y"),("c",1,"x"),("d",2,"x")],)
Expected Output: [['a', 'c']]
Explanation: Content matches are considered only after size grouping.
Input: ([],)
Expected Output: []
Explanation: No files means no duplicate groups.
Hints
- Group by size before checking contents.
- Return groups in first-seen order for deterministic grading.