Find duplicate files by size
Company: Applied Intuition
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: Medium
Interview Round: Technical Screen
Quick Answer: This question evaluates string parsing, grouping with a hash map, and reasoning about space-time trade-offs when identifying duplicate files by size rather than by content.
Constraints
- 0 <= len(paths) <= 20000
- Each directory description has length between 1 and 2000
- The total number of files across all strings is at most 100000
- 0 <= size <= 10^9
- Directory names and file names contain no spaces, and file names contain no parentheses other than the trailing size marker
Examples
Input: (["root/a 1.txt(100) 2.txt(200) 3.txt(100)", "root/c 4.txt(300)", "root/c/d 4.txt(200)", "root 4.txt(300)"],)
Expected Output: [["root/a/1.txt", "root/a/3.txt"], ["root/a/2.txt", "root/c/d/4.txt"], ["root/c/4.txt", "root/4.txt"]]
Explanation: Size 100 appears in root/a/1.txt and root/a/3.txt, size 200 appears in root/a/2.txt and root/c/d/4.txt, and size 300 appears in root/c/4.txt and root/4.txt.
Input: (["home 1.txt(10) 2.txt(20)", "var 3.log(30)"],)
Expected Output: []
Explanation: Every file size is unique, so there are no duplicate groups.
Input: ([],)
Expected Output: []
Explanation: An empty input has no files and therefore no duplicates.
Input: (["data 0.bin(0)", "backup 1.bin(0) 2.bin(5)", "tmp 3.bin(5)"],)
Expected Output: [["data/0.bin", "backup/1.bin"], ["backup/2.bin", "tmp/3.bin"]]
Explanation: Files of size 0 form one duplicate group, and files of size 5 form another.
Input: (["docs a.txt(7) b.txt(7) c.txt(8)"],)
Expected Output: [["docs/a.txt", "docs/b.txt"]]
Explanation: Within the same directory, a.txt and b.txt share size 7, while c.txt is unique.
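Each directory string in the examples above has a fixed shape: a directory path followed by space-separated "name(size)" tokens. A small parsing sketch (the helper name parse_entry is hypothetical, not part of the expected solution) shows how one string decomposes:

```python
def parse_entry(entry):
    """Split one directory string into (directory, [(name, size), ...])."""
    parts = entry.split()
    directory, files = parts[0], []
    for token in parts[1:]:
        # A token looks like "1.txt(100)": name before '(', size inside.
        name, _, rest = token.partition('(')
        files.append((name, int(rest.rstrip(')'))))
    return directory, files

# parse_entry("root/a 1.txt(100) 2.txt(200)")
# → ("root/a", [("1.txt", 100), ("2.txt", 200)])
```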
Solution
def solution(paths):
    # Map each file size to the list of full paths with that size.
    size_to_paths = {}
    for entry in paths:
        parts = entry.split()
        if not parts:
            continue
        directory = parts[0]
        for file_info in parts[1:]:
            # Each token looks like "name.ext(size)".
            left = file_info.rfind('(')
            right = file_info.rfind(')')
            name = file_info[:left]
            size = int(file_info[left + 1:right])
            full_path = directory + '/' + name
            if size not in size_to_paths:
                size_to_paths[size] = []
            size_to_paths[size].append(full_path)
    # Only sizes shared by two or more files form duplicate groups.
    return [group for group in size_to_paths.values() if len(group) > 1]

Time complexity: O(T), where T is the total number of characters across all input strings. Space complexity: O(F), where F is the total number of files.
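As a quick check, the solution can be exercised against the first example. Since the problem does not fix the order of groups or of paths within a group, the comparison below sorts both sides (a sketch; the function body repeats the solution above so the snippet runs on its own):

```python
def solution(paths):
    size_to_paths = {}
    for entry in paths:
        parts = entry.split()
        if not parts:
            continue
        directory = parts[0]
        for file_info in parts[1:]:
            left = file_info.rfind('(')
            right = file_info.rfind(')')
            name = file_info[:left]
            size = int(file_info[left + 1:right])
            size_to_paths.setdefault(size, []).append(directory + '/' + name)
    return [g for g in size_to_paths.values() if len(g) > 1]

result = solution(["root/a 1.txt(100) 2.txt(200) 3.txt(100)",
                   "root/c 4.txt(300)", "root/c/d 4.txt(200)",
                   "root 4.txt(300)"])
# Groups may come back in any order; sort for a stable comparison.
assert sorted(map(sorted, result)) == sorted(map(sorted, [
    ["root/a/1.txt", "root/a/3.txt"],
    ["root/a/2.txt", "root/c/d/4.txt"],
    ["root/c/4.txt", "root/4.txt"],
]))
```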
Hints
- Use a hash map where the key is the file size and the value is the list of full paths with that size.
- You do not need to compare every pair of files. Parse each file once, build its full path, and append it to the correct group.
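The hash-map grouping in the hints can also be written with collections.defaultdict, which removes the explicit membership check. A compact sketch (the function name duplicates_by_size is illustrative, not required by the problem):

```python
from collections import defaultdict

def duplicates_by_size(paths):
    groups = defaultdict(list)  # size -> list of full paths with that size
    for entry in paths:
        directory, *files = entry.split()
        for token in files:
            # "name.ext(size)": size sits between the last '(' and the final ')'.
            left = token.rfind('(')
            name, size = token[:left], int(token[left + 1:-1])
            groups[size].append(directory + '/' + name)
    # Each file is parsed exactly once; no pairwise comparisons are needed.
    return [group for group in groups.values() if len(group) > 1]

# duplicates_by_size(["docs a.txt(7) b.txt(7) c.txt(8)"])
# → [["docs/a.txt", "docs/b.txt"]]
```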