Delete duplicate files via DFS
Company: Applied Intuition
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: Medium
Interview Round: Technical Screen
##### Question
LeetCode 609. Find Duplicate File in System – Implement an algorithm (using DFS/backtracking) to delete files with duplicate content in a file directory tree, and discuss how you would optimize the solution to run faster.
https://leetcode.com/problems/find-duplicate-file-in-system/description/
Quick Answer: This question evaluates a candidate's skills in file-system traversal using DFS/backtracking, duplicate detection across hierarchical directories, and algorithmic techniques such as hashing and efficient data-structure use.
You are given a directory tree represented as nested dictionaries. Each directory node has keys 'name' (string), 'dirs' (list of directory nodes), and 'files' (list of file nodes). Each file node has keys 'name' (string) and 'content' (string). Two files are duplicates if their 'content' strings are equal. Build absolute paths starting at the root as '/{root_name}/.../{file_name}'. Return a list of absolute file paths that should be deleted so that only one file per unique content remains. Keep exactly one file per content: the file with the lexicographically smallest absolute path. Return the list of paths to delete sorted lexicographically.
Constraints
- 1 <= total number of directories + files <= 20000
- All 'name' fields are non-empty and do not contain '/'
- File 'content' is a string; two files are duplicates if their 'content' strings are equal
- Directory depth can be large; use iterative DFS to avoid recursion-depth issues
- Return the deletion list sorted lexicographically
- Absolute path format: '/' + root name + '/' + ... + '/' + file name
Hints
- Traverse the directory tree with DFS using an explicit stack to build absolute paths.
- Use a hash map from content to the smallest (lexicographic) path seen so far.
- Collect all (content, path) pairs, then delete all paths that are not the minimal path for their content.
- Sort the resulting deletion list for deterministic output.