Delete duplicate files via DFS
Company: Applied Intuition
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: Medium
Interview Round: Technical Screen
Quick Answer: This question evaluates a candidate's skills in file-system traversal using DFS/backtracking, duplicate detection across hierarchical directories, and algorithmic techniques such as hashing and efficient data-structure use.
Constraints
- 1 <= total number of directories + files <= 20000
- All 'name' fields are non-empty and do not contain '/'
- File 'content' is a string; two files are duplicates if their 'content' strings are equal
- Directory depth can be large; use iterative DFS to avoid recursion-depth issues
- Return the deletion list sorted lexicographically
- Absolute path format: '/' + root name + '/' + ... + '/' + file name
Solution
def delete_duplicate_files(tree: dict) -> list:
"""Return absolute paths of files to delete to keep one per content.
The kept file per content is the one with the lexicographically smallest absolute path.
Input tree format:
{
"name": str,
"dirs": [dir_node, ...],
"files": [{"name": str, "content": str}, ...]
}
"""
if not tree or "name" not in tree:
return []
stack = [(tree, "/" + tree["name"])]
files = [] # list of (content, path)
min_path = {} # content -> smallest path
while stack:
node, cur = stack.pop()
for f in node.get("files", []):
path = cur + "/" + f["name"]
content = f.get("content", "")
files.append((content, path))
if content not in min_path or path < min_path[content]:
min_path[content] = path
for d in node.get("dirs", []):
stack.append((d, cur + "/" + d["name"]))
deletions = [path for content, path in files if path != min_path[content]]
deletions.sort()
return deletions
Explanation
Time complexity: O(n log n), where n is the number of files (sorting the deletion list dominates).. Space complexity: O(n) for storing file paths and the content-to-min-path map..
Hints
- Traverse the directory tree with DFS using an explicit stack to build absolute paths.
- Use a hash map from content to the smallest (lexicographic) path seen so far.
- Collect all (content, path) pairs, then delete all paths that are not the minimal path for their content.
- Sort the resulting deletion list for deterministic output.