PracHub
QuestionsPremiumLearningGuidesInterview PrepNEWCoaches

Quick Overview

This question evaluates a candidate's skills in file-system traversal, content deduplication strategies, hashing, large-file I/O handling, concurrency, and robustness to errors and file changes during scans.

  • medium
  • Anthropic
  • Coding & Algorithms
  • Software Engineer

Find Duplicate Files

Company: Anthropic

Role: Software Engineer

Category: Coding & Algorithms

Difficulty: medium

Interview Round: Onsite

Implement a file deduplication tool. You are given a root directory containing many files. Return groups of duplicate files. Two files are duplicates if they have identical content. Your solution should use file size and content hashing to avoid unnecessary full-file comparisons. Requirements: 1. Recursively scan the directory. 2. Group candidate files by file size. 3. For files with the same size, compute content hashes to identify duplicates. 4. Return only groups containing at least two duplicate files. 5. Handle large files without loading entire files into memory at once. 6. Handle errors such as permission failures or files changing during the scan. Follow-up discussion: - Is this workload I/O-bound or CPU-bound? How would you determine that? - How would you process very large files? - How would you scale to millions or billions of files? - How would you parallelize the scan and hashing stages? - How would you support near-realtime duplicate detection as files are created or modified?

Quick Answer: This question evaluates a candidate's skills in file-system traversal, content deduplication strategies, hashing, large-file I/O handling, concurrency, and robustness to errors and file changes during scans.

Implement a duplicate-file detector for a simulated filesystem. You are given a root directory represented as a nested Python dictionary. Each dictionary key is a file or directory name. Node types: - A subdirectory is another dictionary. - A normal file is a tuple `('file', content)`. - An unreadable file is `('error', reason)` and must be skipped. - A file that changes after its size is observed is `('changed', old_content, new_content)`. Use `len(old_content)` for the size-grouping pass, but skip the file during hashing. All file contents are ASCII strings, so `len(content)` is the file size in bytes for this problem. Return groups of duplicate file paths relative to the root, using `/` as the separator. Two files are duplicates if their contents are identical. To avoid unnecessary work: 1. Recursively scan the directory tree. 2. Group candidate files by file size. 3. For files with the same size, compute a content hash. 4. Hash file contents in chunks instead of processing the entire content as one block. 5. Skip unreadable or changed files. 6. Return only groups containing at least two files. Sort each group lexicographically, and sort the final list of groups by each group's first path. Follow-up discussion: Is this workload I/O-bound or CPU-bound? How would you process very large files? How would you scale to millions or billions of files? How would you parallelize scanning and hashing? How would you support near-realtime duplicate detection?

Constraints

  • 0 <= total number of files and directories <= 10^5
  • 0 <= total length of all file contents <= 10^6
  • File and directory names do not contain '/'
  • You may assume SHA-256 collisions do not occur

Examples

Input: {'docs': {'a.txt': ('file', 'hello'), 'b.txt': ('file', 'hello'), 'c.txt': ('file', 'world')}, 'images': {'img1.bin': ('file', '12345'), 'img2.bin': ('file', '12345'), 'img3.bin': ('file', '1234X')}}

Expected Output: [['docs/a.txt', 'docs/b.txt'], ['images/img1.bin', 'images/img2.bin']]

Explanation: There are two duplicate groups. 'world' has the same size as 'hello' but different content, and '1234X' has the same size as '12345' but different content.

Input: {'a.txt': ('file', 'abc'), 'b.txt': ('file', 'abc'), 'c.txt': ('error', 'permission denied'), 'd.txt': ('changed', 'abc', 'abx'), 'sub': {'e.txt': ('file', 'zzz'), 'f.txt': ('file', 'yyy')}}

Expected Output: [['a.txt', 'b.txt']]

Explanation: Unreadable and changed files are skipped. Only 'a.txt' and 'b.txt' remain as duplicates.

Input: {}

Expected Output: []

Explanation: An empty directory tree contains no duplicate files.

Input: {'one.txt': ('file', 'ab'), 'two.txt': ('file', 'cd'), 'sub': {'three.txt': ('file', 'ef')}}

Expected Output: []

Explanation: All files have the same size, but none share identical content.

Input: {'x': {'p.txt': ('file', 'mirror'), 'q.txt': ('file', 'mirror')}, 'y': {'r.txt': ('file', 'mirror'), 's.txt': ('file', 'planet')}}

Expected Output: [['x/p.txt', 'x/q.txt', 'y/r.txt']]

Explanation: Three files contain 'mirror', so they form one duplicate group. 'planet' has the same length but different content.

Input: {'a': ('file', ''), 'b': ('file', ''), 'c': {'d': ('file', 'x')}}

Expected Output: [['a', 'b']]

Explanation: Zero-length files are duplicates if their contents are both empty.

Hints

  1. A file with a unique size cannot have any duplicates, so you do not need to hash it.
  2. Use a recursive DFS to build relative paths, then hash only the files in size groups of at least 2.
Last updated: May 18, 2026

Loading coding console...

PracHub

Master your tech interviews with 7,500+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.

Related Coding Questions

  • Implement a Time-Aware Task Manager - Anthropic (hard)
  • Implement a Simplified DNS Resolver - Anthropic (hard)
  • Implement a Least-Recently-Used Cache - Anthropic (medium)
  • Convert Samples into Event Intervals - Anthropic (medium)
  • Convert State Stream to Events - Anthropic (medium)