PracHub

Find Duplicate Files

Last updated: May 5, 2026

Quick Overview

This question evaluates a candidate's skills in file-system traversal, content deduplication strategies, hashing, large-file I/O handling, concurrency, and robustness to errors and file changes during scans.

Find Duplicate Files

Company: Anthropic

Role: Software Engineer

Category: Coding & Algorithms

Difficulty: Medium

Interview Round: Onsite

Related Interview Questions

  • Convert Samples into Event Intervals - Anthropic (medium)
  • Convert State Stream to Events - Anthropic (medium)
  • Build a concurrent web crawler - Anthropic (medium)
  • Implement a Parallel Image Processor - Anthropic (medium)
  • Implement a Batch Image Processor - Anthropic (medium)

Implement a file deduplication tool.

You are given a root directory containing many files. Return groups of duplicate files. Two files are duplicates if they have identical content.

Your solution should use file size and content hashing to avoid unnecessary full-file comparisons.

Requirements:

  1. Recursively scan the directory.
  2. Group candidate files by file size.
  3. For files with the same size, compute content hashes to identify duplicates.
  4. Return only groups containing at least two duplicate files.
  5. Handle large files without loading entire files into memory at once.
  6. Handle errors such as permission failures or files changing during the scan.
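The requirements above can be sketched as a two-stage pipeline: group by size first (files of different sizes cannot match), then hash only the size-collision candidates in fixed-size chunks. This is a minimal sketch, not a definitive implementation; the function names and the 1 MiB chunk size are illustrative choices.

```python
import hashlib
import os
from collections import defaultdict

CHUNK_SIZE = 1 << 20  # 1 MiB; read large files in chunks, never whole


def hash_file(path):
    """Stream a file through SHA-256 so memory use stays bounded."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    # Stage 1: recursively scan and group candidate files by size.
    by_size = defaultdict(list)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.stat(path).st_size
            except OSError:
                continue  # permission failure, broken symlink, vanished file
            by_size[size].append(path)

    # Stage 2: hash only files that share a size with at least one other.
    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        by_hash = defaultdict(list)
        for path in paths:
            try:
                by_hash[hash_file(path)].append(path)
            except OSError:
                continue  # file deleted or locked mid-scan; skip it
        groups.extend(g for g in by_hash.values() if len(g) >= 2)
    return groups
```

The `try`/`except OSError` blocks address requirement 6: a file that disappears or becomes unreadable between the scan and the hash is simply dropped from consideration rather than crashing the run.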

Follow-up discussion:

  • Is this workload I/O-bound or CPU-bound? How would you determine that?
  • How would you process very large files?
  • How would you scale to millions or billions of files?
  • How would you parallelize the scan and hashing stages?
  • How would you support near-realtime duplicate detection as files are created or modified?
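One way to approach the parallelization follow-up: on most hardware this workload is dominated by blocking disk reads, so Python threads are enough — the interpreter releases the GIL during file I/O, and `hashlib` releases it while digesting large buffers. The sketch below (the `hash_many` name and worker count are illustrative assumptions) hashes a candidate list concurrently; per-file error handling is omitted for brevity.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 1 << 20  # 1 MiB per read


def hash_file(path):
    """Stream a file through SHA-256 in bounded-memory chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            h.update(chunk)
    return h.hexdigest()


def hash_many(paths, workers=8):
    """Hash files concurrently. Threads suffice here because the work
    is dominated by blocking reads, and hashlib drops the GIL when
    updating with large buffers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(paths, pool.map(hash_file, paths)))
```

Whether threads actually help is an empirical question (the first follow-up): if CPU utilization is low while disk queues are full, the scan is I/O-bound and more workers mostly add queue depth; on fast NVMe storage, hashing can become the bottleneck, at which point process-based parallelism or a faster hash is worth measuring.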

