
You are given access to a very large file system: you can enumerate file paths and read file contents. Design an algorithm that identifies groups of byte-identical files (true duplicates) across all directories, returning groups of file paths in which every file has exactly the same contents.

Discuss an approach that minimizes I/O by filtering on file size first, then hashing only each file's leading bytes, and only then computing full-content hashes for the remaining candidates. Explain how you would handle hash collisions, memory constraints, and parallelization; illustrative sketches of these stages follow below.

Extend the design to support incremental updates as files are added or modified, cross-machine deduplication (for example, by exchanging size-and-digest signatures rather than file contents), and safe replacement of duplicates with hard links or content-addressed storage.
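
A minimal single-threaded sketch of the three-stage pipeline in Python. Names such as `find_duplicates`, the 4 KiB partial-read size, and the choice of BLAKE2b are illustrative assumptions, not requirements of the problem:

```python
import hashlib
import os
import stat
from collections import defaultdict

CHUNK = 1 << 16    # 64 KiB read size for streaming hashes
PARTIAL = 4096     # leading bytes hashed in the cheap middle stage

def walk_files(root):
    """Yield (path, size) for every regular file under root, skipping symlinks."""
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path, follow_symlinks=False)
            except OSError:
                continue  # vanished or unreadable; skip it
            if stat.S_ISREG(st.st_mode):
                yield path, st.st_size

def hash_file(path, limit=None):
    """BLAKE2b digest of the first `limit` bytes, or the whole file if limit is None."""
    h = hashlib.blake2b()
    remaining = limit
    with open(path, "rb") as f:
        while remaining is None or remaining > 0:
            n = CHUNK if remaining is None else min(CHUNK, remaining)
            block = f.read(n)
            if not block:
                break
            h.update(block)
            if remaining is not None:
                remaining -= len(block)
    return h.digest()

def find_duplicates(root):
    """Return groups (lists) of paths whose contents are byte-identical."""
    by_size = defaultdict(list)          # stage 1: different sizes cannot match
    for path, size in walk_files(root):
        by_size[size].append(path)

    by_partial = defaultdict(list)       # stage 2: hash only the first 4 KiB
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue
        for path in paths:
            try:
                by_partial[(size, hash_file(path, PARTIAL))].append(path)
            except OSError:
                continue

    by_full = defaultdict(list)          # stage 3: full-content hash of survivors
    for paths in by_partial.values():
        if len(paths) < 2:
            continue
        for path in paths:
            try:
                by_full[hash_file(path)].append(path)
            except OSError:
                continue
    return [group for group in by_full.values() if len(group) > 1]
```

Each stage only processes groups that still have two or more members, so unique sizes and unique leading blocks are discarded without ever reading the full file.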
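
On collisions: with a 512-bit BLAKE2b (or SHA-256) digest, an accidental collision is astronomically unlikely, so many tools stop after the full-hash stage. If the result must be provably exact, a final byte-by-byte comparison within each full-hash group removes any reliance on the hash. A sketch, where the helper name `confirm_group` is an invention for illustration:

```python
import filecmp

def confirm_group(paths):
    """Split a candidate group into sub-groups that are provably byte-identical."""
    confirmed = []
    remaining = list(paths)
    while remaining:
        ref = remaining.pop(0)
        group, rest = [ref], []
        for p in remaining:
            # shallow=False forces a content comparison, not just a stat check
            (group if filecmp.cmp(ref, p, shallow=False) else rest).append(p)
        remaining = rest
        if len(group) > 1:
            confirmed.append(group)
    return confirmed
```

Since the inputs already share a full-content digest, this almost always returns the group unchanged; it exists only to make the guarantee unconditional.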
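
On parallelization and memory: the hashing stages are embarrassingly parallel, since each file can be hashed independently; workers should be sized to the storage rather than the CPU, because hashing is usually I/O-bound. For memory, a size table covering billions of files may not fit in RAM, in which case the (size, path) pairs can be externally sorted or kept in an on-disk key-value store so that same-size runs are processed one group at a time. A sketch of the fan-out, assuming `hash_file` from the first sketch lives in an importable module (the module name `dedup` is hypothetical):

```python
from collections import defaultdict
from concurrent.futures import ProcessPoolExecutor

from dedup import hash_file  # the streaming hash helper from the first sketch

def hash_in_parallel(paths, workers=8):
    """Map each path to its full-content digest using a pool of worker processes."""
    groups = defaultdict(list)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # chunksize batches several paths per task to amortize pickling overhead;
        # pool.map preserves input order, so zip pairs paths with their digests.
        for path, digest in zip(paths, pool.map(hash_file, paths, chunksize=32)):
            groups[digest].append(path)
    return groups
```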
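
For incremental updates, a persistent index keyed by path and storing (size, mtime, digest) lets a rescan skip any file whose size and mtime are unchanged; change notification (inotify on Linux, FSEvents on macOS, the USN journal on Windows) can drive updates between scans. A sketch using SQLite, with an assumed schema and the hypothetical file name `dedup_index.db`:

```python
import os
import sqlite3

from dedup import hash_file  # full-content hash from the first sketch

SCHEMA = """
CREATE TABLE IF NOT EXISTS files (
    path   TEXT PRIMARY KEY,
    size   INTEGER NOT NULL,
    mtime  INTEGER NOT NULL,
    digest BLOB NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_digest ON files (digest);
"""

def refresh_entry(db, path):
    """Re-hash `path` only if its (size, mtime) no longer matches the index."""
    st = os.stat(path)
    row = db.execute(
        "SELECT size, mtime FROM files WHERE path = ?", (path,)
    ).fetchone()
    if row == (st.st_size, st.st_mtime_ns):
        return  # unchanged since the last scan; skip the expensive hash
    db.execute(
        "INSERT OR REPLACE INTO files (path, size, mtime, digest) VALUES (?,?,?,?)",
        (path, st.st_size, st.st_mtime_ns, hash_file(path)),
    )

db = sqlite3.connect("dedup_index.db")
db.executescript(SCHEMA)
```

The same digest index supports cross-machine deduplication: machines exchange (size, digest) signatures and ship file contents only for confirmed matches.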
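
For safe replacement, the key hazards are races (a file may change between verification and replacement, so it should be re-verified immediately before linking) and hard-link semantics (linked files share one inode, hence one set of permissions and one writable copy; content-addressed storage with reflinks avoids this where the file system supports it). A sketch of link-and-atomic-rename replacement, where the `.dedup-tmp` suffix is an illustrative temp-naming scheme:

```python
import os

def replace_with_hardlink(canonical, duplicate):
    """Atomically replace `duplicate` with a hard link to `canonical`."""
    if os.stat(canonical).st_dev != os.stat(duplicate).st_dev:
        raise ValueError("hard links cannot cross file systems")
    tmp = duplicate + ".dedup-tmp"
    os.link(canonical, tmp)         # new directory entry for canonical's inode
    try:
        os.replace(tmp, duplicate)  # atomic on POSIX: no window with a missing file
    except OSError:
        os.unlink(tmp)              # clean up the spare link on failure
        raise
```

Because `os.replace` is an atomic rename, readers see either the old duplicate or the new link, never a missing path.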