This question evaluates understanding of scalable file-deduplication and system-design concepts, including content hashing, I/O minimization, hash-collision handling, memory constraints, parallelization, incremental updates, and cross-machine deduplication.

You are given access to a very large file system, with the ability to list file paths and read file contents. Design an algorithm to identify groups of byte-identical files (true duplicates) across all directories, returning groups of file paths where each group contains two or more duplicates. Discuss an approach that filters candidates by file size first, then applies partial hashing, then full hashing to minimize I/O; explain how to handle hash collisions, memory constraints, and parallelization. Extend the design to support incremental updates as files are added or modified, deduplication across machines, and safe replacement of duplicates with hard links or content-addressed storage.
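For reference, the following is a minimal single-machine sketch of the size → partial-hash → full-hash pipeline a candidate might describe. Names such as find_duplicates, _hash_file, and PARTIAL_READ_BYTES are illustrative assumptions, not part of the prompt; the sketch omits incremental updates, cross-machine coordination, and safe replacement.

```python
import hashlib
import os
from collections import defaultdict

# Illustrative constant: how many leading bytes the partial hash reads.
PARTIAL_READ_BYTES = 4096

def _hash_file(path, partial=False, chunk_size=1 << 20):
    """Return a SHA-256 hex digest of the file at `path`.

    With partial=True, only the first PARTIAL_READ_BYTES are hashed,
    which is enough to split most same-size files cheaply.
    """
    h = hashlib.sha256()
    with open(path, "rb") as f:
        if partial:
            h.update(f.read(PARTIAL_READ_BYTES))
        else:
            while chunk := f.read(chunk_size):
                h.update(chunk)
    return h.hexdigest()

def find_duplicates(root):
    """Yield lists of paths whose contents are byte-identical.

    Three filtering stages, each more expensive than the last:
      1. group by file size (stat only, no reads),
      2. within a size group, group by a hash of the first few KB,
      3. within a partial-hash group, group by a full-content hash.
    """
    by_size = defaultdict(list)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                by_size[os.path.getsize(path)].append(path)
            except OSError:
                continue  # skip unreadable or vanished files

    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        by_partial = defaultdict(list)
        for path in paths:
            by_partial[_hash_file(path, partial=True)].append(path)

        for candidates in by_partial.values():
            if len(candidates) < 2:
                continue
            by_full = defaultdict(list)
            for path in candidates:
                by_full[_hash_file(path)].append(path)
            for group in by_full.values():
                if len(group) > 1:
                    yield group

if __name__ == "__main__":
    for group in find_duplicates("/data"):  # "/data" is a placeholder root
        print(group)
```

A strong answer would go beyond this sketch: add an optional byte-by-byte comparison within each full-hash group if hash collisions must be ruled out absolutely, bound memory by spilling size or hash buckets to disk, and parallelize by distributing size buckets across worker processes or machines.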