Detect duplicate files by content

Q: Detect duplicate files by content

This question evaluates filesystem traversal, efficient I/O and memory use, hashing strategies for content comparison, duplicate-detection algorithms, and time/space complexity analysis.

Question

Detect duplicate files by content in a filesystem. Given access to a directory tree, return groups of file paths whose byte content is identical. Minimize disk I/O by first grouping files by size, then hashing within each size group (e.g., a fast hash followed by a cryptographic hash when fast hashes collide), and finally verifying candidates byte by byte. Stream reads so that very large files never have to fit in memory, and handle permission errors and symbolic links gracefully. Provide working code and analyze time and space complexity.
