Design an algorithm to identify duplicate files in a large directory tree. You are given an iterator over files yielding (path, size) pairs and a function read_chunks(path) -> Iterator[bytes]. Requirements:

- minimize I/O by comparing sizes first and using rolling or cryptographic hashes to filter candidates;
- handle hash collisions safely;
- support datasets that do not fit in memory;
- output groups of paths that are byte-for-byte identical.

Explain the time/space trade-offs and how you would parallelize the solution.
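
A minimal sketch of one such staged pipeline, assuming the given files iterator yields (path, size) tuples. The read_chunks body, the 1 MiB block size, and the choice of SHA-256 are illustrative assumptions standing in for whatever the environment actually provides:

```python
import hashlib
from collections import defaultdict
from itertools import zip_longest
from typing import Iterable, Iterator

CHUNK = 1 << 20  # 1 MiB blocks; an illustrative choice


def read_chunks(path: str) -> Iterator[bytes]:
    # Stand-in for the provided read_chunks(); yields fixed-size blocks.
    with open(path, "rb") as f:
        while block := f.read(CHUNK):
            yield block


def head_digest(path: str) -> bytes:
    # Hash only the first block: a cheap filter that avoids reading whole files.
    for block in read_chunks(path):
        return hashlib.sha256(block).digest()
    return b""  # empty file


def full_digest(path: str) -> bytes:
    # Streaming full-content hash; memory use is one block at a time.
    h = hashlib.sha256()
    for block in read_chunks(path):
        h.update(block)
    return h.digest()


def identical(a: str, b: str) -> bool:
    # Optional byte-for-byte check for absolute collision safety. Files
    # compared here always have equal sizes, so their block streams align.
    return all(x == y for x, y in zip_longest(read_chunks(a), read_chunks(b)))


def duplicate_groups(files: Iterable[tuple[str, int]]) -> list[list[str]]:
    # Stage 1: bucket by size -- no I/O at all; unique sizes drop out here.
    by_size: dict[int, list[str]] = defaultdict(list)
    for path, size in files:
        by_size[size].append(path)

    # Stage 2: within each size bucket, bucket by a hash of the first block.
    by_head: dict[tuple[int, bytes], list[str]] = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) > 1:
            for path in paths:
                by_head[(size, head_digest(path))].append(path)

    # Stage 3: full-content hash for the few files surviving both filters.
    by_full: dict[bytes, list[str]] = defaultdict(list)
    for paths in by_head.values():
        if len(paths) > 1:
            for path in paths:
                by_full[full_digest(path)].append(path)

    return [g for g in by_full.values() if len(g) > 1]
```

On the trade-offs this sketch makes: most files are never read past their first block, and only files sharing both a size and a head digest pay for a full read. The bookkeeping holds only paths and 32-byte digests; if even that exceeds RAM, the same (size, digest) keys can be spilled to disk and grouped with an external sort. With SHA-256, collision probability is cryptographically negligible, but a final byte compare within each group (the identical helper) settles it outright. Stages 2 and 3 are embarrassingly parallel across files; since the work is I/O-bound, one natural approach is mapping head_digest and full_digest over each bucket with concurrent.futures.ThreadPoolExecutor.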