Implement file deduplication at scale

Q: Implement file deduplication at scale

This question evaluates a candidate's ability to design and implement scalable file deduplication tools, covering competencies in file-system and OS APIs, efficient hashing and large-file handling, I/O and memory trade-offs, and handling filesystem semantics such as symlinks, permissions, and error cases.

Question

Implement a command-line tool to find duplicate files in a directory tree. Use OS/pathlib primitives to recursively enumerate files. Apply prefilters (e.g., group by file size and/or first N bytes) to narrow candidates, then confirm duplicates with a strong hash (chunked for large files). Discuss handling of large files, memory and I/O trade-offs, symlinks and permissions, error cases, and how to output groups of duplicates with options to hard-link, move, or delete. Analyze time and space complexity.
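The pipeline described above (enumerate, group by size, narrow by first N bytes, confirm with a chunked strong hash) might be sketched as follows. This is a minimal illustration rather than a reference solution: the names `iter_files`, `regroup`, and `find_duplicates`, the 4 KiB prefix, the 1 MiB chunk size, and the choice of SHA-256 are all assumptions made for the example. Symlinks are skipped rather than followed, and unreadable files are silently ignored.

```python
import hashlib
import os
from collections import defaultdict
from pathlib import Path

CHUNK_SIZE = 1 << 20   # 1 MiB per read; assumed value, tune for the storage in use
PREFIX_SIZE = 4096     # bytes compared in the cheap prefilter; assumed value


def iter_files(root):
    """Yield (path, size) for regular files under root, skipping symlinks."""
    # os.walk ignores unreadable directories by default (onerror=None).
    for dirpath, _dirnames, filenames in os.walk(root, followlinks=False):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                if path.is_symlink() or not path.is_file():
                    continue
                yield path, path.stat().st_size
            except OSError:
                continue  # permission denied, file vanished, etc.


def prefix_of(path):
    """Return the first PREFIX_SIZE bytes of the file."""
    with open(path, "rb") as fh:
        return fh.read(PREFIX_SIZE)


def digest_of(path):
    """Return the SHA-256 hex digest, reading the file in fixed-size chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(CHUNK_SIZE), b""):
            h.update(chunk)
    return h.hexdigest()


def regroup(groups, key_fn):
    """Split each candidate group by key_fn and drop groups with one member."""
    result = []
    for group in groups:
        buckets = defaultdict(list)
        for path in group:
            try:
                buckets[key_fn(path)].append(path)
            except OSError:
                continue  # unreadable file: leave it out of every group
        result.extend(g for g in buckets.values() if len(g) > 1)
    return result


def find_duplicates(root):
    """Return a list of groups (lists of Paths) whose contents hash identically."""
    by_size = defaultdict(list)
    for path, size in iter_files(root):
        by_size[size].append(path)
    groups = [g for g in by_size.values() if len(g) > 1]
    groups = regroup(groups, prefix_of)   # cheap: one small read per candidate
    groups = regroup(groups, digest_of)   # expensive: full read, survivors only
    return groups
```

Calling `find_duplicates(Path("."))` returns the duplicate groups for the current directory tree. Memory use is proportional to the number of files, not their total size, because each file is read in fixed-size chunks. In the worst case (many same-sized files with identical prefixes) every byte is read once for the hash, so time is linear in total bytes. The sketch treats matching SHA-256 digests as confirmation; a stricter tool could add a final byte-by-byte comparison. It also reports files that are already hard links to each other as duplicates, which a fuller version might detect by comparing `st_dev` and `st_ino`.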
