This question evaluates a candidate's ability to design and implement a scalable file deduplication system, covering tiered hashing strategies, efficient batched disk I/O, parallelization across cores or machines, chunking of very large files, and safe filesystem operations such as hard-linking.

Write a program to deduplicate files in a very large directory tree. Identify groups of identical files without loading entire files into memory. Outline your approach to hashing (e.g., size filter, partial hash, full hash), chunking for very large files, and handling hash collisions. Support a mode that replaces duplicates with hard links (when safe) and a dry-run report of duplicate sets. Explain time and space complexity, how you batch disk I/O, and how you would parallelize across CPU cores or machines.