This question evaluates the system-design and practical engineering competencies required to build resilient file deduplication tools, covering filesystem semantics, error handling, hashing and collision considerations, concurrency, performance bottlenecks, cross-platform interoperability, and safety trade-offs.

You are designing a file deduplication tool that scans large directory trees on one or more filesystems, identifies duplicate files by content, and safely deduplicates them (e.g., by deleting extra copies, replacing them with hard links, or using copy-on-write reflinks). The tool must be resilient to real-world issues.
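As a concrete starting point, here is a minimal sketch of the usual candidate-grouping approach: partition files by size first (cheap metadata), then hash only the files that share a size. This is purely illustrative; the helper names (`file_digest`, `find_duplicates`) are hypothetical, and a real tool would add partial-content pre-hashing, logging, and the safeguards the question asks about.

```python
import hashlib
import os
import stat
from collections import defaultdict

CHUNK = 1 << 20  # 1 MiB read size

def file_digest(path):
    """Full-content hash of a file (BLAKE2b here; any strong hash works)."""
    h = hashlib.blake2b()
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(root):
    """Group candidate duplicates: first by size (cheap), then by content hash."""
    by_size = defaultdict(list)
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: never follow symlinks
            except OSError:
                continue  # vanished or unreadable; a real tool would log this
            if stat.S_ISREG(st.st_mode) and st.st_size > 0:
                by_size[st.st_size].append(path)
    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size can never have a duplicate
        by_hash = defaultdict(list)
        for p in paths:
            try:
                by_hash[file_digest(p)].append(p)
            except OSError:
                continue  # unreadable mid-scan; skip rather than abort
        groups.extend(g for g in by_hash.values() if len(g) > 1)
    return groups

if __name__ == "__main__":
    import sys
    for group in find_duplicates(sys.argv[1]):
        print(group)
```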
What problems can a file deduplication program encounter in real-world use, and how would you mitigate them? Address each of the following, providing concrete strategies and trade-offs:

- Filesystem semantics
- Error handling
- Hashing and collision considerations
- Concurrency
- Performance bottlenecks
- Cross-platform interoperability
- Safety trade-offs

Assume a cross-platform target (Linux/macOS/Windows) unless noted otherwise. Include implementation-oriented guidance (APIs, flags, data structures) where helpful; one safety-oriented sketch follows as an example of the expected level of detail.
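For instance, one mitigation an answer might include for the safety and concurrency points is replacing a duplicate with a hard link via a temporary name plus an atomic rename, so the path never disappears mid-operation. The sketch below is hedged and illustrative, not a reference implementation: `replace_with_hardlink` and the temp-name scheme are hypothetical, while `os.link`, `os.replace`, and the `st_dev`/`st_ino` checks are standard Python/POSIX facilities (`os.replace` is atomic on POSIX when source and target are on the same filesystem).

```python
import os

def replace_with_hardlink(keep, dup):
    """Replace `dup` with a hard link to `keep`, atomically via rename.

    Hard links require both paths on the same filesystem (same st_dev);
    the link is created under a temporary name, then renamed over the
    duplicate so the path never disappears mid-operation.
    """
    st_keep, st_dup = os.stat(keep), os.stat(dup)
    if st_keep.st_dev != st_dup.st_dev:
        raise ValueError("hard links cannot cross filesystems")
    if st_keep.st_ino == st_dup.st_ino:
        return  # already the same inode: nothing to do
    tmp = os.path.join(
        os.path.dirname(os.path.abspath(dup)),
        f".dedup-tmp-{os.getpid()}-{os.urandom(4).hex()}",  # hypothetical naming scheme
    )
    os.link(keep, tmp)
    try:
        os.replace(tmp, dup)  # atomic on POSIX; may fail on Windows if dup is open
    except OSError:
        os.unlink(tmp)  # clean up the temporary link on failure
        raise
```

A strong answer would also note the trade-off this sketch leaves open: re-verifying that `dup` is still byte-identical to `keep` immediately before linking, since the file may have changed between the scan and the action.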