# System Design: Robust File Deduplication in the Real World
## Context
You are designing a file deduplication tool that scans large directory trees on one or more filesystems, identifies duplicate files by content, and safely deduplicates them (e.g., by deleting extra copies, replacing them with hard links, or using copy-on-write reflinks). The tool must be resilient to real-world issues.
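To make the "safe dedupe" requirement concrete, here is a minimal sketch of one action: replacing a verified duplicate with a hard link. The function name, the `.dedupe-tmp` suffix, and the POSIX-only assumptions (hard links and atomic `os.replace` on the same filesystem; Windows would need `CreateHardLinkW` and different replace semantics) are illustrative, not a prescribed design.

```python
import filecmp
import os

def replace_with_hard_link(keep, dup):
    """Replace `dup` with a hard link to `keep` (POSIX sketch).

    Hypothetical helper: verifies content byte-for-byte first, then
    swaps atomically so a crash never leaves both copies lost.
    """
    # Final byte-for-byte compare guards against hash collisions and
    # files that changed after they were indexed (TOCTOU).
    if not filecmp.cmp(keep, dup, shallow=False):
        raise ValueError("files differ; refusing to dedupe")
    # Link under a temporary name in the same directory, then rename
    # over the duplicate; os.replace is atomic on POSIX filesystems.
    tmp = dup + ".dedupe-tmp"  # assumed suffix; collision handling omitted
    os.link(keep, tmp)
    os.replace(tmp, dup)
```

Note the order of operations: the new link exists before the old name is replaced, so at no point is the data reachable from fewer names than before.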
## Prompt
What problems can a file deduplication program encounter in real-world use, and how would you mitigate them? Address each of the following, providing concrete strategies and trade-offs:
- Permission and I/O errors
- Files changing during the scan (TOCTOU)
- Path length limits and path normalization
- Symlink loops and mount/device boundaries
- Memory and disk bottlenecks while indexing and hashing
- Hash function choice and collision risks
- Partial vs. full hashing and final verification
- Concurrency, scheduling, and throttling
- Incremental rescans and checkpointing
- Safe dedupe actions (verification before deletion, hard links vs. reflinks, metadata/ACLs)
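Several of the traversal items above (permission errors, symlink loops, mount boundaries) can be illustrated in one sketch. The function below is a hypothetical example, assuming POSIX `lstat` semantics: it never follows symlinks, refuses to cross onto another device, tolerates per-entry I/O errors, and uses a visited set of `(st_dev, st_ino)` pairs to break directory loops (e.g., from bind mounts).

```python
import os

def walk_one_device(root):
    """Yield regular-file paths under `root` without following symlinks,
    without crossing filesystems, and without looping. Illustrative only;
    real error handling would log skipped entries rather than drop them.
    """
    root_dev = os.lstat(root).st_dev
    seen = set()                      # (st_dev, st_ino) of visited dirs
    stack = [root]
    while stack:
        d = stack.pop()
        try:
            st = os.lstat(d)
        except OSError:
            continue                  # permission or I/O error: skip
        if st.st_dev != root_dev:
            continue                  # do not cross mount points
        key = (st.st_dev, st.st_ino)
        if key in seen:
            continue                  # directory loop guard
        seen.add(key)
        try:
            with os.scandir(d) as it:
                for entry in it:
                    if entry.is_dir(follow_symlinks=False):
                        stack.append(entry.path)
                    elif entry.is_file(follow_symlinks=False):
                        yield entry.path
        except OSError:
            continue                  # unreadable directory: skip
```

An explicit stack instead of recursion also sidesteps recursion-depth limits on very deep trees, which is one practical answer to the path-length item.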
Assume a cross-platform target (Linux/macOS/Windows) unless noted otherwise. Include implementation-oriented guidance (APIs, flags, data structures) where helpful.
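As a starting point for the "partial vs. full hashing" item, one common pipeline is: group by size, then by a hash of the first few KiB, then by a full-content hash, and only byte-compare the survivors before acting. The sketch below assumes that design; the 4 KiB partial size and the choice of BLAKE2b are tunable assumptions, not requirements.

```python
import hashlib
import os
from collections import defaultdict

PARTIAL = 4096  # bytes hashed in the cheap first pass (assumed; tunable)

def _digest(path, limit=None):
    """BLAKE2b of the first `limit` bytes of `path` (whole file if None)."""
    h = hashlib.blake2b()
    remaining = limit
    with open(path, "rb") as f:
        while True:
            size = 65536 if remaining is None else min(65536, remaining)
            chunk = f.read(size)
            if not chunk:
                break
            h.update(chunk)
            if remaining is not None:
                remaining -= len(chunk)
                if remaining <= 0:
                    break
    return h.digest()

def duplicate_groups(paths):
    """Group candidate duplicates: size -> partial hash -> full hash.

    Files that survive all three filters still warrant a byte-for-byte
    compare immediately before any destructive action.
    """
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)
    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue                          # unique size: cannot match
        by_partial = defaultdict(list)
        for p in same_size:
            by_partial[_digest(p, PARTIAL)].append(p)
        for same_partial in by_partial.values():
            if len(same_partial) < 2:
                continue                      # unique prefix: cannot match
            by_full = defaultdict(list)
            for p in same_partial:
                by_full[_digest(p)].append(p)
            groups.extend(g for g in by_full.values() if len(g) > 1)
    return groups
```

The staged filters mean most files are never fully read: only files whose size and prefix both collide pay the cost of a full-content hash.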