Determine Whether Two Files Are Identical at Scale
Context
You need a robust, scalable method to decide if two files (potentially very large and located across different directories or machines) are bitwise identical. The solution should minimize CPU, memory, I/O, and network costs while keeping collision risk negligible and supporting a deduplication workflow.
Requirements
Describe a step-by-step approach that covers:
- Choice of hashing algorithms (e.g., MD5 vs SHA-256) and the rationale.
- How to mitigate collision risk and when to fall back to byte-by-byte verification.
- Strategies to stream and hash very large files efficiently.
- What metadata (e.g., size, modification time) you would check before hashing.
- How to integrate this into a deduplication workflow across directories or machines.
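As a starting point for discussion, the requirements above could be sketched roughly as follows. This is a minimal illustration, not a reference answer. It assumes local file paths, SHA-256 as the digest, and a 1 MiB read size. The names `sha256_of`, `files_identical`, `group_duplicates`, and `CHUNK_SIZE` are hypothetical and chosen only for this example.

```python
import filecmp
import hashlib
import os
from collections import defaultdict

CHUNK_SIZE = 1 << 20  # 1 MiB per read; an assumed value, tune for your storage


def sha256_of(path, chunk_size=CHUNK_SIZE):
    """Hash a file in fixed-size chunks so memory use stays constant."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def files_identical(path_a, path_b, verify_bytes=True):
    """Return True if the two files have the same content."""
    # 1. Cheap metadata check: files of different sizes cannot be identical.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    # 2. Compare streamed digests; a mismatch proves the files differ.
    if sha256_of(path_a) != sha256_of(path_b):
        return False
    # 3. Optional fallback: a full byte-by-byte comparison rules out
    #    even a theoretical hash collision.
    if verify_bytes:
        return filecmp.cmp(path_a, path_b, shallow=False)
    return True


def group_duplicates(paths):
    """Group paths by size first, then by digest; return groups of 2 or more."""
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size means no possible duplicate, skip hashing
        by_hash = defaultdict(list)
        for p in same_size:
            by_hash[sha256_of(p)].append(p)
        groups.extend(g for g in by_hash.values() if len(g) > 1)
    return groups
```

When both files sit on the same machine, a direct byte comparison reads each file once and may be cheaper than hashing both. The digest step pays off mainly when files live on different machines (only the digest crosses the network) or when many files must be grouped, as in `group_duplicates`. Modification time is deliberately not used as an equality signal in this sketch, since identical content can carry different timestamps.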