Determine Byte-Identical Files Using Hashing (Ignore Filesystem Metadata)
Context
Assume we care only about the raw file contents (the byte stream read via open/read). Ignore filesystem metadata such as timestamps, permissions, and extended attributes. The goal is to determine if two files have identical bytes and to scale the approach to large repositories.
Tasks
- Propose a hashing-based approach to determine byte-identical files. Cover:
  - Algorithm choices: cryptographic vs non-cryptographic; e.g., SHA-256 vs MD5.
  - A multi-stage verification plan: size check → fast hash → cryptographic hash → optional byte-by-byte confirmation.
  - Collision handling and false-positive mitigation.
  - Performance trade-offs (CPU vs I/O, memory, parallelism).
- Extend to large-scale operation (petabytes of data):
  - Whole-file deduplication vs chunked deduplication (fixed-size vs content-defined chunking).
  - Streaming/parallel computation.
  - Persistent hash index design.
  - How a dedicated service would compute, store, and serve comparisons reliably.
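
As a starting point for the multi-stage verification plan named in the tasks, the following is a minimal Python sketch, not a reference solution. It narrows candidate groups by file size, then by a cheap non-cryptographic hash, then by a streamed SHA-256, and optionally confirms with a byte-by-byte comparison. The function names (`fast_hash`, `sha256_file`, `find_duplicates`), the 1 MiB read size, and the use of CRC32 over only the first chunk as the fast stage are assumptions made for illustration; a real system might pick a different fast hash or sample more of the file.

```python
import filecmp
import hashlib
import zlib
from collections import defaultdict
from pathlib import Path

CHUNK = 1 << 20  # 1 MiB read size; an arbitrary choice for this sketch


def fast_hash(path):
    """Cheap pre-filter (assumed design): CRC32 of the first chunk only."""
    with open(path, "rb") as f:
        return zlib.crc32(f.read(CHUNK))


def sha256_file(path):
    """Stream the whole file through SHA-256 without loading it into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(CHUNK), b""):
            h.update(block)
    return h.hexdigest()


def _split(groups, key):
    """Re-partition each candidate group by key(path); drop singletons."""
    out = []
    for group in groups:
        buckets = defaultdict(list)
        for p in group:
            buckets[key(p)].append(p)
        out.extend(g for g in buckets.values() if len(g) > 1)
    return out


def _confirm(group):
    """Byte-by-byte confirmation: split a group into truly identical classes."""
    classes = []
    for p in group:
        for cls in classes:
            if filecmp.cmp(cls[0], p, shallow=False):
                cls.append(p)
                break
        else:
            classes.append([p])
    return [c for c in classes if len(c) > 1]


def find_duplicates(paths, confirm=True):
    """Return lists of paths whose contents are byte-identical."""
    groups = [list(paths)]
    # Stages run cheapest first, so expensive hashing only touches survivors.
    for key in (lambda p: Path(p).stat().st_size, fast_hash, sha256_file):
        groups = _split(groups, key)
    if confirm:
        groups = [c for g in groups for c in _confirm(g)]
    return groups
```

Calling `find_duplicates(list_of_paths)` returns one list per set of identical files. Passing `confirm=False` skips the final comparison and trusts SHA-256 equality alone, which trades a negligible collision risk for one fewer read of each candidate file.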