Determine identical files ignoring metadata
Company: Harvey AI
Role: Software Engineer
Category: System Design
Difficulty: hard
Interview Round: Technical Screen
##### Question
How would you determine whether two files contain identical bytes while ignoring all filesystem metadata? Propose and justify a hashing-based approach, then extend it to large-scale, cross-machine deduplication.
Address the following:
1. **Algorithm choice and rationale.** Cryptographic versus non-cryptographic hashing (e.g., SHA-256 vs MD5 vs BLAKE3 vs xxHash); why you would or would not use each, and which digest you would treat as authoritative for the equality decision.
2. **Multi-stage verification.** A pipeline that combines a size check, an optional fast prefilter hash, an authoritative cryptographic hash, and an optional byte-by-byte confirmation, plus the conditions under which each stage runs.
3. **Collision handling and false-positive mitigation.** Birthday-bound reasoning, when collisions matter, and how to defend against adversarial (chosen-prefix) inputs versus non-adversarial data.
4. **Metadata to check before hashing.** What attributes (size, mtime/ctime, inode/device id, file type) you would use to prefilter, detect hard links, and cache digests to avoid rehashing unchanged files.
5. **Efficient streaming for very large files.** Chunked/streaming reads, buffer sizing, intra-file parallelism (e.g., Merkle trees / BLAKE3), sparse-file handling, and TOCTOU guardrails for files that change mid-read.
6. **Large-scale deduplication.** Whole-file vs fixed-size vs content-defined chunking (CDC); persistent hash index design; and how a dedicated service would compute, store, and serve comparisons reliably across directories and machines.
7. **Performance trade-offs.** Where the bottleneck lies (I/O vs CPU vs memory) and how your design responds to it.
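
As a reference point for the birthday-bound reasoning in item 3, here is a rough worked estimate. It assumes an ideal hash with $b$ output bits and $n$ distinct, non-adversarial files; the symbols $n$, $b$, and $p$ are introduced here for illustration only.

$$
p(\text{any collision}) \;\approx\; 1 - e^{-n(n-1)/2^{b+1}} \;\le\; \frac{n^2}{2^{b+1}}
$$

$$
\text{SHA-256},\ n = 2^{40}: \quad p \le \frac{2^{80}}{2^{257}} = 2^{-177}
$$

$$
\text{64-bit hash},\ n = 2^{32}: \quad p \approx 1 - e^{-2^{64}/2^{65}} = 1 - e^{-1/2} \approx 0.39
$$

Under these assumptions a 256-bit digest is safe to treat as authoritative for accidental collisions at any realistic scale, while a 64-bit prefilter hash starts colliding at a few billion files and should only be used to rule candidates out. The estimate says nothing about adversarial inputs, where the algorithm's known collision attacks matter rather than its output width.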
Quick Answer: A Harvey AI system design screen on determining whether two files are byte-identical while ignoring metadata, then scaling to deduplication. It covers hashing algorithm choice (SHA-256/BLAKE3 vs MD5/xxHash), multi-stage verification, collision and false-positive mitigation, streaming and parallel hashing, content-defined chunking, and persistent hash-index/service design.
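
The multi-stage pipeline named in the question can be sketched roughly as follows. This is a minimal illustration, not a reference solution: the function names (`sha256_of`, `bytes_equal`, `files_identical`), the 1 MiB buffer size, and the choice of SHA-256 as the authoritative hash are all assumptions made for the example. It skips the optional fast prefilter hash, digest caching, and TOCTOU checks.

```python
import hashlib
import os
import stat

CHUNK_SIZE = 1024 * 1024  # 1 MiB read buffer; an assumed value, tune per storage


def sha256_of(path, chunk_size=CHUNK_SIZE):
    """Stream the file through SHA-256 so memory use stays constant."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def bytes_equal(path_a, path_b, chunk_size=CHUNK_SIZE):
    """Optional final stage: compare contents chunk by chunk."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if a != b:
                return False
            if not a:  # both reads hit EOF at the same offset
                return True


def files_identical(path_a, path_b, confirm_bytes=False):
    """Size check, then authoritative hash, then optional byte comparison."""
    st_a, st_b = os.stat(path_a), os.stat(path_b)
    if not (stat.S_ISREG(st_a.st_mode) and stat.S_ISREG(st_b.st_mode)):
        raise ValueError("only regular files are compared")
    # Same device and inode means the same underlying file (e.g. hard links).
    if (st_a.st_dev, st_a.st_ino) == (st_b.st_dev, st_b.st_ino):
        return True
    # Cheapest rejection: different sizes can never be byte-identical.
    if st_a.st_size != st_b.st_size:
        return False
    if sha256_of(path_a) != sha256_of(path_b):
        return False
    return bytes_equal(path_a, path_b) if confirm_bytes else True
```

For a single pair of local files, a direct byte comparison after the size check reads each file once and can exit at the first difference, so hashing pays off mainly when digests are cached or compared across machines.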