Hashing-Based File Identity

What's being tested

This tests whether you can design a file identity mechanism that is correct, scalable, and practical under real I/O constraints. The interviewer is probing for how you reason about content equality, hashing, collision risk, streaming reads, and persistent indexes for deduplication. For a Software Engineer at Harvey, this matters because document-heavy systems often need to avoid duplicate storage, verify uploads, compare versions, and preserve correctness when user-visible legal or business documents are involved. A strong answer balances theoretical guarantees with engineering pragmatism: when to hash, when to compare bytes, how to handle huge files, and how to make the system work across machines.

Core knowledge

File identity should start with a crisp definition: identical by bytes, identical by normalized content, or identical ignoring filesystem metadata like filename, mtime, permissions, owner, and path. Most system design answers should assume byte-for-byte equality unless explicitly asked to normalize.
Metadata prefilters reduce expensive work. Compare file size first; unequal sizes imply unequal byte content. For local files, size plus inode/device can identify hard links. For distributed storage, size is a cheap index key but never a proof of equality.
Cryptographic hashes like SHA-256 or BLAKE3 are standard for content identity because they are collision-resistant. A hash maps arbitrary bytes to fixed width: SHA-256 gives 256 bits, so random collision probability is approximately $p \approx \frac{n^2}{2^{257}}$ for $n$ files.
Non-cryptographic hashes like xxHash, MurmurHash3, or CRC32C are faster but not collision-resistant. They can be useful as early filters or checksums, but for identity in adversarial or correctness-sensitive contexts, prefer SHA-256, SHA-512/256, or BLAKE3.
Collision handling is still part of a complete design. Even with SHA-256, store (size, hash) and, when absolute certainty is required, perform a final byte-by-byte comparison for candidates sharing both size and hash. This is especially relevant for legal documents or adversarial uploads.
Streaming I/O avoids loading large files into memory. Read chunks such as 4 MiB or 8 MiB, update the hash incrementally, and keep memory $O(1)$ per file. In most systems, hashing throughput is bounded by disk or network I/O before CPU, unless using very fast local NVMe or many parallel workers.
Partial hashing can improve dedup search but is not sufficient for identity. For example, hash the first and last 64 KiB plus size to bucket likely duplicates, then compute the full hash only for candidate groups. This is a performance optimization, not a correctness proof.
Persistent hash indexes support scalable deduplication. A common table schema is (tenant_id, size_bytes, content_hash, storage_pointer, ref_count, created_at), with a composite index on (tenant_id, size_bytes, content_hash). Tenant scoping matters when dedup across customers has privacy or policy implications.
Concurrency control is required when two identical files upload simultaneously. Use a database uniqueness constraint on (tenant_id, size_bytes, content_hash) or an atomic compare-and-insert flow. Without this, two workers can compute the same hash and both store duplicate blobs.
Chunk-level deduplication divides files into blocks and hashes each block. Fixed-size chunks are simple but shift poorly when bytes are inserted near the beginning. Content-defined chunking, such as Rabin fingerprinting, handles insertions better but adds complexity and metadata overhead.
Ignoring metadata means reading file contents, not relying on filesystem attributes. Filename, timestamps, extended attributes, permissions, and storage location should be excluded unless the product definition says they are part of identity. Be explicit about symlinks: compare link target bytes, link path, or resolved file contents.
Distributed environments add failure modes around retries and object storage consistency. Compute the hash while streaming upload, store it with the blob in S3-like object storage, and make the database commit the source of truth. Treat upload retry idempotency separately from file content identity.

Worked example

For Determine if two files are identical, a strong candidate starts by clarifying: “Do we mean byte-for-byte identical, and can the files be larger than memory? Are these local files or remote objects? Is adversarial input possible?” Then they declare assumptions: byte equality, potentially large files, and correctness matters more than shaving a few milliseconds.

The skeleton answer has four pillars. First, compare cheap metadata: if size_bytes differs, return not identical. Second, stream both files in chunks and either compare chunks directly or compute hashes incrementally. Third, if using hashes, choose SHA-256 or BLAKE3, not MD5, and explain collision risk. Fourth, handle operational concerns: read errors, permissions, symlinks, and not loading entire files into memory.

A good design decision to flag is direct streaming comparison versus hashing. If you only need to compare two files once, direct chunk-by-chunk comparison can be better: it exits early on the first mismatch and avoids storing any digest. If you will compare the same files against many others later, computing and persisting a content hash pays off.

A polished close would be: “For a one-off local comparison, I’d do size check plus chunked byte comparison. For a deduplication system, I’d compute a strong streaming hash, index by (size, hash), and optionally byte-compare on collision candidates. If I had more time, I’d cover chunk-level deduplication and concurrent upload races.”

A second angle

For Determine identical files ignoring metadata, the same core concept applies, but the key move is separating content identity from file record identity. The candidate should explicitly exclude filename, path, permissions, mtime, owner, and storage-specific metadata from the equality decision. The design then becomes: read the content stream, compute a stable content hash, and compare (size, hash) rather than comparing filesystem stat structures.

The constraints can change if “metadata” includes embedded document metadata, such as PDF author fields or timestamps inside the file bytes. In that case, byte-for-byte hashing is no longer enough; you need a normalization layer that parses the document and emits canonical bytes before hashing. That is a different and riskier design because parsers, canonicalization rules, and format-specific edge cases become part of correctness.

Common pitfalls

Pitfall: Saying “just use MD5 and compare the hash” as a complete answer.

MD5 is fast but broken for collision resistance, and a hash alone is never a mathematical proof of equality. A better answer is size prefilter, streaming SHA-256 or BLAKE3, and final byte comparison if the domain requires absolute certainty.

Pitfall: Ignoring I/O and memory behavior.

A tempting but weak answer is “read both files into memory and compare strings.” That fails for multi-gigabyte files, remote object stores, and high-concurrency services. Strong candidates talk about chunked reads, backpressure, bounded memory, and whether the bottleneck is disk, network, or CPU.

Pitfall: Blurring product semantics with implementation.

When asked to ignore metadata, some candidates still include path, filename, or mtime in their equality check. Others overcorrect and claim two PDFs with different embedded timestamps are identical without defining normalization. The right move is to clarify which metadata is external filesystem metadata versus bytes inside the file format.

Connections

Interviewers may pivot from here into content-addressable storage, deduplication indexes, Merkle trees, object storage consistency, or idempotent uploads. They may also ask for a deeper algorithmic comparison between direct byte comparison, whole-file hashing, partial hashing, and chunk-level hashing.

What's being tested

Core knowledge

Worked example

A second angle

Common pitfalls

Connections

Further reading

Featured in interview prep guides

Practice questions

Related concepts