File Deduplication And Content Hashing

What's being tested

This tests content-based duplicate detection under real filesystem constraints: recursive traversal, streaming I/O, hashing, collision handling, and memory-aware grouping. Strong answers show a staged algorithm that avoids reading every byte unnecessarily while still proving duplicates by content.

Patterns & templates

Recursive filesystem traversal with os.walk, os.scandir, or an explicit stack — O(files + dirs) metadata pass; handle permission errors, symlinks, and cycles.
Size-first bucketing — group by file size before hashing; files with unique sizes cannot be duplicates, reducing I/O dramatically.
Partial hash then full hash — hash first/last chunks before full content; improves average case while preserving final exact verification.
Streaming hash computation using hashlib.sha256() with repeated update(chunk) calls — O(total_bytes) time, O(chunk_size) memory; never load large files fully.
Collision-safe comparison — hash groups identify candidates, then byte-compare files or use cryptographic hashes plus optional verification.
Chunk-based deduplication for large files — fixed-size or content-defined chunking with rolling hashes; useful when files share regions but differ globally.
Parallel I/O pipeline — worker pool for hashing candidate buckets; bound concurrency to avoid disk thrashing and excessive open file descriptors.
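
The first four patterns can be combined into one staged filter. The following is a minimal sketch, not a definitive implementation: the function names (full_hash, partial_hash, refine, find_duplicate_candidates) and the 64 KiB CHUNK size are assumptions chosen for illustration. It skips symlinks and non-regular files, and it returns hash-matched candidate groups without a final byte comparison.

```python
import hashlib
import os
from collections import defaultdict

CHUNK = 64 * 1024  # read size in bytes; an arbitrary choice for this sketch


def full_hash(path):
    """Stream the whole file through SHA-256 using O(CHUNK) memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(CHUNK), b""):
            digest.update(chunk)
    return digest.hexdigest()


def partial_hash(path):
    """Hash only the first and last CHUNK bytes as a cheap pre-filter."""
    size = os.path.getsize(path)
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        digest.update(fh.read(CHUNK))
        if size > CHUNK:
            # Start the tail read after the head so the two reads never overlap.
            fh.seek(max(size - CHUNK, CHUNK))
            digest.update(fh.read(CHUNK))
    return digest.hexdigest()


def refine(groups, key_fn):
    """Split each group by key_fn and keep only sub-groups with 2+ files."""
    refined = []
    for group in groups:
        buckets = defaultdict(list)
        for path in group:
            try:
                buckets[key_fn(path)].append(path)
            except OSError:
                continue  # unreadable or vanished file: skip it
        refined.extend(g for g in buckets.values() if len(g) > 1)
    return refined


def find_duplicate_candidates(root):
    """Return groups of paths whose size, partial hash, and full hash all match."""
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # Skip symlinks and anything that is not a regular file.
                if os.path.islink(path) or not os.path.isfile(path):
                    continue
                by_size[os.path.getsize(path)].append(path)
            except OSError:
                continue
    groups = [g for g in by_size.values() if len(g) > 1]  # stage 1: size
    groups = refine(groups, partial_hash)                 # stage 2: head + tail
    return refine(groups, full_hash)                      # stage 3: full content
```

Each stage reads more bytes than the previous one but runs on fewer files, so only files that still look identical after the cheap checks pay for a full read.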

Common pitfalls

Pitfall: Hashing every file immediately ignores the easy size -> candidates -> hash -> verify pruning pipeline and wastes I/O.

Pitfall: Treating hashes as proof of equality without discussing collisions is incomplete; mention cryptographic hashes and final byte comparison.
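
One way to close the collision gap is a final byte comparison within each hash group. The sketch below assumes a hypothetical helper named confirm_duplicates that receives paths already sharing a size and hash; it relies on the standard library's filecmp.cmp with shallow=False, which compares file contents rather than trusting os.stat() signatures.

```python
import filecmp


def confirm_duplicates(candidates):
    """Partition candidate paths into classes of byte-identical files."""
    classes = []  # each entry is a list of paths known to be identical
    for path in candidates:
        for cls in classes:
            # Compare against one representative of each known class.
            if filecmp.cmp(cls[0], path, shallow=False):
                cls.append(path)
                break
        else:
            classes.append([path])  # no match: start a new class
    return [cls for cls in classes if len(cls) > 1]
```

Because a hash group rarely holds more than one real equivalence class, the comparison count is usually linear in the group size. filecmp caches results keyed on stat signatures, so a long-running process that sees files change may want to call filecmp.clear_cache() between scans.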

Pitfall: Following symlinks blindly can create cycles or duplicate paths to the same inode; track (device, inode) when needed.
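
A traversal that records (device, inode) pairs might look like the sketch below. The name walk_unique_files and its follow_symlinks parameter are assumptions for illustration. The approach also assumes a POSIX-like filesystem, since on Windows the stat result from os.scandir entries may report zero for st_dev and st_ino.

```python
import os


def walk_unique_files(root, follow_symlinks=True):
    """Yield one path per distinct regular file, expanding each directory once."""
    seen_dirs = set()   # (st_dev, st_ino) of directories already expanded
    seen_files = set()  # (st_dev, st_ino) of files already yielded
    stack = [root]      # explicit stack instead of recursion
    while stack:
        current = stack.pop()
        try:
            st = os.stat(current)  # follows symlinks, so this is the real directory
        except OSError:
            continue
        key = (st.st_dev, st.st_ino)
        if key in seen_dirs:
            continue  # reached again through a symlink cycle or second path
        seen_dirs.add(key)
        try:
            with os.scandir(current) as entries:
                for entry in entries:
                    try:
                        if entry.is_symlink() and not follow_symlinks:
                            continue
                        if entry.is_dir(follow_symlinks=follow_symlinks):
                            stack.append(entry.path)
                        elif entry.is_file(follow_symlinks=follow_symlinks):
                            est = entry.stat(follow_symlinks=follow_symlinks)
                            fkey = (est.st_dev, est.st_ino)
                            if fkey not in seen_files:
                                seen_files.add(fkey)
                                yield entry.path
                    except OSError:
                        continue  # entry vanished or is unreadable
        except OSError:
            continue  # permission denied on the directory itself
```

Hard links and symlinks to an already-seen file are reported once, under whichever path the traversal reaches first, so they never show up later as false duplicates.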

Practice these

The practice cards below cover the canonical variants — solve all of them and time yourself.
