File Deduplication And Content Hashing
Asked of: Software Engineer

What's being tested
This tests content-based duplicate detection under real filesystem constraints: recursive traversal, streaming I/O, hashing, collision handling, and memory-aware grouping. Strong answers show a staged algorithm that avoids reading every byte unnecessarily while still proving duplicates by content.
Patterns & templates
- Recursive filesystem traversal with os.walk, scandir, or an explicit stack: O(files + dirs) metadata pass; handle permissions, symlinks, and cycles.
- Size-first bucketing: group by file size before hashing; files with unique sizes cannot be duplicates, reducing I/O dramatically.
- Partial hash then full hash: hash first/last chunks before full content; improves average case while preserving final exact verification.
- Streaming hash computation using sha256.update(chunk): O(total_bytes) time, O(chunk_size) memory; never load large files fully.
- Collision-safe comparison: hash groups identify candidates, then byte-compare files or use cryptographic hashes plus optional verification.
- Chunk-based deduplication for large files: fixed-size or content-defined chunking with rolling hashes; useful when files share regions but differ globally.
- Parallel I/O pipeline: worker pool for hashing candidate buckets; bound concurrency to avoid disk thrashing and excessive open file descriptors.
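The first four patterns can be combined into one staged pass. The following is a minimal sketch, not a reference implementation: the function names (find_duplicate_groups, partial_hash, full_hash), the 64 KiB chunk size, and the choice to skip symlinks and non-regular files are all assumptions made for illustration.

```python
import hashlib
import os
import stat
from collections import defaultdict

CHUNK = 1 << 16  # 64 KiB read size; an arbitrary choice for this sketch


def full_hash(path, chunk_size=CHUNK):
    """Stream the file through SHA-256 so memory stays O(chunk_size)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()


def partial_hash(path, size, chunk_size=CHUNK):
    """Hash only the first and last chunk: a cheap pre-filter, not proof."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        h.update(f.read(chunk_size))
        if size > chunk_size:
            # Start the tail read after the head so the two never overlap.
            f.seek(max(size - chunk_size, chunk_size))
            h.update(f.read(chunk_size))
    return h.hexdigest()


def _regroup(paths, key_fn):
    """Bucket paths by key_fn and keep only buckets with 2+ members."""
    groups = defaultdict(list)
    for p in paths:
        try:
            groups[key_fn(p)].append(p)
        except OSError:
            continue  # unreadable or vanished file: skip it
    return [g for g in groups.values() if len(g) > 1]


def find_duplicate_groups(root):
    """Return lists of paths whose full SHA-256 digests match."""
    by_size = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):  # does not follow symlinked dirs
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue
            if stat.S_ISREG(st.st_mode):  # skips symlinks, FIFOs, devices
                by_size[st.st_size].append(path)

    result = []
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue  # unique size: never opened, never read
        for group in _regroup(paths, lambda p: partial_hash(p, size)):
            result.extend(_regroup(group, full_hash))
    return result
```

Each stage only sees files that survived the previous one, so a file is read in full only when another file shares both its size and its partial hash. A final byte comparison can still be layered on top if digest equality is not considered sufficient proof.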
Common pitfalls
Pitfall: Hashing every file immediately ignores the easy size -> candidates -> hash -> verify pruning pipeline and wastes I/O.
Pitfall: Treating hashes as proof of equality without discussing collisions is incomplete; mention cryptographic hashes and final byte comparison.
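A final byte comparison might look like the sketch below. It relies on the standard library's filecmp.cmp with shallow=False, which compares file contents rather than only os.stat() signatures; the helper name verify_group is hypothetical.

```python
import filecmp


def verify_group(paths):
    """Split a hash-candidate group into classes of byte-identical files."""
    classes = []
    for p in paths:
        for cls in classes:
            # Compare against one representative of each existing class.
            if filecmp.cmp(cls[0], p, shallow=False):
                cls.append(p)
                break
        else:
            classes.append([p])
    # Only classes with at least two members are confirmed duplicates.
    return [c for c in classes if len(c) > 1]
```

For a group whose members really are identical this costs one comparison per file, since every file matches the first representative.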
Pitfall: Following symlinks blindly can create cycles or duplicate paths to the same inode; track (device, inode) when needed.
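One way to track (device, inode) during traversal is sketched below, assuming a POSIX-like filesystem where st_dev and st_ino are meaningful. The generator name walk_unique_files is hypothetical, and following symlinks at all is a design choice made here only to show the cycle guard.

```python
import os


def walk_unique_files(root):
    """Yield (path, size) once per underlying file, following symlinks safely."""
    seen_dirs = set()   # (st_dev, st_ino) of directories already expanded
    seen_files = set()  # (st_dev, st_ino) of files already reported
    stack = [root]      # explicit stack instead of recursion
    while stack:
        current = stack.pop()
        try:
            st = os.stat(current)  # follows symlinks to the real directory
        except OSError:
            continue
        key = (st.st_dev, st.st_ino)
        if key in seen_dirs:
            continue  # symlink cycle or second path to the same directory
        seen_dirs.add(key)
        try:
            with os.scandir(current) as it:
                entries = list(it)
        except OSError:
            continue  # e.g. permission denied
        for entry in entries:
            try:
                if entry.is_dir(follow_symlinks=True):
                    stack.append(entry.path)
                elif entry.is_file(follow_symlinks=True):
                    fst = os.stat(entry.path)
                    fkey = (fst.st_dev, fst.st_ino)
                    if fkey not in seen_files:
                        # Hard links and symlinks to the same inode are
                        # reported only under the first path encountered.
                        seen_files.add(fkey)
                        yield entry.path, fst.st_size
            except OSError:
                continue
```

Deduplicating by inode before hashing also prevents a hard-linked file from being reported as a duplicate of itself.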
Practice these
The practice cards below cover the canonical variants — solve all of them and time yourself.
Practice questions
- Find Duplicate Files · Anthropic · Software Engineer · Onsite · Medium
- Find duplicate files and apply image operations · Anthropic · Software Engineer · Technical Screen · Hard
- Implement file deduplication at scale · Anthropic · Software Engineer · Onsite · Medium
- Design file deduplication algorithm · Anthropic · Software Engineer · Technical Screen · Medium
- Design production-ready dedup service · Anthropic · Software Engineer · Technical Screen · Hard
- Detect duplicate files efficiently · Anthropic · Software Engineer · Onsite · Medium
- Implement file deduplication at scale · Anthropic · Software Engineer · Onsite · Medium
- Design high-throughput hashing for kernels · Anthropic · Software Engineer · Onsite · Medium
- Design file deduplication across nested directories · Anthropic · Software Engineer · Technical Screen · Medium
- Identify and mitigate deduplication program risks · Anthropic · Software Engineer · Technical Screen · Hard
- Implement crawler, dedup, and persistent LRU · Anthropic · Software Engineer · Onsite · Medium
- Detect duplicate files by content · Anthropic · Software Engineer · Onsite · Medium
Related concepts
- Hashing-Based File Identity · System Design
- In-Memory File System · Coding & Algorithms
- Consistent Hashing · Coding & Algorithms
- Arrays, Strings, Hash Maps, And Frequency Counting · Coding & Algorithms
- Storage, Indexing, APIs, And Secure Execution · System Design
- Hash Map Counting And Frequency Analysis · Coding & Algorithms