
Determine identical files ignoring metadata

Last updated: Jun 15, 2026

Quick Overview

A Harvey AI system design screen on determining whether two files are byte-identical while ignoring metadata, then scaling to deduplication. It covers hashing algorithm choice (SHA-256/BLAKE3 vs MD5/xxHash), multi-stage verification, collision and false-positive mitigation, streaming and parallel hashing, content-defined chunking, and persistent hash-index/service design.



Company: Harvey AI

Role: Software Engineer

Category: System Design

Difficulty: hard

Interview Round: Technical Screen


Sep 6, 2025, 12:00 AM
Question

How would you determine whether two files contain identical bytes while ignoring all filesystem metadata? Propose and justify a hashing-based approach, then extend it to large-scale, cross-machine deduplication.

Address the following:

  1. Algorithm choice and rationale. Cryptographic versus non-cryptographic hashing (e.g., SHA-256 vs MD5 vs BLAKE3 vs xxHash); why you would or would not use each, and which is authoritative.
  2. Multi-stage verification. A pipeline that combines a size check, an optional fast prefilter hash, an authoritative cryptographic hash, and an optional byte-by-byte confirmation — and when each stage runs.
  3. Collision handling and false-positive mitigation. Birthday-bound reasoning, when collisions matter, and how to defend against adversarial (chosen-prefix) inputs versus non-adversarial data.
  4. Metadata to check before hashing. What attributes (size, mtime/ctime, inode/device id, file type) you would use to prefilter, detect hard links, and cache digests to avoid rehashing unchanged files.
  5. Efficient streaming for very large files. Chunked/streaming reads, buffer sizing, intra-file parallelism (e.g., Merkle trees / BLAKE3), sparse-file handling, and TOCTOU guardrails for files that change mid-read.
  6. Large-scale deduplication. Whole-file vs fixed-size vs content-defined chunking (CDC); persistent hash index design; and how a dedicated service would compute, store, and serve comparisons reliably across directories and machines.
  7. Performance trade-offs. Where the bottleneck lies (I/O vs CPU vs memory) and how your design responds to it.
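As a starting point for discussion, the multi-stage pipeline in items 2, 4, and 5 can be sketched in Python. This is a minimal illustration under stated assumptions, not a reference solution: the function names (`sha256_of`, `bytes_equal`, `same_content`) and the 1 MiB buffer size are hypothetical choices, SHA-256 stands in as the authoritative hash, the optional fast prefilter hash is omitted, and no TOCTOU guard is included.

```python
import hashlib
import os

CHUNK_SIZE = 1 << 20  # 1 MiB read buffer; an assumed, tunable value


def sha256_of(path, chunk_size=CHUNK_SIZE):
    """Stream a file through SHA-256 so memory use stays constant."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def bytes_equal(path_a, path_b, chunk_size=CHUNK_SIZE):
    """Optional final stage: compare the two files chunk by chunk."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if a != b:
                return False
            if not a:  # both reads hit end-of-file together
                return True


def same_content(path_a, path_b, confirm=False):
    """Return True if both paths hold identical bytes, ignoring metadata."""
    # Stage 0: same device and inode (hard link or same path) means no I/O needed.
    if os.path.samefile(path_a, path_b):
        return True
    # Stage 1: files of different sizes cannot be byte-identical.
    if os.stat(path_a).st_size != os.stat(path_b).st_size:
        return False
    # Stage 2: authoritative cryptographic hash over the full content.
    if sha256_of(path_a) != sha256_of(path_b):
        return False
    # Stage 3 (optional): byte-by-byte confirmation for callers that want it.
    return bytes_equal(path_a, path_b) if confirm else True
```

For a one-off comparison of two local files, going straight to the byte-by-byte stage after the size check reads each file once and can stop at the first difference. The hashing stage pays off when digests are cached or compared across machines, which is the setting items 4 and 6 ask about.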

