Determine if two files are identical

This is a System Design interview question from Harvey AI for Software Engineer roles.

Question

Determine Whether Two Files Are Identical at Scale

Context

You need a robust, scalable method to decide if two files (potentially very large and located across different directories or machines) are bitwise identical. The solution should minimize CPU, memory, I/O, and network costs while keeping collision risk negligible and supporting a deduplication workflow.

Requirements

Describe a step-by-step approach that covers:

Choice of hashing algorithms (e.g., MD5 vs SHA-256) and the rationale.
How to mitigate collision risk and when to fall back to byte-by-byte verification.
Strategies to stream and hash very large files efficiently.
What metadata (e.g., size, modification time) you would check before hashing.
How to integrate this into a deduplication workflow across directories or machines.
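
One possible shape for the local part of such an approach is sketched below in Python. It is an illustration under stated assumptions, not the reference solution: the function names (sha256_of, files_identical, find_duplicates) and the 1 MiB chunk size are hypothetical choices made for this example. SHA-256 is used because MD5 collisions can be constructed deliberately; size is the only metadata treated as decisive, since equal modification times do not prove equal content.

```python
import filecmp
import hashlib
import os
from collections import defaultdict

CHUNK_SIZE = 1024 * 1024  # 1 MiB per read; an assumed tuning value


def sha256_of(path, chunk_size=CHUNK_SIZE):
    """Stream a file through SHA-256 so memory use stays constant."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def files_identical(path_a, path_b, verify_bytes=True):
    """Return True if the two files have identical content."""
    # Step 1: cheap metadata check. Files of different size cannot match.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    # Step 2: compare streamed digests. Across machines, only the
    # digests would need to travel over the network.
    if sha256_of(path_a) != sha256_of(path_b):
        return False
    # Step 3: optional byte-by-byte confirmation for callers that
    # cannot accept even a negligible collision risk.
    if verify_bytes:
        return filecmp.cmp(path_a, path_b, shallow=False)
    return True


def find_duplicates(paths):
    """Group paths by (size, digest); return groups with 2+ members."""
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    by_digest = defaultdict(list)
    for size, group in by_size.items():
        if len(group) < 2:
            continue  # a unique size never needs hashing
        for path in group:
            by_digest[(size, sha256_of(path))].append(path)

    return [group for group in by_digest.values() if len(group) > 1]
```

When both files sit on the same machine, hashing and then re-reading for the byte comparison costs extra I/O, so a direct byte comparison after the size check may be cheaper there. The digest step pays off mainly when the files are on different machines or when digests are cached and reused across a deduplication run.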
