Determine identical files ignoring metadata

This question evaluates competency in hashing-based file deduplication, collision risk and mitigation, performance trade-offs (CPU vs I/O vs memory), and the design of scalable persistent hash indexes within the System Design domain.

Question

Determine Byte-Identical Files Using Hashing (Ignore Filesystem Metadata)

Context

Assume we care only about the raw file contents (the byte stream read via open/read). Ignore filesystem metadata such as timestamps, permissions, and extended attributes. The goal is to determine if two files have identical bytes and to scale the approach to large repositories.

Tasks

Propose a hashing-based approach to determine byte-identical files. Cover:
- Algorithm choices: cryptographic vs non-cryptographic hashes, and, among cryptographic ones, collision-resistant vs broken (e.g., SHA-256 vs MD5).
- A multi-stage verification plan: size check → fast hash → cryptographic hash → optional byte-by-byte confirmation.
- Collision handling and false-positive mitigation.
- Performance trade-offs (CPU vs I/O, memory, parallelism).
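
The multi-stage plan above can be sketched roughly as follows. This is a minimal illustration, not a reference solution: the function names (`fast_hash`, `sha256_file`, `find_duplicates`), the 1 MiB read size, and the use of CRC32 over the first block as the "fast hash" are all assumptions chosen because they need only the Python standard library.

```python
import filecmp
import hashlib
import os
import zlib
from collections import defaultdict

CHUNK = 1 << 20  # read size in bytes; an assumed tuning value


def fast_hash(path, limit=CHUNK):
    """CRC32 of the first `limit` bytes: a cheap, non-cryptographic prefilter."""
    with open(path, "rb") as f:
        return zlib.crc32(f.read(limit))


def sha256_file(path):
    """Stream the whole file through SHA-256 without loading it into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(CHUNK), b""):
            digest.update(block)
    return digest.hexdigest()


def _group(paths, key):
    """Bucket paths by key(path) and keep only buckets with 2+ members."""
    groups = defaultdict(list)
    for p in paths:
        groups[key(p)].append(p)
    return [g for g in groups.values() if len(g) > 1]


def find_duplicates(paths, confirm_bytes=True):
    """Return lists of paths whose contents are byte-identical."""
    result = []
    for by_size in _group(paths, os.path.getsize):          # stage 1: size
        for by_fast in _group(by_size, fast_hash):          # stage 2: fast hash
            for by_sha in _group(by_fast, sha256_file):     # stage 3: SHA-256
                if not confirm_bytes:
                    result.append(sorted(by_sha))
                    continue
                # Stage 4 (optional): byte-by-byte check against a representative.
                remaining = by_sha
                while remaining:
                    head, rest = remaining[0], remaining[1:]
                    same = [head] + [
                        p for p in rest if filecmp.cmp(head, p, shallow=False)
                    ]
                    if len(same) > 1:
                        result.append(sorted(same))
                    remaining = [p for p in rest if p not in same]
    return result
```

Each stage only runs on files that survived the previous one, so the expensive full-file reads are limited to candidates that already match on size and on the cheap prefix hash. Only file contents and sizes are consulted; timestamps, permissions, and extended attributes never enter the comparison.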
Extend to large-scale operation (petabytes of data):
- Whole-file deduplication vs chunked deduplication (fixed-size vs content-defined chunking).
- Streaming/parallel computation.
- Persistent hash index design.
- How a dedicated service would compute, store, and serve comparisons reliably.
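
To make the fixed-size vs content-defined chunking distinction concrete, here is a small sketch under stated assumptions. The content-defined chunker uses a simplified Gear-style rolling hash (the kind of hash FastCDC builds on, without its refinements); the table seed, the size parameters, and the names `fixed_chunks`, `cdc_chunks`, and `chunk_ids` are hypothetical choices for illustration only.

```python
import hashlib
import random

_rng = random.Random(2024)  # fixed seed so chunk boundaries are reproducible
GEAR = [_rng.getrandbits(64) for _ in range(256)]  # one random value per byte
MASK64 = (1 << 64) - 1


def fixed_chunks(data, size=8192):
    """Split at fixed offsets: boundaries depend on position, not content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def cdc_chunks(data, min_size=2048, avg_size=8192, max_size=32768):
    """Split where a rolling hash of recent bytes matches a bit pattern.

    avg_size must be a power of two; it sets the cut probability per byte
    rather than an exact chunk length.
    """
    mask = avg_size - 1
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Older bytes are shifted out of the 64-bit state, so the hash
        # reflects only the most recent bytes.
        h = ((h << 1) + GEAR[byte]) & MASK64
        length = i - start + 1
        if (length >= min_size and (h & mask) == 0) or length >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def chunk_ids(chunks):
    """Content address for each chunk, usable as a key in a hash index."""
    return {hashlib.sha256(c).hexdigest() for c in chunks}
```

If a single byte is inserted at the front of a file, every fixed-size chunk shifts and its hash changes, whereas content-defined boundaries tend to realign after the first chunk, so most chunk hashes are still found in the index. That resilience to insertions is the usual argument for content-defined chunking, at the cost of extra CPU per byte.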
