This question evaluates competency in hashing-based file deduplication, collision risk and mitigation, performance trade-offs (CPU vs I/O vs memory), and the design of scalable persistent hash indexes within the System Design domain.
Assume we care only about the raw file contents (the byte stream read via open/read). Ignore filesystem metadata such as timestamps, permissions, and extended attributes. The goal is to determine whether two files have identical bytes, and to scale the approach to large repositories.
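As a starting point, a minimal sketch of byte-identity checking via content hashing: stream each file in fixed-size chunks through SHA-256 (so memory stays bounded regardless of file size) and use a file-size pre-check to skip hashing when sizes differ. The function names (`file_digest`, `same_contents`) and the chunk size are illustrative assumptions, not part of the question.

```python
import hashlib
from pathlib import Path

def file_digest(path, chunk_size=1 << 20):
    """Hash the raw byte stream of a file in 1 MiB chunks (illustrative size)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def same_contents(a, b):
    """Return True if the two files contain identical bytes."""
    # Cheap I/O-free pre-check: files of different sizes can never match.
    if Path(a).stat().st_size != Path(b).stat().st_size:
        return False
    # Equal digests imply equal contents, modulo SHA-256 collision risk,
    # which the question asks you to analyze and mitigate.
    return file_digest(a) == file_digest(b)
```

Note the trade-off this sketch already exposes: the size pre-check costs one metadata read but avoids hashing both files (a full I/O pass) in the common case where sizes differ, which is the kind of CPU-vs-I/O reasoning the question targets.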