This question evaluates understanding of scalable file-deduplication and system-design concepts, including content hashing, I/O minimization, hash-collision handling, memory constraints, parallelization, incremental updates, and cross-machine deduplication.

You are given access to a very large file system, with the ability to list file paths and read file contents. Design an algorithm to identify groups of byte-identical files (true duplicates) across all directories, returning groups of file paths where each group contains two or more duplicates. Discuss an approach that filters candidates by file size first, then applies partial hashing, then full hashing to minimize I/O; explain how to handle hash collisions, memory constraints, and parallelization. Extend the design to support incremental updates as files are added or modified, deduplication across machines, and safe replacement of duplicates with hard links or content-addressed storage.
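For reference, the following is a minimal single-machine sketch of the size → partial-hash → full-hash pipeline a candidate might describe. Names such as find_duplicates, _hash_file, and PARTIAL_READ_BYTES are illustrative assumptions, not part of the prompt; the sketch omits incremental updates, cross-machine coordination, and safe replacement.

```python
import hashlib
import os
from collections import defaultdict

# Illustrative constant: how many leading bytes the partial hash reads.
PARTIAL_READ_BYTES = 4096

def _hash_file(path, partial=False, chunk_size=1 << 20):
    """Return a SHA-256 hex digest of the file at `path`.

    With partial=True, only the first PARTIAL_READ_BYTES are hashed,
    which is enough to split most same-size files cheaply.
    """
    h = hashlib.sha256()
    with open(path, "rb") as f:
        if partial:
            h.update(f.read(PARTIAL_READ_BYTES))
        else:
            while chunk := f.read(chunk_size):
                h.update(chunk)
    return h.hexdigest()

def find_duplicates(root):
    """Yield lists of paths whose contents are byte-identical.

    Three filtering stages, each more expensive than the last:
      1. group by file size (stat only, no reads),
      2. within a size group, group by a hash of the first few KB,
      3. within a partial-hash group, group by a full-content hash.
    """
    by_size = defaultdict(list)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                by_size[os.path.getsize(path)].append(path)
            except OSError:
                continue  # skip unreadable or vanished files

    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        by_partial = defaultdict(list)
        for path in paths:
            by_partial[_hash_file(path, partial=True)].append(path)

        for candidates in by_partial.values():
            if len(candidates) < 2:
                continue
            by_full = defaultdict(list)
            for path in candidates:
                by_full[_hash_file(path)].append(path)
            for group in by_full.values():
                if len(group) > 1:
                    yield group

if __name__ == "__main__":
    for group in find_duplicates("/data"):  # "/data" is a placeholder root
        print(group)
```

A strong answer would go beyond this sketch: add an optional byte-by-byte comparison within each full-hash group if hash collisions must be ruled out absolutely, bound memory by spilling size or hash buckets to disk, and parallelize by distributing size buckets across worker processes or machines.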