This question evaluates a candidate's proficiency with file deduplication algorithms, hashing and signature strategies, I/O-efficient streaming, in-memory indexing, collision detection, and handling file metadata such as permissions and timestamps.
Your filesystem contains millions of photos. Duplicates are strictly byte-identical files (no ML/CV similarity). Design an algorithm to detect and delete duplicates efficiently on a single machine. Specify: how you compute and store per-file signatures (e.g., a full hash vs. a size + partial-hash + full-hash cascade, with streaming I/O); how the in-memory key–value store maps signatures to canonical file paths; how you handle hash collisions and verify matches before deletion; how you treat files with identical names but different content, permissions, or timestamps; and the big-O time and space complexity and I/O considerations. Finally, provide pseudocode for a function that returns the set of file paths safe to delete.
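One possible shape for an answer, offered as a minimal sketch rather than a reference solution: group files by size, then by a hash of the first few kilobytes, then by a full streamed hash, and finally confirm each match with a byte-for-byte comparison before marking a path as deletable. The names `find_deletable`, `_hash_file`, and `_group`, the constants `CHUNK` and `HEAD`, the choice of SHA-256, and the rule that the lexicographically smallest path becomes the canonical copy are all assumptions made for illustration. File names, permissions, and timestamps are deliberately ignored here, since only content defines a duplicate in this problem.

```python
import filecmp
import hashlib
import os
from collections import defaultdict

CHUNK = 1 << 20    # streaming read size in bytes (assumed value)
HEAD = 64 * 1024   # bytes covered by the cheap partial signature (assumed value)


def _hash_file(path, limit=None):
    """Stream a file through SHA-256, stopping after `limit` bytes if given."""
    h = hashlib.sha256()
    remaining = limit
    with open(path, "rb") as f:
        while remaining is None or remaining > 0:
            n = CHUNK if remaining is None else min(CHUNK, remaining)
            block = f.read(n)
            if not block:
                break
            h.update(block)
            if remaining is not None:
                remaining -= len(block)
    return h.digest()


def _group(paths, key):
    """Bucket paths by key(path); keep only buckets with two or more members."""
    buckets = defaultdict(list)
    for p in paths:
        buckets[key(p)].append(p)
    return [b for b in buckets.values() if len(b) > 1]


def find_deletable(root):
    """Return paths that are byte-identical to a canonical file that is kept."""
    files = []
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            p = os.path.join(dirpath, name)
            # Skip symlinks and special files; only regular files are candidates.
            if os.path.isfile(p) and not os.path.islink(p):
                files.append(p)

    deletable = set()
    # Stage 1: size (metadata only). Stage 2: partial hash. Stage 3: full hash.
    for same_size in _group(files, os.path.getsize):
        for same_head in _group(same_size, lambda p: _hash_file(p, HEAD)):
            for same_hash in _group(same_head, _hash_file):
                # Stage 4: byte-for-byte check guards against hash collisions.
                # A bucket may hold several distinct contents if hashes collide,
                # so keep one canonical path per verified content.
                canonicals = []
                for p in sorted(same_hash):
                    for c in canonicals:
                        if filecmp.cmp(c, p, shallow=False):
                            deletable.add(p)
                            break
                    else:
                        canonicals.append(p)
    return deletable
```

Under these assumptions, the size pass touches only metadata for all n files, and content is read only for files that share a size with at least one other file. In the worst case (every file the same size and the same leading bytes) the full contents of all files are read, so I/O is proportional to the total bytes among candidates, while memory holds the path list plus one digest per candidate. Two files with the same name but different content never share a final bucket, so neither is reported.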