PracHub
QuestionsCoachesLearningGuidesInterview Prep

Quick Overview

This question evaluates a candidate's competence in designing and implementing scalable file deduplication systems, covering algorithmic hashing strategies, efficient disk I/O and batching, parallelization across cores or machines, large-file chunking, and safe filesystem operations like hard-linking.

  • Medium
  • Anthropic
  • Coding & Algorithms
  • Software Engineer

Implement file deduplication at scale

Company: Anthropic

Role: Software Engineer

Category: Coding & Algorithms

Difficulty: Medium

Interview Round: Onsite

Write a program to deduplicate files in a very large directory tree. Identify groups of identical files without loading entire files into memory. Outline your approach to hashing (e.g., size filter, partial hash, full hash), chunking for very large files, and handling hash collisions. Support a mode that replaces duplicates with hard links (when safe) and a dry-run report of duplicate sets. Explain time and space complexity, how you batch disk I/O, and how you would parallelize across CPU cores or machines.

Quick Answer: This question evaluates a candidate's competence in designing and implementing scalable file deduplication systems, covering algorithmic hashing strategies, efficient disk I/O and batching, parallelization across cores or machines, large-file chunking, and safe filesystem operations like hard-linking.

Implement `solution(files, mode)`, a file-deduplication pipeline that finds files with identical content and, in `link` mode, reports the hard-link operations to deduplicate them. **Input.** `files` is a list of 4-tuples `(path, device_id, inode_id, chunks)`. `path` is unique; `device_id` is the file's device; `inode_id` is its inode (files sharing an `inode_id` are already hard-linked); `chunks` is a list of strings whose concatenation in order is the content (chunks may be empty; the list may be empty). `mode` is `'report'` or `'link'`. Walk the chunks directly; never concatenate a whole file into one big string. **Duplicates.** Two files are duplicates when their reconstructed content is exactly equal (chunk boundaries don't matter: `['hello','world']` equals `['helloworld']`). A duplicate **set** is a maximal collection of 2+ content-equal files; unmatched files aren't reported. **Output.** Return a 2-tuple `(groups, links)`. `groups`: a list of duplicate sets; each set is its paths sorted lexicographically; only sets of 2+ are included; sets are ordered by their smallest (first) path. `links`: `[]` in `'report'` mode. In `'link'` mode, per set, handle each device separately: links are valid only within the same `device_id`. On a device, the lexicographically smallest path among that set's files there is the canonical target; for every other file on that device, emit `(path, canonical_path)` only if its `inode_id` differs from the canonical's. Sort the final `links` ascending. **Constraints.** `0 <= len(files) <= 200000`; `0 <= total characters across all chunks <= 1000000`; each `path` is unique; `mode` is `'report'` or `'link'`. **Example.** `[('/a',1,100,['hello','world']),('/b',1,101,['helloworld']),('/c',2,200,['hello','world']),('/d',1,102,['hello','there'])]` in `'link'` mode returns `([['/a','/b','/c']], [('/b','/a')])`.

Constraints

  • 0 <= len(files) <= 200000
  • 0 <= total number of characters across all chunks <= 1000000
  • Each `path` is unique
  • `mode` is either `'report'` or `'link'`

Examples

Input: ([('/a', 1, 100, ['hello', 'world']), ('/b', 1, 101, ['helloworld']), ('/c', 2, 200, ['hello', 'world']), ('/d', 1, 102, ['hello', 'there'])], 'link')

Expected Output: ([['/a', '/b', '/c']], [('/b', '/a')])

Explanation: Files /a, /b, and /c have identical content even though /a and /b use different chunk boundaries. Only /b can be hard-linked to /a because /c is on a different device.

Input: ([('/e', 1, 10, []), ('/f', 1, 11, []), ('/g', 1, 20, ['x']), ('/h', 1, 20, ['x']), ('/i', 1, 21, ['x'])], 'link')

Expected Output: ([['/e', '/f'], ['/g', '/h', '/i']], [('/f', '/e'), ('/i', '/g')])

Explanation: Empty files /e and /f are duplicates, so /f can link to /e. Files /g, /h, and /i are duplicates; /h already shares /g's inode, so only /i needs a hard-link operation.

Input: ([('/m1', 1, 1, ['PPPPPPPPPPPPPPPP', 'A', 'SSSSSSSSSSSSSSSS']), ('/m2', 1, 2, ['PPPPPPPPPPPPPPPPB', 'SSSSSSSSSSSSSSSS']), ('/m3', 1, 3, ['PPPPPPPPPPPPPPPPASSSSSSSSSSSSSSSS'])], 'report')

Expected Output: ([['/m1', '/m3']], [])

Explanation: All three files have the same size and the same first and last 16 characters, so a cheap partial fingerprint alone would collide. Full hashing and final verification show that only /m1 and /m3 are truly identical.

Input: ([], 'report')

Expected Output: ([], [])

Explanation: With no files, there are no duplicate groups and no link operations.

Input: ([('/z2', 2, 2, ['dup']), ('/z1', 1, 1, ['dup']), ('/z3', 1, 3, ['dup']), ('/z4', 2, 4, ['dup'])], 'link')

Expected Output: ([['/z1', '/z2', '/z3', '/z4']], [('/z3', '/z1'), ('/z4', '/z2')])

Explanation: All four files are duplicates, but hard links must be planned per device. On device 1, /z1 is the canonical target; on device 2, /z2 is the canonical target.

Hints

  1. Use increasingly expensive filters: size, then a small content-based prefix/suffix fingerprint, then a full streaming hash.
  2. A matching full hash is not enough for absolute correctness. Compare the actual streamed contents inside a full-hash bucket to handle collisions and different chunk boundaries.
Last updated: Apr 20, 2026

Loading coding console...

PracHub

Master your tech interviews with 8,000+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • AI Coding Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.

Related Coding Questions

  • Implement a Banking System - Anthropic (medium)
  • Implement Persistent Memoization LRU Cache - Anthropic (hard)
  • Fix a Corrupted Bootloader Instruction - Anthropic (medium)
  • Implement a Time-Aware Task Manager - Anthropic (hard)
  • Implement Task Management and Duplicate Detection - Anthropic (medium)