PracHub

Quick Overview

This question evaluates the ability to parse file-system-style input, group files by identical content, and manage deduplication using appropriate data structures and algorithms.

  • hard
  • Anthropic
  • Coding & Algorithms
  • Software Engineer

Find Duplicate Files by Content

Company: Anthropic

Role: Software Engineer

Category: Coding & Algorithms

Difficulty: hard

Interview Round: Onsite

You are given a list of directory descriptions. Each description has the following format:

`root_directory file_name_1(content_1) file_name_2(content_2) ...`

For example:

`/data/a 1.txt(abcd) 2.txt(efgh)`

This means the directory `/data/a` contains file `/data/a/1.txt` with content `abcd` and file `/data/a/2.txt` with content `efgh`.

Return all groups of file paths whose file contents are identical. Only include groups with at least two files. The order of groups and the order of paths inside each group do not matter.

Example:

Input: `["/data/a 1.txt(abcd) 2.txt(efgh)", "/data/b 3.txt(abcd)", "/data/c 4.txt(efgh) 5.txt(xyz)"]`

Output: `[["/data/a/1.txt", "/data/b/3.txt"], ["/data/a/2.txt", "/data/c/4.txt"]]`

Constraints:

  • `1 <= number of directory descriptions <= 10^4`
  • The total number of files is at most `10^5`.
  • File names and directory names do not contain spaces.
  • File content is represented as a string without parentheses.


You are given a list of directory descriptions. Each description has the format:

root_directory file_name_1(content_1) file_name_2(content_2) ...

For example, the description '/data/a 1.txt(abcd) 2.txt(efgh)' means the directory '/data/a' contains the files '/data/a/1.txt' with content 'abcd' and '/data/a/2.txt' with content 'efgh'.

Your task is to return all groups of file paths whose file contents are identical. Only include groups that contain at least two files.

For this problem, return the result in a deterministic order: sort each group of file paths lexicographically, then sort the list of groups by the first path in each group. If there are no duplicate files, return an empty list.
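One possible way to approach the task described above is sketched below in Python. The function name `find_duplicate_files` is a hypothetical choice, not something the problem prescribes. The sketch also assumes that file content contains no spaces (true of every example here, though the constraints only rule out parentheses), which lets each description be split on whitespace.

```python
from collections import defaultdict


def find_duplicate_files(descriptions):
    """Return groups of full paths that share identical content."""
    paths_by_content = defaultdict(list)

    for description in descriptions:
        # The first token is the directory; every later token is one
        # "name(content)" entry. Assumption: content has no spaces.
        directory, *file_tokens = description.split()
        for token in file_tokens:
            name, _, rest = token.partition("(")
            content = rest[:-1]  # drop the closing ")"
            paths_by_content[content].append(f"{directory}/{name}")

    # Keep only content seen at least twice, then apply the required order:
    # paths sorted inside each group, groups sorted by their first path.
    groups = [sorted(paths) for paths in paths_by_content.values() if len(paths) >= 2]
    groups.sort(key=lambda group: group[0])
    return groups
```

Under these assumptions the work is one pass over the input plus the final sorts, so the running time is dominated by sorting the duplicate paths. Memory is proportional to the total size of all paths and contents.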

Constraints

  • 1 <= len(descriptions) <= 10^4
  • The total number of files across all descriptions is at most 10^5
  • Directory names and file names do not contain spaces
  • File content does not contain parentheses

Examples

Input: (['/data/a 1.txt(abcd) 2.txt(efgh)', '/data/b 3.txt(abcd)', '/data/c 4.txt(efgh) 5.txt(xyz)'],)

Expected Output: [['/data/a/1.txt', '/data/b/3.txt'], ['/data/a/2.txt', '/data/c/4.txt']]

Explanation: Files with content 'abcd' are '/data/a/1.txt' and '/data/b/3.txt'. Files with content 'efgh' are '/data/a/2.txt' and '/data/c/4.txt'. 'xyz' appears only once, so it is excluded.

Input: (['/docs a.txt(hello) b.txt(world)'],)

Expected Output: []

Explanation: Both files have unique contents, so there are no duplicate groups.

Input: (['/root a.txt(z) b.txt(z) c.txt(y)', '/other d.txt(y) e.txt(z)'],)

Expected Output: [['/other/d.txt', '/root/c.txt'], ['/other/e.txt', '/root/a.txt', '/root/b.txt']]

Explanation: Content 'y' appears in '/root/c.txt' and '/other/d.txt'. Content 'z' appears in '/root/a.txt', '/root/b.txt', and '/other/e.txt'. Each group is sorted, then the groups are sorted by their first path.
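The ordering step in this explanation can be looked at in isolation. The snippet below is a small illustration only: `normalize_groups` is a hypothetical helper name, and the groups are written by hand in the order a grouping pass might produce them for this example.

```python
def normalize_groups(groups):
    """Sort paths inside each group, then order groups by their first path."""
    sorted_groups = [sorted(group) for group in groups]
    return sorted(sorted_groups, key=lambda group: group[0])


# Groups as they might come out of a grouping pass, before ordering.
unordered = [
    ["/root/a.txt", "/root/b.txt", "/other/e.txt"],  # content 'z'
    ["/root/c.txt", "/other/d.txt"],                 # content 'y'
]

print(normalize_groups(unordered))
# [['/other/d.txt', '/root/c.txt'], ['/other/e.txt', '/root/a.txt', '/root/b.txt']]
```

Every path is unique and belongs to exactly one group, so no two groups share a first path and the final order is unambiguous.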

Input: (['/a 1.txt(x)', '/b 2.txt(x)', '/c 3.txt(x)', '/d 4.txt(y)'],)

Expected Output: [['/a/1.txt', '/b/2.txt', '/c/3.txt']]

Explanation: Content 'x' appears in three files, so they form one duplicate group. Content 'y' appears only once and is ignored.

Hints

  1. A hash map from file content to a list of full file paths lets you collect duplicates efficiently.
  2. For each file token like '1.txt(abcd)', find the first '(' to split the file name from its content.
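As a rough illustration of the second hint, the sketch below splits a single token at its first '('. The helper name `parse_file_token` is hypothetical, and the sketch assumes that file names themselves contain no '(' and that every token ends with ')'.

```python
def parse_file_token(directory, token):
    """Split a 'name(content)' token into (full_path, content)."""
    open_paren = token.find("(")        # first '(' separates name from content
    name = token[:open_paren]
    content = token[open_paren + 1:-1]  # text between '(' and the final ')'
    return f"{directory}/{name}", content


full_path, content = parse_file_token("/data/a", "1.txt(abcd)")
print(full_path, content)  # /data/a/1.txt abcd
```

Each returned pair can then be added to the content-to-paths hash map from the first hint, using the content as the key and appending the full path to its list.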
Last updated: May 23, 2026


Related Coding Questions

  • Implement Persistent Memoization LRU Cache - Anthropic (hard)
  • Fix a Corrupted Bootloader Instruction - Anthropic (medium)
  • Implement a Time-Aware Task Manager - Anthropic (hard)
  • Implement a Simplified DNS Resolver - Anthropic (hard)
  • Implement Task Management and Duplicate Detection - Anthropic (medium)