What difficulty level is this coding question?

This is a hard-difficulty Coding & Algorithms question, commonly asked during onsite rounds at Anthropic.

What role is this question designed for?

This question is commonly asked for Software Engineer candidates at Anthropic during technical interviews.

Find Duplicate Files by Content | Anthropic Coding Question

Q: Find Duplicate Files by Content

This question evaluates the ability to parse file-system-style input, group files by identical content, and manage deduplication using appropriate data structures and algorithms.

Company: Anthropic

Role: Software Engineer

Category: Coding & Algorithms

Difficulty: hard

Interview Round: Onsite

You are given a list of directory descriptions. Each description has the following format:

`root_directory file_name_1(content_1) file_name_2(content_2) ...`

For example:

`/data/a 1.txt(abcd) 2.txt(efgh)`

This means the directory `/data/a` contains file `/data/a/1.txt` with content `abcd` and file `/data/a/2.txt` with content `efgh`.

Return all groups of file paths whose file contents are identical. Only include groups with at least two files. The order of groups and the order of paths inside each group do not matter.

Example:

Input: `["/data/a 1.txt(abcd) 2.txt(efgh)", "/data/b 3.txt(abcd)", "/data/c 4.txt(efgh) 5.txt(xyz)"]`

Output: `[["/data/a/1.txt", "/data/b/3.txt"], ["/data/a/2.txt", "/data/c/4.txt"]]`

Constraints:

- `1 <= number of directory descriptions <= 10^4`
- The total number of files is at most `10^5`.
- File names and directory names do not contain spaces.
- File content is represented as a string without parentheses.

You are given a list of directory descriptions. Each description has the format: root_directory file_name_1(content_1) file_name_2(content_2) ... For example, the description '/data/a 1.txt(abcd) 2.txt(efgh)' means the directory '/data/a' contains the files '/data/a/1.txt' with content 'abcd' and '/data/a/2.txt' with content 'efgh'. Your task is to return all groups of file paths whose file contents are identical. Only include groups that contain at least two files. For this problem, return the result in a deterministic order: sort each group of file paths lexicographically, then sort the list of groups by the first path in each group. If there are no duplicate files, return an empty list.

Constraints

1 <= len(descriptions) <= 10^4
The total number of files across all descriptions is at most 10^5
Directory names and file names do not contain spaces
File content does not contain parentheses
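
The description format and the constraints above suggest a simple way to turn one description into full paths. The following is a minimal sketch, not an official solution. The helper name `parse_description` is hypothetical, and the code assumes that file content contains no whitespace, which the constraints do not state explicitly.

```python
def parse_description(description):
    """Return (full_path, content) pairs for one directory description.

    Assumption: file content has no whitespace, so splitting on
    whitespace separates the directory from each file token.
    """
    directory, *tokens = description.split()
    pairs = []
    for token in tokens:
        # Split at the first '(': the left side is the file name,
        # the right side is the content followed by ')'.
        name, _, rest = token.partition("(")
        content = rest[:-1]  # drop the closing ')'
        pairs.append((f"{directory}/{name}", content))
    return pairs


print(parse_description("/data/a 1.txt(abcd) 2.txt(efgh)"))
# [('/data/a/1.txt', 'abcd'), ('/data/a/2.txt', 'efgh')]
```

Because content cannot contain parentheses, the last character of each token is always the closing `)`, so slicing it off is enough to recover the content.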

Examples

Input: (['/data/a 1.txt(abcd) 2.txt(efgh)', '/data/b 3.txt(abcd)', '/data/c 4.txt(efgh) 5.txt(xyz)'],)

Expected Output: [['/data/a/1.txt', '/data/b/3.txt'], ['/data/a/2.txt', '/data/c/4.txt']]

Explanation: Files with content 'abcd' are '/data/a/1.txt' and '/data/b/3.txt'. Files with content 'efgh' are '/data/a/2.txt' and '/data/c/4.txt'. 'xyz' appears only once, so it is excluded.

Input: (['/docs a.txt(hello) b.txt(world)'],)

Expected Output: []

Explanation: Both files have unique contents, so there are no duplicate groups.

Input: (['/root a.txt(z) b.txt(z) c.txt(y)', '/other d.txt(y) e.txt(z)'],)

Expected Output: [['/other/d.txt', '/root/c.txt'], ['/other/e.txt', '/root/a.txt', '/root/b.txt']]

Explanation: Content 'y' appears in '/root/c.txt' and '/other/d.txt'. Content 'z' appears in '/root/a.txt', '/root/b.txt', and '/other/e.txt'. Each group is sorted, then the groups are sorted by their first path.

Input: (['/a 1.txt(x)', '/b 2.txt(x)', '/c 3.txt(x)', '/d 4.txt(y)'],)

Expected Output: [['/a/1.txt', '/b/2.txt', '/c/3.txt']]

Explanation: Content 'x' appears in three files, so they form one duplicate group. Content 'y' appears only once and is ignored.
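
The ordering used in the expected outputs above can be sketched separately from the grouping itself. This is an illustrative helper under stated assumptions: the name `order_groups` is hypothetical, and "lexicographically" is taken to mean Python's default string comparison (by Unicode code point).

```python
def order_groups(groups):
    """Sort paths inside each group, then sort groups by their first path.

    Groups with fewer than two paths are dropped, since only
    duplicates belong in the result.
    """
    ordered = [sorted(group) for group in groups if len(group) >= 2]
    ordered.sort(key=lambda group: group[0])
    return ordered


unordered = [
    ["/root/a.txt", "/root/b.txt", "/other/e.txt"],  # content 'z'
    ["/root/c.txt", "/other/d.txt"],                 # content 'y'
]
print(order_groups(unordered))
# [['/other/d.txt', '/root/c.txt'], ['/other/e.txt', '/root/a.txt', '/root/b.txt']]
```

Every path belongs to exactly one group, so no two groups share a first path and the sort key never ties.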

Hints

A hash map from file content to a list of full file paths lets you collect duplicates efficiently.
For each file token like '1.txt(abcd)', find the first '(' to split the file name from its content.
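
Putting the two hints together, one possible approach looks like the sketch below. It is not the official solution. The function name `find_duplicate_files` is hypothetical, and the code assumes file content contains no whitespace so that each description can be split on whitespace.

```python
from collections import defaultdict


def find_duplicate_files(descriptions):
    """Group full file paths by content and return the duplicate groups."""
    paths_by_content = defaultdict(list)

    for description in descriptions:
        # Assumption: content has no whitespace, so split() is safe.
        directory, *tokens = description.split()
        for token in tokens:
            # First '(' separates the file name from 'content)'.
            name, _, rest = token.partition("(")
            content = rest[:-1]  # drop the closing ')'
            paths_by_content[content].append(f"{directory}/{name}")

    # Keep only contents seen at least twice, in deterministic order.
    groups = [sorted(paths) for paths in paths_by_content.values() if len(paths) >= 2]
    groups.sort(key=lambda group: group[0])
    return groups


print(find_duplicate_files([
    "/data/a 1.txt(abcd) 2.txt(efgh)",
    "/data/b 3.txt(abcd)",
    "/data/c 4.txt(efgh) 5.txt(xyz)",
]))
# [['/data/a/1.txt', '/data/b/3.txt'], ['/data/a/2.txt', '/data/c/4.txt']]
```

Grouping takes time proportional to the total input length on average, and the final sorts add an O(n log n) factor in the number of duplicated paths. Keying the map on the raw content string is fine at these input sizes; for very large contents, keying on a hash of the content and then verifying matches would be one alternative.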
