Design file deduplication algorithm
Company: Anthropic
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: Medium
Interview Round: Technical Screen
Quick Answer: This question evaluates algorithm design and systems engineering skills: chunking strategies, hash-function selection and collision mitigation, streaming and large-file processing, parallelization and I/O optimization, complexity analysis, and the choice of lookup/index data structures such as hash tables, Bloom filters, or LSM-based indexes. Common in the Coding & Algorithms domain, it probes trade-offs between scalability, performance, and correctness under resource constraints, testing both conceptual understanding and practical application.
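As an aside on the chunk-indexing idea mentioned above (not part of the graded solution below), real deduplicating stores usually key chunks by a content digest rather than by the raw bytes; this is a minimal sketch using Python's hashlib, with illustrative names:

import hashlib

def chunk_fingerprint(chunk: bytes) -> str:
    # SHA-256 gives a fixed-size key per chunk; collisions are negligible in practice.
    return hashlib.sha256(chunk).hexdigest()

stored = {}  # digest -> chunk length (illustrative index)
for chunk in (b"abc", b"def", b"abc"):
    stored.setdefault(chunk_fingerprint(chunk), len(chunk))

print(sum(stored.values()))  # 6: the duplicate b"abc" is stored only once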
Constraints
- 0 <= len(files) <= 10^4
- 1 <= window_size <= fixed_size <= 10^5
- 1 <= divisor <= 10^9
- 0 <= target < divisor
- The sum of lengths of all file strings is at most 2 * 10^5
- Each file string consists of ASCII characters and may be empty
Examples
Input: ([], 4, 2, 5, 0)
Expected Output: [0, 0]
Explanation: There are no files, so neither strategy stores any bytes.
Input: (["abcdef", "abcdef"], 3, 2, 5, 0)
Expected Output: [6, 6]
Explanation: Fixed-size chunks are ["abc", "def"] for both files, so only 6 bytes are unique. CDC produces ["ab", "cde", "f"] for each file, again totaling 6 unique bytes.
Input: (["abcdefghi", "Xabcdefghi"], 4, 1, 3, 0)
Expected Output: [19, 13]
Explanation: Fixed-size chunking misaligns after the inserted leading X, so no chunk is shared between the two files and all 19 bytes are stored. Because window_size is 1, CDC cuts after any character whose ASCII value is divisible by 3 (c, f, i), producing ["abc", "def", "ghi"] and ["Xabc", "def", "ghi"], so only 13 bytes are unique.
Input: (["bbbbbb", "bbbb"], 4, 2, 10, 9)
Expected Output: [6, 6]
Explanation: The CDC boundary condition never matches because the rolling sum of "bb" is 196 and 196 % 10 = 6, not 9. CDC therefore falls back to the maximum chunk size and behaves like fixed chunking here.
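A quick check of the arithmetic in the last explanation (a throwaway sketch, not part of the graded solution):

rolling_sum = sum(ord(c) for c in "bb")  # ord('b') = 98, so the window sum is 196
print(rolling_sum % 10)                  # 6, which never equals the target of 9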
Solution
from collections import deque

def solution(files, fixed_size, window_size, divisor, target):
    def fixed_chunks(s):
        # Strategy 1: cut the file into consecutive fixed_size pieces.
        for i in range(0, len(s), fixed_size):
            yield s[i:i + fixed_size]

    def cdc_chunks(s):
        # Strategy 2: content-defined chunking driven by a rolling sum
        # over the last window_size characters.
        n = len(s)
        if n == 0:
            return

        start = 0
        window = deque()
        window_sum = 0

        for i, ch in enumerate(s):
            val = ord(ch)
            window.append(val)
            window_sum += val

            # Keep only the last window_size character values in the window.
            if len(window) > window_size:
                window_sum -= window.popleft()

            current_len = i - start + 1
            should_cut = False

            # Content-defined boundary: a full window whose sum hits the target residue.
            if current_len >= window_size and window_sum % divisor == target:
                should_cut = True
            # Fallback boundary: never let a chunk grow beyond fixed_size.
            if current_len == fixed_size:
                should_cut = True

            if should_cut:
                yield s[start:i + 1]
                start = i + 1
                # Reset the rolling state at every chunk boundary.
                window.clear()
                window_sum = 0

        if start < n:
            yield s[start:]

    def unique_bytes(chunker):
        # Count bytes only for chunks seen for the first time across all files.
        seen = set()
        total = 0
        for file_content in files:
            for chunk in chunker(file_content):
                if chunk not in seen:
                    seen.add(chunk)
                    total += len(chunk)
        return total

    fixed_unique = unique_bytes(fixed_chunks)
    cdc_unique = unique_bytes(cdc_chunks)
    return [fixed_unique, cdc_unique]
Time complexity: O(T), where T is the total number of characters across all files. Space complexity: O(U + window_size), where U is the total size of unique chunk contents stored in the hash set.
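As a quick sanity check, the examples above can be run directly against the solution as written:

print(solution([], 4, 2, 5, 0))                           # [0, 0]
print(solution(["abcdef", "abcdef"], 3, 2, 5, 0))         # [6, 6]
print(solution(["abcdefghi", "Xabcdefghi"], 4, 1, 3, 0))  # [19, 13]
print(solution(["bbbbbb", "bbbb"], 4, 2, 10, 9))          # [6, 6]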
Hints
- You do not need to store every chunk occurrence. A hash set of chunks already seen is enough to count only unique stored bytes.
- For CDC, keep a rolling sum of the last window_size characters and reset that state every time you cut a chunk (see the sketch below).
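A minimal sketch of the rolling-window bookkeeping from the second hint, isolated from the rest of the solution (names are illustrative):

from collections import deque

def cut_positions(s, window_size, divisor, target, fixed_size):
    # Return the indices (exclusive) at which CDC would end a chunk.
    window, window_sum, start, cuts = deque(), 0, 0, []
    for i, ch in enumerate(s):
        window.append(ord(ch))
        window_sum += ord(ch)
        if len(window) > window_size:
            window_sum -= window.popleft()
        length = i - start + 1
        if (length >= window_size and window_sum % divisor == target) or length == fixed_size:
            cuts.append(i + 1)
            start = i + 1
            window.clear()  # the rolling state resets at every boundary
            window_sum = 0
    return cuts

print(cut_positions("abcdef", 2, 5, 0, 3))  # [2, 5]: chunks "ab", "cde", then the tail "f"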