
Quick Overview

This question evaluates algorithm design and systems engineering competencies, including chunking strategies, hash-function selection and collision mitigation, large-file and streaming processing, parallelization and I/O optimization, complexity analysis, and the choice of lookup/index data structures such as hash tables, Bloom filters, or LSM-based indexes. Common in the Coding & Algorithms domain, it examines trade-offs between scalability, performance, and correctness under resource constraints, testing both conceptual understanding and practical application to real-world, system-level problems.


Design a file deduplication algorithm

Company: Anthropic

Role: Software Engineer

Category: Coding & Algorithms

Difficulty: Medium

Interview Round: Technical Screen

Design an algorithm to deduplicate files in a storage system. Compare fixed-size versus content-defined chunking and explain how you would choose hash functions (e.g., cryptographic hashes versus rolling hashes). Describe collision detection/mitigation, handling of very large files that do not fit in memory, streaming ingestion, and opportunities for parallelization and I/O optimization. Analyze time and space complexity and discuss data structures for fast lookups (e.g., hash tables, Bloom filters, LSM-based indexes).

A storage system stores identical chunks only once. Real deduplication systems often use a rolling hash to choose chunk boundaries and a strong hash plus byte verification to confirm duplicates. In this problem, the boundary detector is simplified and duplicate detection is exact: two chunks are duplicates only if their full string contents are equal. Given a list of file contents, compute how many bytes must be stored under two strategies: (1) fixed-size chunking and (2) content-defined chunking (CDC).

For fixed-size chunking, split each file into consecutive chunks of length fixed_size, except the last chunk, which may be shorter.

For CDC, scan each file from left to right and maintain a rolling sum of the ASCII codes of the last window_size characters of the current chunk. Cut a chunk after position i if the current chunk length is at least window_size and rolling_sum % divisor == target, or if the current chunk length reaches fixed_size (the maximum CDC chunk size). After a cut, start a new chunk and reset the rolling window. Any remaining suffix becomes the final chunk.

Return [fixed_unique_bytes, cdc_unique_bytes]. Use a hash set for exact dedup lookup.
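
The exercise itself uses exact string comparison for duplicate detection, but the "strong hash plus byte verification" pattern mentioned above can be sketched in a few lines. This is illustrative only: it assumes chunks arrive as bytes, and the function name is invented for the sketch, not part of the exercise.

import hashlib

def unique_stored_bytes(chunks):
    # Sketch: count bytes stored after dedup, fingerprinting each chunk with
    # SHA-256 and verifying bytes on a fingerprint hit. Illustrative only;
    # the reference solution below uses exact string contents as set keys.
    index = {}   # digest -> chunk bytes already stored
    stored = 0
    for chunk in chunks:
        digest = hashlib.sha256(chunk).digest()
        match = index.get(digest)
        if match is not None and match == chunk:
            continue  # verified duplicate: nothing new to store
        # New content (a genuine SHA-256 collision is astronomically unlikely,
        # but a production system would still handle that case explicitly).
        index[digest] = chunk
        stored += len(chunk)
    return stored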

Constraints

  • 0 <= len(files) <= 10^4
  • 1 <= window_size <= fixed_size <= 10^5
  • 1 <= divisor <= 10^9
  • 0 <= target < divisor
  • The sum of lengths of all file strings is at most 2 * 10^5
  • Each file may be empty and contains only ASCII characters

Examples

In each example, the input tuple is (files, fixed_size, window_size, divisor, target).

Input: ([], 4, 2, 5, 0)

Expected Output: [0, 0]

Explanation: There are no files, so neither strategy stores any bytes.

Input: (["abcdef", "abcdef"], 3, 2, 5, 0)

Expected Output: [6, 6]

Explanation: Fixed-size chunks are ["abc", "def"] for both files, so only 6 bytes are unique. CDC produces ["ab", "cde", "f"] for each file, again totaling 6 unique bytes.
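
To see where those CDC cuts come from: with window_size = 2, divisor = 5, and target = 0, the rolling sum after "ab" is ord('a') + ord('b') = 97 + 98 = 195, and 195 % 5 == 0, so "ab" is cut. The rolling sums over the last two characters after that, 99 + 100 = 199 for "cd" and 100 + 101 = 201 for "cde", never hit the target residue, but "cde" reaches fixed_size = 3 and is cut by the maximum-size rule; the remaining "f" becomes the final chunk.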

Input: (["abcdefghi", "Xabcdefghi"], 4, 1, 3, 0)

Expected Output: [19, 13]

Explanation: Fixed-size chunking misaligns after the inserted leading X, so almost nothing matches across files. CDC cuts on characters whose ASCII value is divisible by 3 (c, f, i), producing ["abc", "def", "ghi"] and ["Xabc", "def", "ghi"], so only 13 bytes are unique.

Input: (["bbbbbb", "bbbb"], 4, 2, 10, 9)

Expected Output: [6, 6]

Explanation: The CDC boundary condition never matches because the rolling sum of "bb" is 196 and 196 % 10 = 6, not 9. CDC therefore falls back to the maximum chunk size and behaves like fixed chunking here.

Solution

from collections import deque

def solution(files, fixed_size, window_size, divisor, target):
    def fixed_chunks(s):
        # Consecutive chunks of length fixed_size; the last one may be shorter.
        for i in range(0, len(s), fixed_size):
            yield s[i:i + fixed_size]

    def cdc_chunks(s):
        n = len(s)
        if n == 0:
            return

        start = 0            # start index of the current chunk
        window = deque()     # ASCII codes of the last window_size characters
        window_sum = 0

        for i, ch in enumerate(s):
            val = ord(ch)
            window.append(val)
            window_sum += val

            if len(window) > window_size:
                window_sum -= window.popleft()

            current_len = i - start + 1
            should_cut = False

            # Content-defined boundary: rolling sum hits the target residue.
            if current_len >= window_size and window_sum % divisor == target:
                should_cut = True
            # Fallback boundary: chunk reached the maximum CDC chunk size.
            if current_len == fixed_size:
                should_cut = True

            if should_cut:
                yield s[start:i + 1]
                start = i + 1
                window.clear()
                window_sum = 0

        # Any remaining suffix becomes the final chunk.
        if start < n:
            yield s[start:]

    def unique_bytes(chunker):
        # A hash set of chunk contents gives exact deduplication.
        seen = set()
        total = 0
        for file_content in files:
            for chunk in chunker(file_content):
                if chunk not in seen:
                    seen.add(chunk)
                    total += len(chunk)
        return total

    fixed_unique = unique_bytes(fixed_chunks)
    cdc_unique = unique_bytes(cdc_chunks)
    return [fixed_unique, cdc_unique]
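
A quick sanity check of this solution against the examples above (it assumes the function is defined exactly as shown):

if __name__ == "__main__":
    assert solution([], 4, 2, 5, 0) == [0, 0]
    assert solution(["abcdef", "abcdef"], 3, 2, 5, 0) == [6, 6]
    assert solution(["abcdefghi", "Xabcdefghi"], 4, 1, 3, 0) == [19, 13]
    assert solution(["bbbbbb", "bbbb"], 4, 2, 10, 9) == [6, 6]
    print("All examples pass.")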

Time complexity: O(T), where T is the total number of characters across all files. Space complexity: O(U + window_size), where U is the total size of unique chunk contents stored in the hash set.
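
For the follow-up discussion about lookup structures: at larger scale the index would hold fixed-size fingerprints rather than raw chunk strings, and a Bloom filter can pre-screen lookups so that most never-before-seen chunks skip the exact index entirely. A minimal sketch, assuming a toy bit-array Bloom filter whose bit positions are derived from slices of one SHA-256 digest (the class name and sizes are illustrative, not tuned):

import hashlib

class ChunkBloomFilter:
    # Toy Bloom filter over chunk contents; sizes are illustrative.
    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, chunk: bytes):
        # Derive several bit positions from slices of a single SHA-256 digest.
        digest = hashlib.sha256(chunk).digest()
        for k in range(self.num_hashes):
            yield int.from_bytes(digest[4 * k:4 * k + 4], "big") % self.num_bits

    def add(self, chunk: bytes) -> None:
        for pos in self._positions(chunk):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, chunk: bytes) -> bool:
        # False means "definitely new"; True means "consult the exact index",
        # since Bloom filters can give false positives but never false negatives.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(chunk))

On a miss the chunk can be stored immediately; on a hit the exact hash-set (or on-disk index) check still decides, because false positives are possible.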

Hints

  1. You do not need to store every chunk occurrence. A hash set of chunks already seen is enough to count only unique stored bytes.
  2. For CDC, keep the rolling sum of the last window_size characters and reset that state every time you cut a chunk.

