PracHub
QuestionsPremiumCoachesLearningGuidesInterview Prep

Quick Overview

This question evaluates algorithm design and systems engineering competencies including chunking strategies, hash-function selection and collision mitigation, large-file and streaming processing, parallelization and I/O optimization, complexity analysis, and selection of lookup/index data structures such as hash tables, Bloom filters, or LSM-based indexes. Common in the Coding & Algorithms domain, it examines trade-offs between scalability, performance, and correctness under resource constraints and tests both conceptual understanding and practical application to real-world system-level constraints.

  • Medium
  • Anthropic
  • Coding & Algorithms
  • Software Engineer

Design file deduplication algorithm

Company: Anthropic

Role: Software Engineer

Category: Coding & Algorithms

Difficulty: Medium

Interview Round: Technical Screen

Design an algorithm to deduplicate files in a storage system. Compare fixed-size versus content-defined chunking and explain how you would choose hash functions (e.g., cryptographic hashes versus rolling hashes). Describe collision detection/mitigation, handling of very large files that do not fit in memory, streaming ingestion, and opportunities for parallelization and I/O optimization. Analyze time and space complexity and discuss data structures for fast lookups (e.g., hash tables, Bloom filters, LSM-based indexes).

Quick Answer: This question evaluates algorithm design and systems engineering competencies including chunking strategies, hash-function selection and collision mitigation, large-file and streaming processing, parallelization and I/O optimization, complexity analysis, and selection of lookup/index data structures such as hash tables, Bloom filters, or LSM-based indexes. Common in the Coding & Algorithms domain, it examines trade-offs between scalability, performance, and correctness under resource constraints and tests both conceptual understanding and practical application to real-world system-level constraints.

A storage system stores identical chunks only once. Real deduplication systems often use a rolling hash to choose chunk boundaries and a strong hash plus byte verification to confirm duplicates. In this problem, the boundary detector is simplified and duplicate detection is exact: two chunks are duplicates only if their full string contents are equal. Given a list of file contents, compute how many bytes must be stored under two strategies: (1) fixed-size chunking and (2) content-defined chunking (CDC). For fixed-size chunking, split each file into consecutive chunks of length fixed_size, except the last chunk which may be shorter. For CDC, scan each file from left to right and maintain a rolling sum of the ASCII codes of the last window_size characters of the current chunk. Cut a chunk after position i if the current chunk length is at least window_size and rolling_sum % divisor == target, or if the current chunk length reaches fixed_size (the maximum CDC chunk size). After a cut, start a new chunk and reset the rolling window. Any remaining suffix becomes the final chunk. Return [fixed_unique_bytes, cdc_unique_bytes]. Use a hash set for exact dedup lookup.

Constraints

  • 0 <= len(files) <= 10^4
  • 1 <= window_size <= fixed_size <= 10^5
  • 1 <= divisor <= 10^9
  • 0 <= target < divisor
  • The sum of lengths of all file strings is at most 2 * 10^5
  • Each file may be empty and contains ASCII characters

Examples

Input: ([], 4, 2, 5, 0)

Expected Output: [0, 0]

Explanation: There are no files, so neither strategy stores any bytes.

Input: (["abcdef", "abcdef"], 3, 2, 5, 0)

Expected Output: [6, 6]

Explanation: Fixed-size chunks are ["abc", "def"] for both files, so only 6 bytes are unique. CDC produces ["ab", "cde", "f"] for each file, again totaling 6 unique bytes.

Input: (["abcdefghi", "Xabcdefghi"], 4, 1, 3, 0)

Expected Output: [19, 13]

Explanation: Fixed-size chunking misaligns after the inserted leading X, so almost nothing matches across files. CDC cuts on characters whose ASCII value is divisible by 3 (c, f, i), producing ["abc", "def", "ghi"] and ["Xabc", "def", "ghi"], so only 13 bytes are unique.

Input: (["bbbbbb", "bbbb"], 4, 2, 10, 9)

Expected Output: [6, 6]

Explanation: The CDC boundary condition never matches because the rolling sum of "bb" is 196 and 196 % 10 = 6, not 9. CDC therefore falls back to the maximum chunk size and behaves like fixed chunking here.

Solution

from collections import deque

def solution(files, fixed_size, window_size, divisor, target):
    def fixed_chunks(s):
        for i in range(0, len(s), fixed_size):
            yield s[i:i + fixed_size]

    def cdc_chunks(s):
        n = len(s)
        if n == 0:
            return

        start = 0
        window = deque()
        window_sum = 0

        for i, ch in enumerate(s):
            val = ord(ch)
            window.append(val)
            window_sum += val

            if len(window) > window_size:
                window_sum -= window.popleft()

            current_len = i - start + 1
            should_cut = False

            if current_len >= window_size and window_sum % divisor == target:
                should_cut = True
            if current_len == fixed_size:
                should_cut = True

            if should_cut:
                yield s[start:i + 1]
                start = i + 1
                window.clear()
                window_sum = 0

        if start < n:
            yield s[start:]

    def unique_bytes(chunker):
        seen = set()
        total = 0
        for file_content in files:
            for chunk in chunker(file_content):
                if chunk not in seen:
                    seen.add(chunk)
                    total += len(chunk)
        return total

    fixed_unique = unique_bytes(fixed_chunks)
    cdc_unique = unique_bytes(cdc_chunks)
    return [fixed_unique, cdc_unique]

Time complexity: O(T), where T is the total number of characters across all files. Space complexity: O(U + window_size), where U is the total size of unique chunk contents stored in the hash set.

Hints

  1. You do not need to store every chunk occurrence. A hash set of chunks already seen is enough to count only unique stored bytes.
  2. For CDC, keep the rolling sum of the last window_size characters and reset that state every time you cut a chunk.
Last updated: May 1, 2026

Loading coding console...

PracHub

Master your tech interviews with 8,500+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.

Related Coding Questions

  • Implement Persistent Memoization LRU Cache - Anthropic (hard)
  • Fix a Corrupted Bootloader Instruction - Anthropic (medium)
  • Implement a Time-Aware Task Manager - Anthropic (hard)
  • Implement a Simplified DNS Resolver - Anthropic (hard)
  • Implement Task Management and Duplicate Detection - Anthropic (medium)