Design file deduplication algorithm
Company: Anthropic
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: Medium
Interview Round: Technical Screen
Quick Answer: This question evaluates algorithm design and systems engineering skills: chunking strategies, hash-function selection and collision mitigation, streaming and large-file processing, parallelization and I/O optimization, complexity analysis, and the choice of lookup/index data structures such as hash tables, Bloom filters, or LSM-based indexes. Common in the Coding & Algorithms domain, it probes trade-offs between scalability, performance, and correctness under resource constraints, testing both conceptual understanding and practical application.
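As an aside on the chunk-indexing idea mentioned above (not part of the graded solution below), real deduplicating stores usually key chunks by a content digest rather than by the raw bytes; this is a minimal sketch using Python's hashlib, with illustrative names:

import hashlib

def chunk_fingerprint(chunk: bytes) -> str:
    # SHA-256 gives a fixed-size key per chunk; collisions are negligible in practice.
    return hashlib.sha256(chunk).hexdigest()

stored = {}  # digest -> chunk length (illustrative index)
for chunk in (b"abc", b"def", b"abc"):
    stored.setdefault(chunk_fingerprint(chunk), len(chunk))

print(sum(stored.values()))  # 6: the duplicate b"abc" is stored only once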
Constraints
- 0 <= len(files) <= 10^4
- 1 <= window_size <= fixed_size <= 10^5
- 1 <= divisor <= 10^9
- 0 <= target < divisor
- The sum of lengths of all file strings is at most 2 * 10^5
- Each file string consists of ASCII characters and may be empty
Examples
Input: ([], 4, 2, 5, 0)
Expected Output: [0, 0]
Explanation: There are no files, so neither strategy stores any bytes.
Input: (["abcdef", "abcdef"], 3, 2, 5, 0)
Expected Output: [6, 6]
Explanation: Fixed-size chunks are ["abc", "def"] for both files, so only 6 bytes are unique. CDC produces ["ab", "cde", "f"] for each file, again totaling 6 unique bytes.
Input: (["abcdefghi", "Xabcdefghi"], 4, 1, 3, 0)
Expected Output: [19, 13]
Explanation: Fixed-size chunking misaligns after the inserted leading X, so no chunk is shared between the two files and all 19 bytes are stored. Because window_size is 1, CDC cuts after any character whose ASCII value is divisible by 3 (c, f, i), producing ["abc", "def", "ghi"] and ["Xabc", "def", "ghi"], so only 13 bytes are unique.
Input: (["bbbbbb", "bbbb"], 4, 2, 10, 9)
Expected Output: [6, 6]
Explanation: The CDC boundary condition never matches because the rolling sum of "bb" is 196 and 196 % 10 = 6, not 9. CDC therefore falls back to the maximum chunk size and behaves like fixed chunking here.
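A quick check of the arithmetic in the last explanation (a throwaway sketch, not part of the graded solution):

rolling_sum = sum(ord(c) for c in "bb")  # ord('b') = 98, so the window sum is 196
print(rolling_sum % 10)                  # 6, which never equals the target of 9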
Solution
from collections import deque

def solution(files, fixed_size, window_size, divisor, target):
    def fixed_chunks(s):
        # Strategy 1: cut the file into consecutive fixed_size pieces.
        for i in range(0, len(s), fixed_size):
            yield s[i:i + fixed_size]

    def cdc_chunks(s):
        # Strategy 2: content-defined chunking driven by a rolling sum
        # over the last window_size characters.
        n = len(s)
        if n == 0:
            return

        start = 0
        window = deque()
        window_sum = 0

        for i, ch in enumerate(s):
            val = ord(ch)
            window.append(val)
            window_sum += val

            # Keep only the last window_size character values in the window.
            if len(window) > window_size:
                window_sum -= window.popleft()

            current_len = i - start + 1
            should_cut = False

            # Content-defined boundary: a full window whose sum hits the target residue.
            if current_len >= window_size and window_sum % divisor == target:
                should_cut = True
            # Fallback boundary: never let a chunk grow beyond fixed_size.
            if current_len == fixed_size:
                should_cut = True

            if should_cut:
                yield s[start:i + 1]
                start = i + 1
                # Reset the rolling state at every chunk boundary.
                window.clear()
                window_sum = 0

        if start < n:
            yield s[start:]

    def unique_bytes(chunker):
        # Count bytes only for chunks seen for the first time across all files.
        seen = set()
        total = 0
        for file_content in files:
            for chunk in chunker(file_content):
                if chunk not in seen:
                    seen.add(chunk)
                    total += len(chunk)
        return total

    fixed_unique = unique_bytes(fixed_chunks)
    cdc_unique = unique_bytes(cdc_chunks)
    return [fixed_unique, cdc_unique]
Time complexity: O(T), where T is the total number of characters across all files. Space complexity: O(U + window_size), where U is the total size of unique chunk contents stored in the hash set.
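As a quick sanity check, the examples above can be run directly against the solution as written:

print(solution([], 4, 2, 5, 0))                           # [0, 0]
print(solution(["abcdef", "abcdef"], 3, 2, 5, 0))         # [6, 6]
print(solution(["abcdefghi", "Xabcdefghi"], 4, 1, 3, 0))  # [19, 13]
print(solution(["bbbbbb", "bbbb"], 4, 2, 10, 9))          # [6, 6]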
Hints
- You do not need to store every chunk occurrence. A hash set of chunks already seen is enough to count only unique stored bytes.
- For CDC, keep a rolling sum of the last window_size characters and reset that state every time you cut a chunk (see the sketch below).
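A minimal sketch of the rolling-window bookkeeping from the second hint, isolated from the rest of the solution (names are illustrative):

from collections import deque

def cut_positions(s, window_size, divisor, target, fixed_size):
    # Return the indices (exclusive) at which CDC would end a chunk.
    window, window_sum, start, cuts = deque(), 0, 0, []
    for i, ch in enumerate(s):
        window.append(ord(ch))
        window_sum += ord(ch)
        if len(window) > window_size:
            window_sum -= window.popleft()
        length = i - start + 1
        if (length >= window_size and window_sum % divisor == target) or length == fixed_size:
            cuts.append(i + 1)
            start = i + 1
            window.clear()  # the rolling state resets at every boundary
            window_sum = 0
    return cuts

print(cut_positions("abcdef", 2, 5, 0, 3))  # [2, 5]: chunks "ab", "cde", then the tail "f"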