Design file deduplication algorithm
Company: Anthropic
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: Medium
Interview Round: Technical Screen
Quick Answer: This question evaluates algorithm design and systems engineering competencies including chunking strategies, hash-function selection and collision mitigation, large-file and streaming processing, parallelization and I/O optimization, complexity analysis, and selection of lookup/index data structures such as hash tables, Bloom filters, or LSM-based indexes. Common in the Coding & Algorithms domain, it examines trade-offs between scalability, performance, and correctness under resource constraints and tests both conceptual understanding and practical application to real-world system-level constraints.
Constraints
- 0 <= len(files) <= 10^4
- 1 <= window_size <= fixed_size <= 10^5
- 1 <= divisor <= 10^9
- 0 <= target < divisor
- The sum of lengths of all file strings is at most 2 * 10^5
- Each file may be empty and contains ASCII characters
Examples
Input: ([], 4, 2, 5, 0)
Expected Output: [0, 0]
Explanation: There are no files, so neither strategy stores any bytes.
Input: (["abcdef", "abcdef"], 3, 2, 5, 0)
Expected Output: [6, 6]
Explanation: Fixed-size chunks are ["abc", "def"] for both files, so only 6 bytes are unique. CDC produces ["ab", "cde", "f"] for each file, again totaling 6 unique bytes.
Input: (["abcdefghi", "Xabcdefghi"], 4, 1, 3, 0)
Expected Output: [19, 13]
Explanation: Fixed-size chunking misaligns after the inserted leading X, so almost nothing matches across files. CDC cuts on characters whose ASCII value is divisible by 3 (c, f, i), producing ["abc", "def", "ghi"] and ["Xabc", "def", "ghi"], so only 13 bytes are unique.
Input: (["bbbbbb", "bbbb"], 4, 2, 10, 9)
Expected Output: [6, 6]
Explanation: The CDC boundary condition never matches because the rolling sum of "bb" is 196 and 196 % 10 = 6, not 9. CDC therefore falls back to the maximum chunk size and behaves like fixed chunking here.
Solution
from collections import deque
def solution(files, fixed_size, window_size, divisor, target):
def fixed_chunks(s):
for i in range(0, len(s), fixed_size):
yield s[i:i + fixed_size]
def cdc_chunks(s):
n = len(s)
if n == 0:
return
start = 0
window = deque()
window_sum = 0
for i, ch in enumerate(s):
val = ord(ch)
window.append(val)
window_sum += val
if len(window) > window_size:
window_sum -= window.popleft()
current_len = i - start + 1
should_cut = False
if current_len >= window_size and window_sum % divisor == target:
should_cut = True
if current_len == fixed_size:
should_cut = True
if should_cut:
yield s[start:i + 1]
start = i + 1
window.clear()
window_sum = 0
if start < n:
yield s[start:]
def unique_bytes(chunker):
seen = set()
total = 0
for file_content in files:
for chunk in chunker(file_content):
if chunk not in seen:
seen.add(chunk)
total += len(chunk)
return total
fixed_unique = unique_bytes(fixed_chunks)
cdc_unique = unique_bytes(cdc_chunks)
return [fixed_unique, cdc_unique]Time complexity: O(T), where T is the total number of characters across all files. Space complexity: O(U + window_size), where U is the total size of unique chunk contents stored in the hash set.
Hints
- You do not need to store every chunk occurrence. A hash set of chunks already seen is enough to count only unique stored bytes.
- For CDC, keep the rolling sum of the last window_size characters and reset that state every time you cut a chunk.