Parallelize chunked file download and verify integrity
Company: Baseten
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: nan
Interview Round: Technical Screen
Quick Answer: This question evaluates understanding of parallelized I/O, concurrency control, failure and retry handling, and data-integrity verification for chunked file transfers, falling under the Coding & Algorithms category and the systems-level domains of networking, file I/O, hashing, and concurrency.
Part 1: Simulate a Parallel Chunk Downloader
Constraints
- 0 <= number of chunks <= 10^4
- 1 <= k <= 10^3 for non-empty jobs
- Total number of scripted attempts across all chunks <= 10^5
- 0 <= offset, length, duration
- Sum of all chunk lengths <= 2 * 10^5
- Chunk indices used in `chunks` are unique and must be valid indices into `attempts`
Examples
Input: (2, [(0, 0, 2), (1, 2, 2), (2, 4, 1)], [[(3, 'ab')], [(2, None), (1, 'cd')], [(1, 'e')]])
Expected Output: (4, 'abcde')
Explanation: Chunk 1 fails once and retries later. The final file is assembled by offsets, and the total simulated time is 4.
Input: (2, [(0, 2, 2), (1, 0, 2)], [[(2, 'CD')], [(1, 'AB')]])
Expected Output: (2, 'ABCD')
Explanation: Chunk list order is not the same as file order. You must place data by offset, not by list position.
Input: (2, [(0, 0, 1), (1, 1, 1)], [[(1, None)], [(1, 'b')]])
Expected Output: (-1, '')
Explanation: Chunk 0 has no successful attempt, so the download fails.
Input: (3, [], [])
Expected Output: (0, '')
Explanation: Edge case: empty file.
Input: (1, [(0, 0, 2)], [[(1, 'a'), (1, 'ab')]])
Expected Output: (2, 'ab')
Explanation: The first attempt returns the wrong length, so it counts as a failure and the retry succeeds.
Hints
- Use one min-heap for running attempts keyed by finish time, and another for attempts that are ready to start.
- Do not store every successful chunk separately. Pre-allocate the final file buffer once and write successful chunks into it by offset.
Part 2: Validate Downloaded Chunks and Assembled File Integrity
Constraints
- 0 <= number of metadata chunks <= 10^4
- 0 <= number of downloaded records <= 10^4
- Metadata chunk indices are unique
- 0 <= offset, length
- Sum of all downloaded string lengths <= 2 * 10^5
- MD5 comparison should use lowercase hexadecimal strings
Examples
Input: ([(0, 0, 1, '0cc175b9c0f1b6a831c399e269772661'), (1, 1, 1, '92eb5ffee6ae2fec3ad71c777531578f'), (2, 2, 1, '4a8a08f09d37b73795649038408b5f33')], [(0, 'a'), (1, 'b'), (2, 'c')], 3, '900150983cd24fb0d6963f7d28e17f72')
Expected Output: ['OK']
Explanation: All chunks are present once, hashes match, and the assembled file is 'abc'.
Input: ([(0, 0, 1, None), (1, 1, 1, None)], [(0, 'a'), (0, 'a'), (5, 'z')], None, None)
Expected Output: ['UNKNOWN_INDEX', 'DUPLICATE_CHUNK', 'MISSING_CHUNK']
Explanation: Index 5 is not in metadata, chunk 0 appears twice, and chunk 1 is missing.
Input: ([(0, 0, 2, None), (1, 1, 1, None), (2, 4, 1, None)], [(0, 'ab'), (1, 'c'), (2, 'd')], None, None)
Expected Output: ['GAP', 'OVERLAP']
Explanation: Chunk 1 overlaps chunk 0, and there is a gap before chunk 2.
Input: ([(0, 0, 1, '0cc175b9c0f1b6a831c399e269772661'), (1, 1, 1, '92eb5ffee6ae2fec3ad71c777531578f')], [(0, 'a'), (1, 'x')], 2, '187ef4436122d1cc2f40dc2b92f0eba0')
Expected Output: ['CHUNK_HASH_MISMATCH', 'FILE_HASH_MISMATCH']
Explanation: Chunk 1 data has the right length but wrong content, so both the chunk hash and full-file hash fail.
Input: ([], [], 0, 'd41d8cd98f00b204e9800998ecf8427e')
Expected Output: ['OK']
Explanation: Edge case: empty file with the correct empty-file hash.
Hints
- Group downloaded records by chunk index first. That makes duplicate and missing checks easy.
- Sort metadata by offset to detect both gaps and overlaps before attempting any whole-file validation.