PracHub

Quick Overview

This question evaluates understanding of parallelized I/O, concurrency control, failure and retry handling, and data-integrity verification for chunked file transfers, falling under the Coding & Algorithms category and the systems-level domains of networking, file I/O, hashing, and concurrency.

  • Baseten
  • Coding & Algorithms
  • Software Engineer

Parallelize chunked file download and verify integrity

Company: Baseten

Role: Software Engineer

Category: Coding & Algorithms


Interview Round: Technical Screen

You are implementing a simplified `s3 sync`-style download for a large remote object. A file is stored remotely as **N fixed-size chunks** (except possibly the last). You are given boilerplate that currently downloads chunks **sequentially** and appends them to a local file.

## Provided (conceptual) interfaces

- `chunks = list_chunks(object_key)` returns metadata for all chunks:
  - `index` (0..N-1)
  - `offset` (byte offset in the final file)
  - `length`
  - optionally: `expected_chunk_hash` (e.g., SHA-256)
- `bytes = download_chunk(object_key, index)` downloads a single chunk’s bytes.
- Optionally you may be given `expected_file_hash` (hash of the full file) and/or `expected_file_size`.

## Task 1: Parallel chunk download

Rewrite the sequential implementation so that chunks are downloaded **in parallel** to improve throughput.

Requirements/constraints:

- Add a **max concurrency** limit `K` (e.g., 8–64).
- Ensure the final on-disk file is assembled correctly (chunks must end up at the correct offsets).
- Keep memory usage reasonable (do not necessarily store all chunks in memory at once).
- Handle failures (timeouts/HTTP errors) with reasonable retry behavior; ensure partial progress does not corrupt the final output.

## Task 2 (follow-up): Validate correctness

Describe (and/or implement) how you would check that:

1. Each chunk downloaded correctly.
2. The fully assembled local file matches what was expected from the remote source.

Be explicit about what signals/metadata you would use (e.g., per-chunk hashes, whole-file hash, length checks, ordering/offset validation), and what edge cases you would consider (last chunk size, missing/duplicate chunks, retries writing to the same offset, etc.).
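One way to approach Task 1 and the per-chunk part of Task 2 is sketched below in Python. It is an illustration under stated assumptions, not a reference solution: `download_object`, its parameters, the `.part` temp-file suffix, and the use of SHA-256 hex digests are all hypothetical choices, and `fetch(index)` stands in for `download_chunk(object_key, index)`. A thread pool of size `k` bounds concurrency, each worker holds at most one chunk in memory, verified bytes are written at the chunk's `offset`, and the temp file is renamed into place only after every check passes.

```python
import hashlib
import os
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def download_object(chunks, fetch, dest_path, k=8, max_retries=3,
                    backoff=0.1, expected_file_hash=None):
    """Fetch chunks with at most k in flight and write each at its offset.

    chunks: list of dicts with 'index', 'offset', 'length' and, optionally,
    'expected_chunk_hash' (assumed here to be a SHA-256 hex digest).
    fetch(index) -> bytes is a stand-in for download_chunk(object_key, index).
    """
    chunks = list(chunks)
    total_size = sum(c["length"] for c in chunks)
    tmp_path = dest_path + ".part"  # hypothetical temp-file naming
    write_lock = threading.Lock()

    def fetch_verified(chunk):
        # Retry until the chunk passes the length and (optional) hash checks.
        for attempt in range(max_retries + 1):
            try:
                data = fetch(chunk["index"])
                if len(data) != chunk["length"]:
                    raise IOError("chunk length mismatch")
                expected = chunk.get("expected_chunk_hash")
                if expected and hashlib.sha256(data).hexdigest() != expected:
                    raise IOError("chunk hash mismatch")
                return data
            except Exception:
                if attempt == max_retries:
                    raise
                time.sleep(backoff * (2 ** attempt))  # exponential backoff

    try:
        with open(tmp_path, "wb") as out:
            out.truncate(total_size)  # pre-size the file so any offset is writable

            def worker(chunk):
                data = fetch_verified(chunk)  # network work happens outside the lock
                with write_lock:  # seek + write must not interleave across threads
                    out.seek(chunk["offset"])
                    out.write(data)

            with ThreadPoolExecutor(max_workers=k) as pool:
                for _ in pool.map(worker, chunks):
                    pass  # consuming the results re-raises any worker exception

        if expected_file_hash is not None:
            digest = hashlib.sha256()
            with open(tmp_path, "rb") as f:
                for block in iter(lambda: f.read(1 << 20), b""):
                    digest.update(block)
            if digest.hexdigest() != expected_file_hash:
                raise IOError("file hash mismatch")
        os.replace(tmp_path, dest_path)  # publish only a fully verified file
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Because a chunk is written only after it passes verification, a retry never leaves bad bytes at an offset, and a failed run never replaces an existing `dest_path`. The sketch does not cancel queued chunks once one has permanently failed, and it serializes disk writes behind a single lock; a positional write such as `os.pwrite` could remove that lock on platforms that support it.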


Part 1: Simulate a Parallel Chunk Downloader

You are given metadata for chunks of a file and a scripted set of download attempts for each chunk. Implement a deterministic simulation of a parallel downloader.

Each chunk is described by `(index, offset, length)`. The final file is made by placing the successful bytes of each chunk at its `offset`.

For each chunk index `i`, `attempts[i]` is a list of `(duration, data)` pairs tried in order:

- `duration` is how long that attempt runs.
- `data` is either a string or `None`.
- An attempt is successful only if `data is not None` and `len(data) == length` for that chunk.
- Otherwise, that attempt fails.

Simulation rules:

1. At time `0`, the first attempt for every chunk becomes available.
2. At most `k` attempts may run at the same time.
3. When an attempt fails, the next attempt for that same chunk becomes available immediately at that finish time.
4. When a worker is free, start the available attempt with the smallest `(available_time, chunk_index)`.
5. If several attempts finish at the same time, process all of those completions before starting any new attempts at that same time.
6. The chunk metadata must describe exactly one contiguous file starting at offset `0` with no gaps or overlaps.

Return `(total_time, file_contents)` if all chunks eventually succeed. If any chunk never succeeds, the metadata is invalid, or `k <= 0` for a non-empty job, return `(-1, '')`. Write successful chunk data directly into the final output buffer.

Constraints

  • 0 <= number of chunks <= 10^4
  • 1 <= k <= 10^3 for non-empty jobs
  • Total number of scripted attempts across all chunks <= 10^5
  • 0 <= offset, length, duration
  • Sum of all chunk lengths <= 2 * 10^5
  • Chunk indices used in `chunks` are unique and must be valid indices into `attempts`

Examples

Input: (2, [(0, 0, 2), (1, 2, 2), (2, 4, 1)], [[(3, 'ab')], [(2, None), (1, 'cd')], [(1, 'e')]])

Expected Output: (4, 'abcde')

Explanation: Chunk 1 fails once and retries later. The final file is assembled by offsets, and the total simulated time is 4.

Input: (2, [(0, 2, 2), (1, 0, 2)], [[(2, 'CD')], [(1, 'AB')]])

Expected Output: (2, 'ABCD')

Explanation: Chunk list order is not the same as file order. You must place data by offset, not by list position.

Input: (2, [(0, 0, 1), (1, 1, 1)], [[(1, None)], [(1, 'b')]])

Expected Output: (-1, '')

Explanation: Chunk 0 has no successful attempt, so the download fails.

Input: (3, [], [])

Expected Output: (0, '')

Explanation: Edge case: empty file.

Input: (1, [(0, 0, 2)], [[(1, 'a'), (1, 'ab')]])

Expected Output: (2, 'ab')

Explanation: The first attempt returns the wrong length, so it counts as a failure and the retry succeeds.

Hints

  1. Use one min-heap for running attempts keyed by finish time, and another for attempts that are ready to start.
  2. Do not store every successful chunk separately. Pre-allocate the final file buffer once and write successful chunks into it by offset.
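The two hints can be combined into a short event-driven simulation. The following Python sketch is one possible reading of the rules, not an official solution; the function name `simulate_download` and its argument order `(k, chunks, attempts)` are assumptions taken from the example inputs. It keeps a `ready` heap ordered by `(available_time, chunk_index)` and a `running` heap ordered by finish time, and writes successful data into a pre-allocated list of characters.

```python
import heapq


def simulate_download(k, chunks, attempts):
    if not chunks:
        return (0, "")
    if k <= 0:
        return (-1, "")

    # Validate indices, then check that offsets tile [0, total) exactly.
    seen = set()
    for index, offset, length in chunks:
        if index in seen or not 0 <= index < len(attempts):
            return (-1, "")
        if offset < 0 or length < 0:
            return (-1, "")
        seen.add(index)
    total = 0
    for _, offset, length in sorted(chunks, key=lambda c: (c[1], c[2])):
        if offset != total:  # a gap or an overlap
            return (-1, "")
        total += length

    meta = {index: (offset, length) for index, offset, length in chunks}
    buf = [""] * total                 # pre-allocated output buffer
    next_attempt = dict.fromkeys(meta, 0)
    ready = [(0, index) for index in meta]  # (available_time, chunk_index)
    heapq.heapify(ready)
    running = []                       # (finish_time, chunk_index)
    now = 0

    while ready or running:
        # Fill free workers; everything in `ready` became available at or before `now`.
        while ready and len(running) < k:
            _, index = heapq.heappop(ready)
            if next_attempt[index] >= len(attempts[index]):
                return (-1, "")        # scripted attempts exhausted
            duration, _ = attempts[index][next_attempt[index]]
            next_attempt[index] += 1
            heapq.heappush(running, (now + duration, index))

        # Advance to the earliest finish and process every completion at that time.
        now = running[0][0]
        while running and running[0][0] == now:
            _, index = heapq.heappop(running)
            _, data = attempts[index][next_attempt[index] - 1]
            offset, length = meta[index]
            if data is not None and len(data) == length:
                buf[offset:offset + length] = data  # write in place by offset
            else:
                heapq.heappush(ready, (now, index))  # retry is available now

    return (now, "".join(buf))
```

On the first example above, chunks 0 and 1 start at time 0, chunk 1 fails at time 2 and chunk 2 takes the freed worker, chunks 0 and 2 both finish at time 3, and the retry of chunk 1 then runs from 3 to 4, which matches the expected `(4, 'abcde')`.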

Part 2: Validate Downloaded Chunks and Assembled File Integrity

You are given authoritative chunk metadata and a set of downloaded chunk records. Your job is to validate both chunk-level correctness and full-file correctness. For this problem, hashes are MD5 hex strings to keep the input short. If a hash field is `None`, skip that hash check.

Each metadata tuple is `(index, offset, length, expected_chunk_hash)`. Each downloaded tuple is `(index, data)`.

Validation rules:

1. Detect unknown downloaded indices.
2. Detect duplicate downloaded chunks for the same index.
3. Detect missing chunks.
4. Validate that metadata offsets describe a single contiguous file starting at `0` with no gaps or overlaps.
5. For each chunk that appears exactly once, check its length.
6. If `expected_chunk_hash` is present, check the chunk hash.
7. Only if the file can be assembled unambiguously from the metadata and downloads, check:
   - `expected_file_size`
   - `expected_file_hash`

Return all detected error codes in this exact order: `UNKNOWN_INDEX`, `DUPLICATE_CHUNK`, `MISSING_CHUNK`, `GAP`, `OVERLAP`, `LENGTH_MISMATCH`, `CHUNK_HASH_MISMATCH`, `FILE_SIZE_MISMATCH`, `FILE_HASH_MISMATCH`.

If there are no errors, return `['OK']`.

Constraints

  • 0 <= number of metadata chunks <= 10^4
  • 0 <= number of downloaded records <= 10^4
  • Metadata chunk indices are unique
  • 0 <= offset, length
  • Sum of all downloaded string lengths <= 2 * 10^5
  • MD5 comparison should use lowercase hexadecimal strings

Examples

Input: ([(0, 0, 1, '0cc175b9c0f1b6a831c399e269772661'), (1, 1, 1, '92eb5ffee6ae2fec3ad71c777531578f'), (2, 2, 1, '4a8a08f09d37b73795649038408b5f33')], [(0, 'a'), (1, 'b'), (2, 'c')], 3, '900150983cd24fb0d6963f7d28e17f72')

Expected Output: ['OK']

Explanation: All chunks are present once, hashes match, and the assembled file is 'abc'.

Input: ([(0, 0, 1, None), (1, 1, 1, None)], [(0, 'a'), (0, 'a'), (5, 'z')], None, None)

Expected Output: ['UNKNOWN_INDEX', 'DUPLICATE_CHUNK', 'MISSING_CHUNK']

Explanation: Index 5 is not in metadata, chunk 0 appears twice, and chunk 1 is missing.

Input: ([(0, 0, 2, None), (1, 1, 1, None), (2, 4, 1, None)], [(0, 'ab'), (1, 'c'), (2, 'd')], None, None)

Expected Output: ['GAP', 'OVERLAP']

Explanation: Chunk 1 overlaps chunk 0, and there is a gap before chunk 2.

Input: ([(0, 0, 1, '0cc175b9c0f1b6a831c399e269772661'), (1, 1, 1, '92eb5ffee6ae2fec3ad71c777531578f')], [(0, 'a'), (1, 'x')], 2, '187ef4436122d1cc2f40dc2b92f0eba0')

Expected Output: ['CHUNK_HASH_MISMATCH', 'FILE_HASH_MISMATCH']

Explanation: Chunk 1 data has the right length but wrong content, so both the chunk hash and full-file hash fail.

Input: ([], [], 0, 'd41d8cd98f00b204e9800998ecf8427e')

Expected Output: ['OK']

Explanation: Edge case: empty file with the correct empty-file hash.

Hints

  1. Group downloaded records by chunk index first. That makes duplicate and missing checks easy.
  2. Sort metadata by offset to detect both gaps and overlaps before attempting any whole-file validation.
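
A possible implementation that follows these hints is sketched below in Python. The name `validate_download` and its signature are assumptions based on the example inputs. The problem statement does not say exactly when the file "can be assembled unambiguously", so this sketch assumes that any of the first six error codes (unknown, duplicate, or missing chunks, gaps, overlaps, or length mismatches) blocks the whole-file checks, while a chunk hash mismatch does not, consistent with the fourth example. It also assumes ASCII data, so that `len(data)` equals the byte length.

```python
import hashlib
from collections import defaultdict

ERROR_ORDER = [
    "UNKNOWN_INDEX", "DUPLICATE_CHUNK", "MISSING_CHUNK", "GAP", "OVERLAP",
    "LENGTH_MISMATCH", "CHUNK_HASH_MISMATCH",
    "FILE_SIZE_MISMATCH", "FILE_HASH_MISMATCH",
]
# Assumption of this sketch: any of these makes assembly ambiguous.
BLOCKING = set(ERROR_ORDER[:6])


def md5_hex(text):
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def validate_download(metadata, downloads, expected_file_size=None,
                      expected_file_hash=None):
    errors = set()
    meta = {index: (offset, length, chunk_hash)
            for index, offset, length, chunk_hash in metadata}

    # Group downloaded records by index (hint 1).
    by_index = defaultdict(list)
    for index, data in downloads:
        by_index[index].append(data)

    for index, records in by_index.items():
        if index not in meta:
            errors.add("UNKNOWN_INDEX")
        elif len(records) > 1:
            errors.add("DUPLICATE_CHUNK")
    if any(index not in by_index for index in meta):
        errors.add("MISSING_CHUNK")

    # Walk metadata in offset order to find gaps and overlaps (hint 2).
    position = 0
    for offset, length, _ in sorted(meta.values(), key=lambda m: (m[0], m[1])):
        if offset > position:
            errors.add("GAP")
        elif offset < position:
            errors.add("OVERLAP")
        position = max(position, offset + length)

    # Per-chunk checks apply only to chunks that were downloaded exactly once.
    for index, (offset, length, chunk_hash) in meta.items():
        records = by_index.get(index, [])
        if len(records) != 1:
            continue
        data = records[0]
        if len(data) != length:
            errors.add("LENGTH_MISMATCH")
        if chunk_hash is not None and md5_hex(data) != chunk_hash.lower():
            errors.add("CHUNK_HASH_MISMATCH")

    # Whole-file checks run only when assembly is unambiguous.
    if not (errors & BLOCKING):
        in_file_order = sorted(meta, key=lambda index: meta[index][0])
        assembled = "".join(by_index[index][0] for index in in_file_order)
        if expected_file_size is not None and len(assembled) != expected_file_size:
            errors.add("FILE_SIZE_MISMATCH")
        if (expected_file_hash is not None
                and md5_hex(assembled) != expected_file_hash.lower()):
            errors.add("FILE_HASH_MISMATCH")

    return [code for code in ERROR_ORDER if code in errors] or ["OK"]
```

Collecting codes in a set and emitting them through `ERROR_ORDER` keeps the required output order independent of the order in which checks happen to run.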
Last updated: Apr 22, 2026
