Implement Persistent KV Store Serialization
Company: OpenAI
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: hard
Interview Round: Technical Screen
Quick Answer: This question evaluates understanding of persistent storage design, binary-safe serialization, exact state restoration, and mutation logging for a key-value store, testing file I/O, data structures, and storage formats within the Coding & Algorithms domain.
Part 1: Serialize and Deserialize a Persistent In-Memory KV Store
Constraints
- `0 <= len(operations) <= 10^4`
- Keys and values are only of type `bytes` or `str`
- Total size of all live keys and values at any moment is at most `10^6` bytes
- Every `deserialize(path)` refers to a path that was previously serialized
Examples
Input: ([('put', 'a', '1'), ('get', 'a'), ('serialize', 'snap1'), ('put', 'a', '2'), ('deserialize', 'snap1'), ('get', 'a')],)
Expected Output: ['1', '1']
Explanation: After deserializing `snap1`, the old value `'1'` is restored.
Input: ([('put', b'a|b', b'v:1'), ('serialize', 'p'), ('delete', b'a|b'), ('get', b'a|b'), ('deserialize', 'p'), ('get', b'a|b')],)
Expected Output: [None, b'v:1']
Explanation: The format must not break when keys or values contain delimiter-like bytes.
Input: ([('serialize', 'empty'), ('put', b'x', b'y'), ('deserialize', 'empty'), ('get', b'x')],)
Expected Output: [None]
Explanation: Edge case: serializing and restoring an empty store should remove later changes.
Input: ([('put', 'ключ', b'value'), ('put', b'bin', 'тест'), ('serialize', 'mix'), ('delete', 'ключ'), ('deserialize', 'mix'), ('get', 'ключ'), ('get', b'bin')],)
Expected Output: [b'value', 'тест']
Explanation: Mixed `str` and `bytes` entries must be restored with the same types.
Hints
- Store lengths explicitly. A length-prefixed binary format is safe even when the data contains `:`, `|`, or other separator characters.
- If you support both `bytes` and `str`, add a small type tag before each encoded field so deserialization can restore the exact type.
Part 2: Persist the KV Store Across Multiple Segments
Constraints
- `9 <= max_segment_size <= 10^6`
- `0 <= len(operations) <= 10^4`
- Total serialized snapshot size at any moment is at most `10^6` bytes
- Every `deserialize(prefix)` and `segment_count(prefix)` refers to a prefix that was previously serialized
- Every `reorder(prefix, order)` uses a valid permutation of the existing segment indices
Examples
Input: (15, [('put', b'a', b'12345'), ('put', b'b', b'67890'), ('serialize', 'p'), ('segment_count', 'p'), ('delete', b'a'), ('deserialize', 'p'), ('get', b'a'), ('get', b'b')])
Expected Output: [6, b'12345', b'67890']
Explanation: The snapshot spans multiple segments when the per-segment size limit is small.
Input: (20, [('put', b'aa', b'xx'), ('put', b'bb', b'yy'), ('serialize', 'snap'), ('segment_count', 'snap'), ('reorder', 'snap', [2, 0, 1]), ('delete', b'aa'), ('deserialize', 'snap'), ('get', b'aa'), ('get', b'bb')])
Expected Output: [3, b'xx', b'yy']
Explanation: Even after reordering the saved segments, deserialization must rebuild the correct store.
Input: (12, [('serialize', 'empty'), ('segment_count', 'empty'), ('put', b'a', b'b'), ('deserialize', 'empty'), ('get', b'a')])
Expected Output: [1, None]
Explanation: Edge case: an empty snapshot is still a valid serialized snapshot and fits in one segment.
Input: (13, [('put', b'k', b'abcdefghij'), ('serialize', 'big'), ('segment_count', 'big'), ('deserialize', 'big'), ('get', b'k')])
Expected Output: [5, b'abcdefghij']
Explanation: A single large value may need many segments; the restored value must be exact.
Hints
- A clean approach is to serialize the whole snapshot into one binary stream first, then split that stream into fixed-size payload chunks.
- Add a fixed-size header to each segment containing at least the segment index and the total number of segments.
Part 3: Add an Append-Only Mutation Log with Replay and Compaction
Constraints
- `1 <= compact_threshold <= 10^4`
- `0 <= len(operations) <= 10^4`
- Keys and values are only of type `bytes` or `str`
- Total size of snapshot plus log at any moment is at most `10^6` bytes
Examples
Input: (3, [('put', 'a', '1'), ('put', 'b', '2'), ('restart',), ('get', 'a'), ('get', 'b'), ('status',)])
Expected Output: ['1', '2', (2, 2, 0)]
Explanation: With threshold 3, two mutations stay only in the log; restart must replay them correctly.
Input: (2, [('put', 'x', '1'), ('status',), ('put', 'y', '2'), ('status',), ('restart',), ('get', 'x'), ('get', 'y'), ('status',)])
Expected Output: [(1, 1, 0), (2, 0, 1), '1', '2', (2, 0, 1)]
Explanation: The second mutation reaches the threshold, so the log is compacted into a snapshot and cleared.
Input: (10, [('put', 'a', '1'), ('put', 'b', '2'), ('delete', 'a'), ('restart',), ('get', 'a'), ('get', 'b'), ('status',)])
Expected Output: [None, '2', (1, 3, 0)]
Explanation: Replay must apply the delete after the earlier put, leaving only `'b'`.
Input: (1, [('restart',), ('get', 'missing'), ('put', b'k', b'v'), ('status',), ('delete', b'k'), ('restart',), ('get', b'k'), ('status',)])
Expected Output: [None, (1, 0, 1), None, (0, 0, 2)]
Explanation: Edge case: threshold 1 means every mutation compacts immediately.
Hints
- Keep snapshot encoding and log encoding separate. The snapshot stores full state, while the log stores only mutations.
- On `restart`, rebuild by loading the snapshot first and then replaying each log record in order. After each new mutation, check whether it is time to compact.