PracHub
QuestionsPremiumLearningGuidesInterview PrepNEWCoaches

Quick Overview

This question evaluates understanding of persistent storage design, binary-safe serialization, exact state restoration, and mutation logging for a key-value store, testing file I/O, data structures, and storage formats within the Coding & Algorithms domain.

  • hard
  • OpenAI
  • Coding & Algorithms
  • Software Engineer

Implement Persistent KV Store Serialization

Company: OpenAI

Role: Software Engineer

Category: Coding & Algorithms

Difficulty: hard

Interview Round: Technical Screen

Implement a persistent in-memory key-value store. Requirements: - Keys and values are arbitrary byte strings or UTF-8 strings; do not rely on delimiters that may appear in the data. - Support put(key, value), get(key), delete(key), serialize(path), and deserialize(path). - Deserialization must restore exactly the state that was serialized. Follow-ups: 1. The data may not fit into a single file. Extend the persistence format to split data across multiple files or segments and restore from them. 2. A full rewrite on shutdown is too expensive after the store has been restored and modified. Add an append-only log for mutations, implement log replay during startup, and define when to compact the log into a snapshot.

Quick Answer: This question evaluates understanding of persistent storage design, binary-safe serialization, exact state restoration, and mutation logging for a key-value store, testing file I/O, data structures, and storage formats within the Coding & Algorithms domain.

Part 1: Serialize and Deserialize a Persistent In-Memory KV Store

Implement a persistent in-memory key-value store using a fake disk inside the function. The store supports `put`, `get`, `delete`, `serialize(path)`, and `deserialize(path)`. Each key and value is either a Python `bytes` object or a `str`. Your persistence format must be binary-safe: do not rely on delimiters that may appear inside the data. When `deserialize(path)` is called, the current in-memory store must become exactly the state that was previously written by `serialize(path)`. Because an online judge should not depend on real file I/O, treat `path` as a label in an internal dictionary that represents disk.

Constraints

  • `0 <= len(operations) <= 10^4`
  • Keys and values are only of type `bytes` or `str`
  • Total size of all live keys and values at any moment is at most `10^6` bytes
  • Every `deserialize(path)` refers to a path that was previously serialized

Examples

Input: ([('put', 'a', '1'), ('get', 'a'), ('serialize', 'snap1'), ('put', 'a', '2'), ('deserialize', 'snap1'), ('get', 'a')],)

Expected Output: ['1', '1']

Explanation: After deserializing `snap1`, the old value `'1'` is restored.

Input: ([('put', b'a|b', b'v:1'), ('serialize', 'p'), ('delete', b'a|b'), ('get', b'a|b'), ('deserialize', 'p'), ('get', b'a|b')],)

Expected Output: [None, b'v:1']

Explanation: The format must not break when keys or values contain delimiter-like bytes.

Input: ([('serialize', 'empty'), ('put', b'x', b'y'), ('deserialize', 'empty'), ('get', b'x')],)

Expected Output: [None]

Explanation: Edge case: serializing and restoring an empty store should remove later changes.

Input: ([('put', 'ключ', b'value'), ('put', b'bin', 'тест'), ('serialize', 'mix'), ('delete', 'ключ'), ('deserialize', 'mix'), ('get', 'ключ'), ('get', b'bin')],)

Expected Output: [b'value', 'тест']

Explanation: Mixed `str` and `bytes` entries must be restored with the same types.

Hints

  1. Store lengths explicitly. A length-prefixed binary format is safe even when the data contains `:`, `|`, or other separator characters.
  2. If you support both `bytes` and `str`, add a small type tag before each encoded field so deserialization can restore the exact type.

Part 2: Persist the KV Store Across Multiple Segments

Extend the persistent key-value store so a snapshot can be split across multiple segment files. Implement the same store operations as in Part 1, but now `serialize(prefix)` must save the snapshot as a list of segment blobs, where each individual segment has size at most `max_segment_size` bytes. Segments may later be returned in a different order, so every segment must contain enough metadata to reconstruct the original snapshot correctly. Use a binary-safe encoding for keys and values; do not rely on delimiters. For testing, the fake disk stores a list of segment blobs under each `prefix`.

Constraints

  • `9 <= max_segment_size <= 10^6`
  • `0 <= len(operations) <= 10^4`
  • Total serialized snapshot size at any moment is at most `10^6` bytes
  • Every `deserialize(prefix)` and `segment_count(prefix)` refers to a prefix that was previously serialized
  • Every `reorder(prefix, order)` uses a valid permutation of the existing segment indices

Examples

Input: (15, [('put', b'a', b'12345'), ('put', b'b', b'67890'), ('serialize', 'p'), ('segment_count', 'p'), ('delete', b'a'), ('deserialize', 'p'), ('get', b'a'), ('get', b'b')])

Expected Output: [6, b'12345', b'67890']

Explanation: The snapshot spans multiple segments when the per-segment size limit is small.

Input: (20, [('put', b'aa', b'xx'), ('put', b'bb', b'yy'), ('serialize', 'snap'), ('segment_count', 'snap'), ('reorder', 'snap', [2, 0, 1]), ('delete', b'aa'), ('deserialize', 'snap'), ('get', b'aa'), ('get', b'bb')])

Expected Output: [3, b'xx', b'yy']

Explanation: Even after reordering the saved segments, deserialization must rebuild the correct store.

Input: (12, [('serialize', 'empty'), ('segment_count', 'empty'), ('put', b'a', b'b'), ('deserialize', 'empty'), ('get', b'a')])

Expected Output: [1, None]

Explanation: Edge case: an empty snapshot is still a valid serialized snapshot and fits in one segment.

Input: (13, [('put', b'k', b'abcdefghij'), ('serialize', 'big'), ('segment_count', 'big'), ('deserialize', 'big'), ('get', b'k')])

Expected Output: [5, b'abcdefghij']

Explanation: A single large value may need many segments; the restored value must be exact.

Hints

  1. A clean approach is to serialize the whole snapshot into one binary stream first, then split that stream into fixed-size payload chunks.
  2. Add a fixed-size header to each segment containing at least the segment index and the total number of segments.

Part 3: Add an Append-Only Mutation Log with Replay and Compaction

Implement a persistent key-value store that keeps two persisted artifacts on a fake disk: 1. a full snapshot of the store 2. an append-only log of later mutations The store starts empty. Every `put` and `delete` must update the in-memory map and also append a mutation record to the persisted log. A `restart` simulates process shutdown and startup: throw away the in-memory state, reload the snapshot, and replay the log in order. A full rewrite after every mutation is too expensive, so snapshot compaction should happen only when the pending log becomes large enough. In this problem, define the compaction policy as follows: if the number of pending log records becomes greater than or equal to `compact_threshold` immediately after a mutation, write a fresh snapshot of the current store and clear the log. Keys and values are either `bytes` or `str`, and the serialization format must be binary-safe.

Constraints

  • `1 <= compact_threshold <= 10^4`
  • `0 <= len(operations) <= 10^4`
  • Keys and values are only of type `bytes` or `str`
  • Total size of snapshot plus log at any moment is at most `10^6` bytes

Examples

Input: (3, [('put', 'a', '1'), ('put', 'b', '2'), ('restart',), ('get', 'a'), ('get', 'b'), ('status',)])

Expected Output: ['1', '2', (2, 2, 0)]

Explanation: With threshold 3, two mutations stay only in the log; restart must replay them correctly.

Input: (2, [('put', 'x', '1'), ('status',), ('put', 'y', '2'), ('status',), ('restart',), ('get', 'x'), ('get', 'y'), ('status',)])

Expected Output: [(1, 1, 0), (2, 0, 1), '1', '2', (2, 0, 1)]

Explanation: The second mutation reaches the threshold, so the log is compacted into a snapshot and cleared.

Input: (10, [('put', 'a', '1'), ('put', 'b', '2'), ('delete', 'a'), ('restart',), ('get', 'a'), ('get', 'b'), ('status',)])

Expected Output: [None, '2', (1, 3, 0)]

Explanation: Replay must apply the delete after the earlier put, leaving only `'b'`.

Input: (1, [('restart',), ('get', 'missing'), ('put', b'k', b'v'), ('status',), ('delete', b'k'), ('restart',), ('get', b'k'), ('status',)])

Expected Output: [None, (1, 0, 1), None, (0, 0, 2)]

Explanation: Edge case: threshold 1 means every mutation compacts immediately.

Hints

  1. Keep snapshot encoding and log encoding separate. The snapshot stores full state, while the log stores only mutations.
  2. On `restart`, rebuild by loading the snapshot first and then replaying each log record in order. After each new mutation, check whether it is time to compact.
Last updated: May 6, 2026

Loading coding console...

PracHub

Master your tech interviews with 7,500+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.

Related Coding Questions

  • Implement IP Address Arithmetic - OpenAI (hard)
  • Simulate Infection Spread on a Grid - OpenAI (hard)
  • Implement Social Follow Recommendations - OpenAI (medium)
  • Build a Compose Rating Card - OpenAI (medium)
  • Generate Data Labeling Schedules - OpenAI (medium)