Implement a persistent sharded key-value store

Company: OpenAI

Role: Software Engineer

Category: Coding & Algorithms

Difficulty: hard

Interview Round: Technical Screen

## Problem

Implement a simple **key–value store** that persists data on disk. You must store the data in **fixed-size shards**, where each shard is saved in **one file**. If a shard file reaches its maximum size, new writes must go into a **new shard file**. The store must support **shutdown (persist)** and **restore (recover)**.

You are given helper functions:

- `encode(key, value) -> bytes` to serialize a key/value record
- `decode(bytes) -> (key, value)` to deserialize a record

Assume keys and values are strings (or byte arrays), and `encode/decode` are inverses for valid records.

## Required API

Design and implement (language-agnostic) functions/methods equivalent to:

- `put(key, value)`
  - Insert or overwrite the value for `key`.
- `get(key) -> value | null`
  - Return the current value for `key`, or `null`/`None` if missing.
- `delete(key)` (optional if you want to support removals)
  - Remove `key` if it exists.
- `shutdown()`
  - Ensure all in-memory state is persisted so the store can be restored later.
- `restore(directory_path)` (or constructor-based restore)
  - Load persisted data from disk and make the store usable again.

## Storage requirements

- Data is stored across **one or more files** in a directory.
- Each file is a **shard** with a configurable maximum size `SHARD_SIZE_BYTES`.
- Writes append records to the current shard until it would exceed the shard size; then create a new shard file.

## What to discuss / clarify

- Your on-disk layout (file naming, record boundaries, handling partial/corrupt tail).
- How you locate the latest value for a key after many overwrites.
- What metadata or in-memory index (if any) you maintain.
- Complexity of `put/get` and `restore`.
- Edge cases: empty store, large values, shard rollover, overwrite semantics, optional delete semantics.
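One possible file-backed layout for the discussion points above is sketched here. It is a minimal sketch under stated assumptions, not a reference answer: the `shard-NNNNNN.log` naming, the 4-byte length prefix used to mark record boundaries, and the `ShardedStore` class are all hypothetical, and `encode`/`decode` are stand-ins for the helpers the problem says are provided. Delete support is left out for brevity.

```python
import os
import struct

HEADER = struct.Struct(">I")  # assumed framing: 4-byte big-endian length


def encode(key, value):
    """Hypothetical stand-in for the provided encode helper."""
    k = key.encode("utf-8")
    return HEADER.pack(len(k)) + k + value.encode("utf-8")


def decode(blob):
    """Hypothetical stand-in for the provided decode helper."""
    (klen,) = HEADER.unpack_from(blob)
    start = HEADER.size
    key = blob[start:start + klen].decode("utf-8")
    return key, blob[start + klen:].decode("utf-8")


class ShardedStore:
    def __init__(self, directory, shard_size_bytes):
        self.directory = directory
        self.shard_size_bytes = shard_size_bytes
        self.index = {}           # key -> latest value
        self.shard_no = 0         # number of the newest shard file
        self.current_size = None  # None means no shard is open for writes
        os.makedirs(directory, exist_ok=True)
        self.restore()

    def _path(self, n):
        # Zero-padded names sort in creation order (up to 999999 shards).
        return os.path.join(self.directory, f"shard-{n:06d}.log")

    def restore(self):
        """Replay shards oldest first; the latest record for a key wins."""
        self.index.clear()
        names = sorted(f for f in os.listdir(self.directory)
                       if f.startswith("shard-"))
        for name in names:
            with open(os.path.join(self.directory, name), "rb") as fh:
                data = fh.read()
            pos = 0
            while pos + HEADER.size <= len(data):
                (size,) = HEADER.unpack_from(data, pos)
                end = pos + HEADER.size + size
                if end > len(data):
                    break  # partial tail: the frame was never fully written
                key, value = decode(data[pos + HEADER.size:end])
                self.index[key] = value
                pos = end
        self.shard_no = len(names)
        # Design choice: never append behind a possibly torn tail, so the
        # first write after a restore opens a fresh shard.
        self.current_size = None

    def put(self, key, value):
        payload = encode(key, value)
        frame = HEADER.pack(len(payload)) + payload
        if (self.current_size is None
                or self.current_size + len(frame) > self.shard_size_bytes):
            self.shard_no += 1
            self.current_size = 0
        with open(self._path(self.shard_no), "ab") as fh:
            fh.write(frame)
            fh.flush()
            os.fsync(fh.fileno())  # durable before put returns
        self.current_size += len(frame)
        self.index[key] = value

    def get(self, key):
        return self.index.get(key)

    def shutdown(self):
        pass  # every put is already synced to disk in this sketch
```

Because each `put` is flushed and synced before it returns, `shutdown` has nothing left to persist; a design that buffers writes would flush them there instead. A frame larger than `shard_size_bytes` ends up alone in an oversized shard in this sketch, which is one of the edge cases worth raising with the interviewer.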

Quick Answer: This question evaluates understanding of persistent storage and file-based sharding, including on-disk layout, record serialization, indexing strategies for locating the latest value per key, shard rollover, and recovery from partial or corrupt tails.

Implement a simulation of a persistent sharded key-value store. Instead of real files, each shard file is represented by a string.

Supported operations:

- ('put', key, value): append a put record and set the current value
- ('get', key): return the current value or None
- ('delete', key): if the key exists, append a delete record and remove it; if the key is missing, do nothing
- ('shutdown',): included for API completeness; it does nothing in this simulation because every write is already persisted to the shard strings
- ('restore',): discard the in-memory map and rebuild it by replaying the shard contents from first shard to last shard

Encoding rules:

- put record: 'P|key|value;'
- delete record: 'D|key;'

The size of a record is its string length. Keys and values are ASCII, so characters and bytes are the same.

Sharding rule: Append a new record to the last shard unless that would make the shard longer than shard_size. If it would exceed shard_size, create a new shard and write the record there.

Restore rule: Replay shards in creation order. If a shard ends with an incomplete fragment that is not terminated by ';', ignore that trailing fragment.

Return a tuple (get_results, shards), where get_results contains the results of all get operations in order, and shards is the final list of shard contents.
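The restore rule can be illustrated on a shard that ends in an incomplete fragment, a case the operations in this simulation never produce on their own. The following is a minimal sketch; the function name `replay` is an assumption made for illustration and is not part of the required interface.

```python
def replay(shards):
    """Rebuild the key -> value map from shard strings, oldest shard first."""
    state = {}
    for shard in shards:
        # Keep only the text up to and including the last ';'. Anything
        # after it is an unterminated fragment and is ignored.
        complete = shard[: shard.rfind(';') + 1]
        for record in complete.split(';'):
            fields = record.split('|')
            if fields[0] == 'P' and len(fields) == 3:
                state[fields[1]] = fields[2]    # later puts overwrite
            elif fields[0] == 'D' and len(fields) == 2:
                state.pop(fields[1], None)      # tombstone removes the key
    return state


# The second shard ends with the unterminated fragment 'P|c|9'.
print(replay(['P|a|1;P|b|22;', 'P|a|333;D|b;P|c|9']))  # {'a': '333'}
```

Here 'a' ends up with its latest value, 'b' is removed by its delete record, and 'c' never appears because its record has no terminating ';'.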

Constraints

0 <= len(operations) <= 20000
1 <= shard_size <= 10000
Keys and values are non-empty ASCII strings containing only letters and digits, so they never contain '|' or ';'
The encoded length of every single put or delete record is at most shard_size

Examples

Input: (10, [('put', 'a', '1'), ('put', 'b', '22'), ('get', 'a'), ('put', 'a', '333'), ('get', 'a'), ('restore',), ('get', 'b')])

Expected Output: (['1', '333', '22'], ['P|a|1;', 'P|b|22;', 'P|a|333;'])

Explanation: The first put creates the first shard, and each later put would overflow the current shard (6 + 7 = 13 > 10, then 7 + 8 = 15 > 10), so three shards are created. After restore, the latest value of 'a' is '333' and 'b' is still '22'.

Input: (20, [('put', 'ab', 'x'), ('put', 'c', 'yz'), ('delete', 'ab'), ('get', 'ab'), ('restore',), ('get', 'c'), ('get', 'ab')])

Expected Output: ([None, 'yz', None], ['P|ab|x;P|c|yz;D|ab;'])

Explanation: All records fit in one shard. The delete operation writes a tombstone for 'ab', so it is missing both before and after restore.

Input: (10, [('restore',), ('get', 'missing'), ('shutdown',)])

Expected Output: ([None], [])

Explanation: Restoring an empty store leaves it empty, and shutdown does not change anything in this simulation.

Input: (13, [('put', 'a', '1'), ('put', 'b', '22'), ('put', 'c', '3'), ('get', 'c'), ('restore',), ('get', 'a')])

Expected Output: (['3', '1'], ['P|a|1;P|b|22;', 'P|c|3;'])

Explanation: The first two records exactly fill the first shard: 6 + 7 = 13. The third record starts a new shard. Restore rebuilds the same state.
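The size arithmetic in this example can be checked with a few lines. This is a small sketch; `put_size` and `shard_lengths` are hypothetical helper names used only for illustration.

```python
def put_size(key, value):
    # 'P', two '|' separators and the ';' terminator add 4 characters.
    return len(f'P|{key}|{value};')


def shard_lengths(sizes, shard_size):
    """Apply the sharding rule to record sizes and return each shard's length."""
    lengths = []
    for size in sizes:
        if lengths and lengths[-1] + size <= shard_size:
            lengths[-1] += size       # record still fits in the last shard
        else:
            lengths.append(size)      # no shard yet, or it would overflow
    return lengths


sizes = [put_size('a', '1'), put_size('b', '22'), put_size('c', '3')]
print(sizes)                     # [6, 7, 6]
print(shard_lengths(sizes, 13))  # [13, 6]
print(shard_lengths(sizes, 12))  # [6, 7, 6]
```

A record that brings a shard to exactly `shard_size` still fits, because the rule only rejects a write that would make the shard longer than `shard_size`. Lowering the limit by one to 12 pushes every record into its own shard.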

Input: (25, [('delete', 'ghost'), ('put', 'z', '0'), ('restore',), ('get', 'ghost'), ('get', 'z')])

Expected Output: ([None, '0'], ['P|z|0;'])

Explanation: Deleting a missing key is a no-op and does not create a record. Only the put for 'z' is stored.

Solution

def solution(shard_size, operations):
    shards = []
    index = {}
    get_results = []

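    def append_record(record):
        # Start a new shard when there is none yet or when the record would
        # push the last shard past shard_size; otherwise append in place.
        if not shards or len(shards[-1]) + len(record) > shard_size:
            shards.append(record)
        else:
            shards[-1] += record

    def rebuild_index():
        # Replay every shard in creation order; later records win.
        restored = {}
        for shard in shards:
            # split(';') leaves either '' or an unterminated fragment as its
            # last element, so dropping it ignores any incomplete tail.
            for part in shard.split(';')[:-1]:
                if not part:
                    continue
                if part.startswith('P|'):
                    pieces = part.split('|', 2)
                    if len(pieces) == 3:
                        _, key, value = pieces
                        restored[key] = value
                elif part.startswith('D|'):
                    pieces = part.split('|', 1)
                    if len(pieces) == 2:
                        _, key = pieces
                        restored.pop(key, None)
        return restored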

    for op in operations:
        kind = op[0]

        if kind == 'put':
            _, key, value = op
            record = f'P|{key}|{value};'
            append_record(record)
            index[key] = value

        elif kind == 'get':
            _, key = op
            get_results.append(index.get(key))

        elif kind == 'delete':
            _, key = op
            if key in index:
                record = f'D|{key};'
                append_record(record)
                index.pop(key, None)

        elif kind == 'shutdown':
            pass

        elif kind == 'restore':
            index = rebuild_index()

        else:
            raise ValueError('Unknown operation: ' + str(kind))

    return (get_results, shards)

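Time complexity: O(total_bytes_written + total_bytes_scanned_on_restore), assuming an append costs time proportional to the record length. Each get is an O(1) average-case hash lookup, and each restore rescans every shard. Space complexity: O(total_bytes_in_shards + number_of_live_keys).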

Hints

Use an in-memory hash map for the current state so that each get is O(1) on average.
Treat the shard contents as an append-only log. On restore, replay records from the first shard to the last; the latest record for a key wins.
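A variant of the first hint, which the solution above does not use, keeps record locations in the hash map instead of values, so large values stay in the shards and are read only on demand. The sketch below assumes the same 'P|key|value;' and 'D|key;' encoding; `build_offset_index` and `read_value` are hypothetical names.

```python
def build_offset_index(shards):
    """Map each live key to (shard_number, start, end) of its latest put."""
    index = {}
    for shard_no, shard in enumerate(shards):
        pos = 0
        while True:
            end = shard.find(';', pos)
            if end == -1:
                break  # incomplete trailing fragment, ignore it
            fields = shard[pos:end].split('|')
            if fields[0] == 'P' and len(fields) == 3:
                index[fields[1]] = (shard_no, pos, end)  # latest put wins
            elif fields[0] == 'D' and len(fields) == 2:
                index.pop(fields[1], None)
            pos = end + 1
    return index


def read_value(shards, index, key):
    """Look up the record location, then parse the value out of the shard."""
    location = index.get(key)
    if location is None:
        return None
    shard_no, start, end = location
    return shards[shard_no][start:end].split('|', 2)[2]


shards = ['P|a|1;P|b|22;', 'P|a|333;D|b;P|c']
index = build_offset_index(shards)
print(index)                           # {'a': (1, 0, 7)}
print(read_value(shards, index, 'a'))  # 333
print(read_value(shards, index, 'b'))  # None
```

The trade-off is an extra shard read per get in exchange for an index whose size depends on the number of keys rather than on the size of the values.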
