Implement KV store serialization
Company: OpenAI
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: Medium
Interview Round: Technical Screen
Implement serialization and deserialization for an in‑memory key‑value store to and from a contiguous bytes buffer. Keys are UTF‑8 strings; values may be int64, float64, bool, UTF‑8 string, or nested maps. Specify and implement a binary format including a magic header and version, per‑entry type tags, length‑prefixing, endianness, checksums, and forward/backward compatibility. Provide encode(store)->bytes and decode(bytes)->store with O(n) time and minimal extra space. Discuss trade‑offs compared to text formats like JSON or MessagePack.
Quick Answer: This question evaluates skills in binary serialization and deserialization, data encoding and layout, type tagging, endianness, checksumming, versioning for forward/backward compatibility, and algorithmic efficiency and space usage for an in-memory key-value store; it falls under the Coding & Algorithms domain.
Implement a **serializer and deserializer** for an in-memory key-value store using a custom binary format.
## What to implement
Write a single function `solution(operation, data)` that dispatches on the string `operation`:
- `solution('encode', store)` → returns the serialized **`bytes`**.
- `solution('decode', buffer)` → returns the reconstructed **store** (a `dict`).
Any other `operation` value must raise an error.
## The store
`store` is a nested Python `dict` whose **keys are UTF-8 strings** and whose values are one of:
- **int64** — a Python `int` in signed 64-bit range
- **float64** — a Python `float`
- **bool**
- **UTF-8 string**
- another **nested dict** following the same rules
Because `bool` is a subclass of `int` in Python, treat `True`/`False` as **bool**, not as integers.
## Binary layout
Every encoded buffer has three parts:
1. **Header** (9 bytes)
- 4 bytes magic: `b'KVSB'`
- 1 byte version: `1`
- 4 bytes: little-endian **unsigned** body length
2. **Body**
- the root **map payload** (see below)
3. **Footer** (4 bytes)
- 4 bytes: little-endian **unsigned** checksum
**Checksum rule:** the checksum is the unsigned 32-bit sum of **every byte before the checksum field** (i.e. header + body), taken modulo `2^32`.
### Map payload format
- 4 bytes: little-endian **unsigned** entry count
- then each entry, emitted in the **dict's existing iteration order** (do **not** sort keys)
### Entry format
- 4 bytes: little-endian unsigned **key length**
- the key bytes (UTF-8)
- 1 byte: **type tag**
- 4 bytes: little-endian unsigned **value length**
- the value bytes
### Type tags
| Tag | Type | Value payload |
|-----|------|---------------|
| `1` | int64 | exactly 8 bytes, little-endian **signed** |
| `2` | float64 | exactly 8 bytes, little-endian IEEE-754 |
| `3` | bool | exactly 1 byte: `0` or `1` |
| `4` | UTF-8 string | raw UTF-8 bytes (any length) |
| `5` | nested map | another full map payload |
> All multi-byte integers in this format (body length, entry count, key length, value length, checksum) are **little-endian unsigned**, except the int64 value payload (tag `1`), which is little-endian **signed**.
## Decoder validation
The decoder must reject malformed input by raising an error. It must validate:
- the magic bytes and the version (**reject unsupported versions**);
- the total buffer length is exactly `9 + body_length + 4`;
- the checksum matches the recomputed sum;
- every length prefix stays within bounds (no truncated keys, tags, or values);
- keys and string values are valid UTF-8;
- per-type payload sizes (int64/float64 = 8 bytes, bool = 1 byte and value is `0` or `1`);
- the type tag is known (**reject unknown tags**);
- each nested map payload is consumed exactly.
## Constraints
- All keys are strings, all strings are UTF-8, and all integers fit in signed 64-bit range.
- Serialize entries in the dict's existing iteration order; do not sort keys.
- The total serialized size is at most `10^6` bytes, and nesting depth is at most `100`.
- Within a single map, keys are unique.
## Round-trip guarantee
Encoding and decoding are inverses: `solution('decode', solution('encode', store))` reproduces the original store, with each field read back in the same order it was written.
Constraints
- All keys are strings, all strings are UTF-8, and all integers fit in signed 64-bit range.
- Serialize entries in the dict's existing iteration order; do not sort keys.
- The total serialized size is at most 10^6 bytes, and nesting depth is at most 100.
- Within a map, keys are unique.
Examples
Input: ('encode', {})
Expected Output: b'KVSB\x01\x04\x00\x00\x00\x00\x00\x00\x00;\x01\x00\x00'
Explanation: An empty map body is just a 4-byte entry count of 0. The checksum is the sum of the header and body bytes modulo 2^32.
Input: ('decode', b'KVSB\x01\x04\x00\x00\x00\x00\x00\x00\x00;\x01\x00\x00')
Expected Output: {}
Explanation: This is the valid encoding of an empty root map, so decoding returns an empty dict.
Input: ('encode', {'pi': 1.5})
Expected Output: b'KVSB\x01\x17\x00\x00\x00\x01\x00\x00\x00\x02\x00\x00\x00pi\x02\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00\xf8?\x6b\x03\x00\x00'
Explanation: The map has one entry with key 'pi', type tag 2 for float64, and the IEEE-754 little-endian bytes for 1.5.
Input: ('decode', b'KVSB\x01\x24\x00\x00\x00\x01\x00\x00\x00\x01\x00\x00\x00m\x05\x16\x00\x00\x00\x01\x00\x00\x00\x01\x00\x00\x00a\x01\x08\x00\x00\x00\xfe\xff\xff\xff\xff\xff\xff\xff\x48\x0a\x00\x00')
Expected Output: {'m': {'a': -2}}
Explanation: The outer map contains one nested map under key 'm'. The nested map contains key 'a' with the int64 value -2.
Input: ('decode', b'KVSB\x01\x1b\x00\x00\x00\x02\x00\x00\x00\x01\x00\x00\x00t\x03\x01\x00\x00\x00\x01\x01\x00\x00\x00u\x04\x02\x00\x00\x00\xc3\xa9\xb6\x03\x00\x00')
Expected Output: {'t': True, 'u': 'é'}
Explanation: This buffer contains a bool entry and a UTF-8 string entry. The decoder must interpret tag 3 as bool and tag 4 as UTF-8 text.
Hints
- Give every value a type tag and a length prefix. That makes recursive decoding simpler and leaves room for future type tags.
- While decoding, keep one index into the buffer and advance it carefully. For nested maps, decode only within the subrange described by that value's length.