Build a File-Backed Key-Value Store
Company: OpenAI
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: medium
Interview Round: Technical Screen
Quick Answer: This question evaluates a candidate's understanding of persistent key-value storage, in-memory versus on-disk state management, serialization/deserialization, file I/O, and state reconstruction across multiple files, with additional considerations for concurrency and crash recovery.
Part 1: Single-File Persistent Key-Value Store
Constraints
- 0 <= total number of operations across both phases <= 10^4
- Keys and values are strings with length from 0 to 100
- Operations are valid and use one of: put, get, delete
- Final output must reflect the state after serialize -> restart -> deserialize
Examples
Input: ([('put', 'a', '1'), ('put', 'b', '2'), ('get', 'a')], [('get', 'b'), ('delete', 'a'), ('get', 'a')])
Expected Output: (['1', '2', None], [('b', '2')])
Explanation: The values for 'a' and 'b' survive the restart. After deleting 'a', a later get returns None.
Input: ([('put', '', 'x'), ('put', 'k', 'old'), ('put', 'k', 'new')], [('get', ''), ('get', 'k')])
Expected Output: (['x', 'new'], [('', 'x'), ('k', 'new')])
Explanation: Tests empty-string keys and overwriting an existing key before serialization.
Input: ([], [('get', 'missing'), ('delete', 'missing')])
Expected Output: ([None], [])
Explanation: Edge case: no data is stored before restart, and missing keys should return None.
Input: ([('put', 'x', '1'), ('delete', 'x')], [('get', 'x')])
Expected Output: ([None], [])
Explanation: A key deleted before serialization should not reappear after deserialization.
Hints
- Use a dictionary for the in-memory store so put, get, and delete are efficient.
- For persistence, any reversible representation works; serializing the entire dictionary to JSON is enough.
Part 2: Multi-File Persistent Key-Value Store with File Size Limits
Constraints
- 1 <= max_file_size <= 10^5
- 0 <= total number of operations across both phases <= 10^4
- Keys and values do not contain '|' or newline characters
- Every individual log record fits within max_file_size
- Replay order is exactly from the first file to the last file, and from top to bottom within each file
Examples
Input: (10, [('put', 'a', '1'), ('put', 'bb', '2'), ('get', 'a')], [('get', 'bb'), ('delete', 'a'), ('get', 'a'), ('put', 'c', '3')])
Expected Output: (['1', '2', None], ['P|a|1\n', 'P|bb|2\n', 'D|a\nP|c|3\n'], [('bb', '2'), ('c', '3')])
Explanation: The second put does not fit in the first file, so a new file is created. After restart, recovery replays both files before new operations continue.
Input: (8, [('put', 'aa', 'b'), ('put', 'x', 'y')], [('get', 'aa'), ('put', 'aa', 'c'), ('get', 'aa')])
Expected Output: (['b', 'c'], ['P|aa|b\n', 'P|x|y\n', 'P|aa|c\n'], [('aa', 'c'), ('x', 'y')])
Explanation: Overwriting a key after restart adds a new log record in a later file, and replaying all files yields the newest value.
Input: (10, [('put', 'a', '1'), ('delete', 'a')], [('get', 'a')])
Expected Output: ([None], ['P|a|1\nD|a\n'], [])
Explanation: Edge case: the delete record fits exactly into the remaining space of the first file, and recovery correctly removes the key.
Input: (12, [('put', 'a', '1')], [('put', 'b', '2'), ('get', 'a'), ('get', 'b')])
Expected Output: (['1', '2'], ['P|a|1\nP|b|2\n'], [('a', '1'), ('b', '2')])
Explanation: After restart, new records should keep using the last file if there is still space.
Input: (5, [], [('get', 'z')])
Expected Output: ([None], [], [])
Explanation: Edge case: no persisted records at all.
Hints
- Think of the disk as an append-only log: only put and delete need to be persisted.
- During recovery, replay every record from oldest to newest; later operations override earlier ones.