In-Memory Databases And Query Processing

What's being tested

Interviewers are probing whether you can design and implement the core of an in-memory database engine: data representation, indexing, query execution, concurrency, and durability tradeoffs. The strongest answers combine practical API design with low-level data-structure reasoning: how insert, get, filter, sort, project, and delete behave under realistic workloads. OpenAI cares because many backend systems need fast, correct state management under high concurrency, whether for serving metadata, caching computed results, routing requests, or powering internal tools. You are expected to reason like a Software Engineer: clear abstractions, complexity analysis, correctness under edge cases, and incremental design from simple implementation to production-ready system.

Core knowledge

Data model comes first: decide whether the system is a key-value store, row-oriented table, or SQL-like relational table. A key-value API may expose put(key, value), get(key), delete(key), while a table API needs schemas, columns, row IDs, predicates, projections, and ordering.
Row store vs column store is a core tradeoff. A row-oriented layout stores each record together, which is efficient for point lookups and updates. A column-oriented layout stores each field separately, improving scans, projections, and vectorized filtering, as seen in systems like DuckDB.
Primary storage is often a HashMap<PrimaryKey, Row> for point reads with expected $O(1)$ lookup. For ordered range queries, use a balanced tree, B-tree, red-black tree, or skip list, giving $O(\log n)$ lookup and efficient ordered iteration.
Secondary indexes map column values to row IDs: Map<ColumnValue, Set<RowId>> for equality predicates, or TreeMap<ColumnValue, Set<RowId>> for range predicates. Inserts and updates become more expensive because every affected index must be maintained, usually $O(k)$ to $O(k \log n)$ for $k$ indexed columns.
Query execution can be modeled as a pipeline: scan, filter, project, sort, limit, and aggregate. Without indexes, predicate filtering is $O(n)$ . With a selective index, lookup may reduce work to $O(\log n + m)$ where $m$ is matching rows.
Predicate pushdown means applying filters before projection, sorting, or joining to reduce intermediate data size. For example, execute WHERE age > 30 before ORDER BY name, because sorting all rows costs $O(n \log n)$ while sorting filtered rows costs $O(m \log m)$ .
Ordering and tie-breaking must be deterministic. If a query asks for lexicographic sorting by last_name, then first_name, and then insertion order, your comparator must encode all three keys. Stable sorting can preserve insertion order, but explicit tie-breakers are safer.
Query planning does not need a full Postgres optimizer in an interview, but you should describe simple heuristics: use an index if a predicate targets an indexed column, choose the most selective predicate first, intersect row-ID sets for AND, union sets for OR, and fall back to table scan.
Concurrency control depends on consistency requirements. A simple implementation may use a single read-write lock: concurrent reads allowed, writes exclusive. Higher performance designs use table-level locks, row-level locks, or MVCC where readers see immutable snapshots while writers create new versions.
Transactions require defining guarantees: atomicity, isolation, and durability. For interview scope, you can discuss single-key atomic operations, then multi-operation transactions using a transaction log, per-row locks, or optimistic concurrency with version checks. Full serializability is harder than many candidates imply.
Durability is optional for a pure in-memory system but commonly added via write-ahead logging and snapshots. A Redis-like design might append commands to an AOF log and periodically create snapshots. Recovery replays the log after loading the latest snapshot.
Memory management matters because everything lives in RAM. Estimate footprint as approximately rows * (row overhead + field sizes + index overhead). Indexes can double or triple memory usage. For tens of millions of rows, object-heavy layouts in languages like Java or Python may become the bottleneck.

Worked example

For “Design an in-memory database”, a strong candidate starts by clarifying scope: “Is this single-node or distributed? Do we need SQL, or only key-value operations? Are queries mostly point lookups, range scans, or ad hoc filters? Is durability required after process crash?” Then they declare assumptions: single-node, memory-resident table storage, simple SQL-like predicates, optional durability, and concurrent clients.

The answer should be organized around four pillars. First, define the API: createTable, insert, update, delete, query, and perhaps beginTransaction if transactions are required. Second, describe storage: a table has a schema, an append-only row array or row-ID map, and primary-key lookup through a hash index. Third, explain query execution: parse a query object, choose an index if available, evaluate predicates, project columns, then sort and limit. Fourth, address correctness properties: locking, transaction boundaries, recovery, and deterministic ordering.

A specific tradeoff to flag is hash index vs ordered index. A hash index is excellent for WHERE id = ?, but cannot efficiently answer WHERE timestamp BETWEEN a AND b ORDER BY timestamp; a TreeMap or skip list supports range scans but costs more per write. You can propose starting with hash indexes and adding ordered secondary indexes when range queries become a requirement.

A good close is: “If I had more time, I’d add a simple cost-based planner using index cardinality statistics, plus snapshot isolation for readers and a write-ahead log for crash recovery.” That shows you know the difference between a toy implementation and a production system without overengineering the first version.

A second angle

For “Implement in-memory DB querying”, the focus shifts from broad architecture to execution mechanics and clean code. Instead of talking first about durability or replication, start with an internal representation for rows, predicates, projections, and ordering clauses. The main challenge is composing operations correctly: filter before sort, project after filtering, apply limit last unless the ordering allows a top- $k$ optimization. A strong solution mentions complexity for each path: full scan is $O(n)$ , sorting is $O(m \log m)$ , and indexed equality lookup is $O(1 + m)$ expected time. The same database principles apply, but the interviewer is likely evaluating implementation clarity, edge cases, and whether your abstractions can evolve into a real query engine.

Common pitfalls

Pitfall: Treating an in-memory database like a plain HashMap.

A HashMap is a good starting point for key-value reads, but it does not solve filtering, range queries, ordering, projections, schema validation, or transactions. A better answer starts simple, then explicitly layers indexes, query operators, and concurrency control as requirements expand.

Pitfall: Jumping to distributed systems too early.

Many candidates immediately discuss sharding, replication, and consensus before defining the single-node API or storage layout. That makes the answer feel unfocused. Nail the local engine first: row representation, indexes, query path, locking, and durability. Only then discuss partitioning if the interviewer asks for scale beyond one machine.

Pitfall: Hand-waving consistency with “use locks.”

“Use locks” is incomplete unless you state lock granularity and the behavior readers observe during writes. Say whether you use a global mutex, read-write lock, row locks, or MVCC, and describe the tradeoff: simplicity and safety versus throughput and implementation complexity.

Connections

Interviewers may pivot from this topic into caching, especially Redis-style TTLs, eviction policies, and cache invalidation. They may also move toward database internals, including B-trees, query optimizers, MVCC, and write-ahead logging, or toward distributed storage topics like sharding, leader replication, and consistency models.

What's being tested

Core knowledge

Worked example

A second angle

Common pitfalls

Connections

Further reading

Practice questions

Related concepts