Design and Implement a Mini SQL Query Engine
Company: Retell
Role: Software Engineer
Category: System Design
Difficulty: hard
Interview Round: Technical Screen
You're given a set of in-memory tables — each table is a collection of rows over a known, typed schema (e.g., a table is a list of row objects, and each column has a type such as integer, string, or null). Design and implement a small **SQL query engine**: it takes a SQL query string, executes it against the in-memory tables, and returns the result set.
Start from a basic subset of SQL and be ready to extend it as the interviewer adds features. You cannot implement all of SQL in an interview window, so an explicit part of this exercise is **scoping**: decide which subset to support first, state your boundaries out loud, confirm them with the interviewer, and design so that new clauses slot in cleanly rather than forcing a rewrite.
### Constraints & Assumptions
- Tables are fully in memory (e.g., `tables[name] = [ {col: value, ...}, ... ]`) with a known schema (column names + types). Data fits in memory.
- Read-only queries (`SELECT`); no transactions, no persistence, no writes in the base version.
- **v1 target:** `SELECT <columns> FROM <table> WHERE <predicate>`, where the predicate supports comparison operators (`=`, `!=`, `<`, `<=`, `>`, `>=`) combined with `AND` / `OR`.
- **Extensions to be ready for:** `ORDER BY` + `LIMIT`, aggregate functions (`COUNT`, `SUM`, `AVG`, `MIN`, `MAX`) with `GROUP BY`, and `INNER JOIN` of two tables.
- Keywords may be treated case-insensitively; assume a simplified, well-formed grammar unless validation is explicitly requested.
### Clarifying Questions to Ask
- Which subset is in scope for v1, and how far do we extend — joins, grouping, subqueries? Where should I stop?
- What column types exist, and how are `NULL`s handled in comparisons and in aggregates (SQL three-valued logic, or a simplified rule)?
- Should I validate the query against the schema (error on unknown table/column), or assume valid input?
- Is the input always a single statement? Can I assume a simplified grammar (no nested expressions in the select list to start)?
- What is the expected output shape — an ordered list of rows with a defined column order — and how are `ORDER BY` ties broken?
### Part 1 — Architecture: how the engine is structured
Before writing code, lay out the engine's stages and the boundaries between them. What are the components, what does each one consume and produce, and where would future features (JOIN, GROUP BY) plug in?
```hint Pipeline
Model it as a classic query pipeline: **tokenizer (lexer) → parser → AST / structured query → executor** over the in-memory tables. Each stage has a narrow contract, so adding a clause means extending the grammar and adding one execution step — not rewriting everything.
```
```hint Make it extensible
Represent execution as **composable operators** — `Scan`, `Filter`, `Project`, `Join`, `Aggregate`, `Sort`, `Limit` — assembled into a tree (the "volcano"/iterator model). New SQL features become new operators, which is exactly how you absorb the interviewer's escalating requirements.
```
#### What This Part Should Cover
```premium-lock What This Part Should Cover
```
### Part 2 — Implement the core: SELECT … FROM … WHERE
Implement the base engine: parse and execute `SELECT col1, col2 FROM table WHERE <predicate>` over the in-memory tables, supporting the comparison operators and `AND` / `OR` in the predicate. Show both the parsing and the execution.
```hint Parsing
A hand-written **recursive-descent parser** is plenty. Parse the clause keywords (`SELECT` / `FROM` / `WHERE`) into a structured query object, and parse the `WHERE` expression into a small **predicate tree**: comparison leaves, `AND`/`OR` internal nodes. Mind operator precedence — `AND` binds tighter than `OR`.
```
```hint Execution
Scan the table's rows, evaluate the predicate tree against each row to filter, then project the requested columns. Keep predicate evaluation a single recursive function over the expression tree.
```
#### What This Part Should Cover
```premium-lock What This Part Should Cover
```
### Part 3 — Extend: ORDER BY, aggregation/GROUP BY, and JOIN
The interviewer keeps adding features. Extend the engine to support `ORDER BY` + `LIMIT`, aggregate functions with `GROUP BY`, and an `INNER JOIN` of two tables. Explain how each maps onto your operator model, and where you would set boundaries given limited time.
```hint Map clauses to operators
`ORDER BY` → a `Sort` operator on the projected rows; `LIMIT` → a `Limit` operator; `GROUP BY` + aggregates → an `Aggregate` operator that buckets rows by the group key and folds each bucket; `INNER JOIN` → a `Join` operator (nested-loop to start; hash-join if asked for performance).
```
```hint Ordering of operations
Respect SQL's **logical evaluation order**: `FROM`/`JOIN` → `WHERE` → `GROUP BY` → aggregate → (`HAVING`) → `SELECT`/project → `ORDER BY` → `LIMIT`. Getting this order right is what makes `GROUP BY` combined with `ORDER BY` behave correctly.
```
#### What This Part Should Cover
```premium-lock What This Part Should Cover
```
### What a Strong Answer Covers
```premium-lock What a Strong Answer Covers
```
### Follow-up Questions
- How would you support a `JOIN` of three or more tables, and how would join order affect performance?
- How would you add `HAVING` and subqueries (e.g., `WHERE col IN (SELECT ...)`) to your AST and executor?
- Where would a query optimizer fit, and what is one rewrite (e.g., predicate pushdown) that would help?
- How would you test this engine so you stay confident it's correct as you keep adding features?
Quick Answer: This question evaluates a candidate's ability to design and build a small SQL query engine, covering tokenization, parsing, and query execution over in-memory tables. It tests system design and compiler-style thinking, including how to scope a large problem into an extensible v1 and reason about extending it with sorting, aggregation, and joins. This is a system design and practical implementation question assessing both architectural judgment and coding ability.