How do I practice coding and algorithm questions?

Use PracHub's coding console to write, test, and debug your solutions in Python or JavaScript. View hints, test against sample inputs, and compare with official solutions.

What difficulty level is this coding question?

This is a hard difficulty Coding & Algorithms question, commonly asked during Technical Screen rounds at Databricks.

What role is this question designed for?

This question is commonly asked for Software Engineer candidates at Databricks during technical interviews.

Optimize a SQL query plan tree | Databricks Coding Question

Optimize a SQL query plan tree

Company: Databricks

Role: Software Engineer

Category: Coding & Algorithms

Difficulty: hard

Interview Round: Technical Screen

## Problem You are given an in-memory representation of a SQL query plan as a tree of nodes (a relational algebra plan). Each node is one of: - `Scan(table)` - `Filter(predicate, child)` - `Project(columns, child)` - `Join(join_type, condition, left, right)` Assume predicates are boolean expressions over columns, and `Join` condition is an equi-join or conjunction of expressions. ### Task Implement an **optimizer** function that takes the root of a query plan tree and returns a **semantically equivalent** plan that is “better” according to simple heuristics. You should implement at least two classic rule-based optimizations, such as: - **Predicate pushdown**: move `Filter` nodes as close to `Scan` as possible; split conjunctions and push parts down each side of a `Join` when the predicate only references columns from that side. - **Projection pruning**: remove unused columns early by pushing `Project` down and trimming `Scan` outputs. ### Requirements - The optimized plan must preserve query semantics. - The output plan should be a valid tree in the same node format. - Your code should handle nested combinations of `Filter`, `Project`, and `Join`. ### Notes - Assume schemas for tables are known (columns per table). - Predicates may reference columns from one side of a join or both sides. - Keep the problem language-agnostic; provide working logic rather than just describing rules.

Quick Answer: This question evaluates understanding of relational algebra, query optimization techniques, and tree-based plan transformations such as predicate pushdown and projection pruning.

You are given an in-memory SQL query plan represented as a tree of tuples. Implement `solution(plan)` to return a semantically equivalent but optimized plan. Use this exact node format: - `('Scan', table_name, columns)` - `('Filter', predicate, child)` - `('Project', columns, child)` - `('Join', join_type, condition, left, right)` All column names are fully qualified strings like `'employees.id'`. For this problem, expressions are simplified: - Predicates and join conditions are conjunctions of atomic expressions separated by `' AND '`. - No `OR`, no parentheses, and no aliases. - In the tests, joins are `inner` joins. Your optimizer must apply these deterministic rules: 1. Split every `Filter` predicate by `' AND '`. 2. For an `inner` join, push any atomic predicate that references only left-side columns into the left subtree, and similarly for the right side. 3. Any predicate above an `inner` join that references both sides should be merged into the join condition. 4. Atomic expressions already inside a join condition may also be pushed down if they reference only one side. 5. Prune columns aggressively: each subtree should keep only the columns needed for final output, pushed filters, and join conditions. This may shrink `Scan` column lists and insert `Project` nodes when temporary columns are needed internally. 6. Remove redundant `Filter` and `Project` nodes when they no longer change the plan. Scans in the input include the full set of columns currently available from that table. Your output must use the same tuple-based node format.

Constraints

1 <= number of plan nodes <= 200
All column references are fully qualified as `table.column`
Predicates and join conditions are conjunctions separated only by the exact string ` AND `
Test cases use `inner` joins

Examples

Input: ('Project', ['employees.name'], ('Filter', 'employees.salary > 100000 AND departments.region_id = 1', ('Join', 'inner', 'employees.dept_id = departments.id', ('Scan', 'employees', ['employees.id', 'employees.name', 'employees.dept_id', 'employees.salary']), ('Scan', 'departments', ['departments.id', 'departments.name', 'departments.region_id']))))

Expected Output: ('Project', ['employees.name'], ('Join', 'inner', 'employees.dept_id = departments.id', ('Filter', 'employees.salary > 100000', ('Scan', 'employees', ['employees.name', 'employees.dept_id', 'employees.salary'])), ('Filter', 'departments.region_id = 1', ('Scan', 'departments', ['departments.id', 'departments.region_id']))))

Explanation: The salary predicate is pushed to the employees side, the region predicate to the departments side, both scans are trimmed, and a final project keeps only the requested output column.

Input: ('Project', ['orders.id'], ('Filter', 'orders.amount > 50', ('Scan', 'orders', ['orders.id', 'orders.customer_id', 'orders.amount'])))

Expected Output: ('Project', ['orders.id'], ('Filter', 'orders.amount > 50', ('Scan', 'orders', ['orders.id', 'orders.amount'])))

Explanation: The scan keeps only `orders.id` for output and `orders.amount` for the filter. `orders.customer_id` is pruned away.

Input: ('Project', ['a.id', 'c.flag'], ('Filter', 'a.value > 10 AND c.flag = 1 AND a.id = c.id', ('Join', 'inner', 'b.c_id = c.id', ('Join', 'inner', 'a.b_id = b.id', ('Scan', 'a', ['a.id', 'a.b_id', 'a.value', 'a.extra']), ('Scan', 'b', ['b.id', 'b.c_id', 'b.payload'])), ('Scan', 'c', ['c.id', 'c.flag', 'c.note']))))

Expected Output: ('Project', ['a.id', 'c.flag'], ('Join', 'inner', 'b.c_id = c.id AND a.id = c.id', ('Project', ['a.id', 'a.value', 'b.c_id'], ('Join', 'inner', 'a.b_id = b.id', ('Filter', 'a.value > 10', ('Scan', 'a', ['a.id', 'a.b_id', 'a.value'])), ('Scan', 'b', ['b.id', 'b.c_id']))), ('Filter', 'c.flag = 1', ('Scan', 'c', ['c.id', 'c.flag']))))

Explanation: Left-only and right-only predicates are pushed to the leaves. The cross-side predicate `a.id = c.id` is merged into the outer join condition. Extra columns needed only for internal joins are removed by inserted projects.

Input: ('Scan', 't', ['t.a', 't.b'])

Expected Output: ('Scan', 't', ['t.a', 't.b'])

Explanation: Edge case: a single scan with nothing to optimize should remain unchanged.

Input: ('Join', 'inner', 'x.a = y.a AND x.b > 0', ('Scan', 'x', ['x.a', 'x.b', 'x.c']), ('Scan', 'y', ['y.a', 'y.d']))

Expected Output: ('Join', 'inner', 'x.a = y.a', ('Filter', 'x.b > 0', ('Scan', 'x', ['x.a', 'x.b', 'x.c'])), ('Scan', 'y', ['y.a', 'y.d']))

Explanation: The single-side condition `x.b > 0` is pushed out of the join condition and applied directly on the left input.

Hints

Think top-down: every subtree only needs to keep the columns required by its parent, plus any columns needed to evaluate filters or join conditions.
Split conjunctions into atomic predicates, then compare the referenced columns with the columns available on the left and right side of a join.

Quick Overview

This question evaluates understanding of relational algebra, query optimization techniques, and tree-based plan transformations such as predicate pushdown and projection pruning.