Optimize a SQL query plan tree
Company: Databricks
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: hard
Interview Round: Technical Screen
Quick Answer: This question evaluates understanding of relational algebra, query optimization techniques, and tree-based plan transformations such as predicate pushdown and projection pruning.
Constraints
- 1 <= number of plan nodes <= 200
- All column references are fully qualified as `table.column`
- Predicates and join conditions are conjunctions separated only by the exact string ` AND `
- Test cases use `inner` joins
Examples
Input: ('Project', ['employees.name'], ('Filter', 'employees.salary > 100000 AND departments.region_id = 1', ('Join', 'inner', 'employees.dept_id = departments.id', ('Scan', 'employees', ['employees.id', 'employees.name', 'employees.dept_id', 'employees.salary']), ('Scan', 'departments', ['departments.id', 'departments.name', 'departments.region_id']))))
Expected Output: ('Project', ['employees.name'], ('Join', 'inner', 'employees.dept_id = departments.id', ('Filter', 'employees.salary > 100000', ('Scan', 'employees', ['employees.name', 'employees.dept_id', 'employees.salary'])), ('Filter', 'departments.region_id = 1', ('Scan', 'departments', ['departments.id', 'departments.region_id']))))
Explanation: The salary predicate is pushed to the employees side, the region predicate to the departments side, both scans are trimmed, and a final project keeps only the requested output column.
Input: ('Project', ['orders.id'], ('Filter', 'orders.amount > 50', ('Scan', 'orders', ['orders.id', 'orders.customer_id', 'orders.amount'])))
Expected Output: ('Project', ['orders.id'], ('Filter', 'orders.amount > 50', ('Scan', 'orders', ['orders.id', 'orders.amount'])))
Explanation: The scan keeps only `orders.id` for output and `orders.amount` for the filter. `orders.customer_id` is pruned away.
Input: ('Project', ['a.id', 'c.flag'], ('Filter', 'a.value > 10 AND c.flag = 1 AND a.id = c.id', ('Join', 'inner', 'b.c_id = c.id', ('Join', 'inner', 'a.b_id = b.id', ('Scan', 'a', ['a.id', 'a.b_id', 'a.value', 'a.extra']), ('Scan', 'b', ['b.id', 'b.c_id', 'b.payload'])), ('Scan', 'c', ['c.id', 'c.flag', 'c.note']))))
Expected Output: ('Project', ['a.id', 'c.flag'], ('Join', 'inner', 'b.c_id = c.id AND a.id = c.id', ('Project', ['a.id', 'a.value', 'b.c_id'], ('Join', 'inner', 'a.b_id = b.id', ('Filter', 'a.value > 10', ('Scan', 'a', ['a.id', 'a.b_id', 'a.value'])), ('Scan', 'b', ['b.id', 'b.c_id']))), ('Filter', 'c.flag = 1', ('Scan', 'c', ['c.id', 'c.flag']))))
Explanation: Left-only and right-only predicates are pushed to the leaves. The cross-side predicate `a.id = c.id` is merged into the outer join condition. Extra columns needed only for internal joins are removed by inserted projects.
Input: ('Scan', 't', ['t.a', 't.b'])
Expected Output: ('Scan', 't', ['t.a', 't.b'])
Explanation: Edge case: a single scan with nothing to optimize should remain unchanged.
Input: ('Join', 'inner', 'x.a = y.a AND x.b > 0', ('Scan', 'x', ['x.a', 'x.b', 'x.c']), ('Scan', 'y', ['y.a', 'y.d']))
Expected Output: ('Join', 'inner', 'x.a = y.a', ('Filter', 'x.b > 0', ('Scan', 'x', ['x.a', 'x.b', 'x.c'])), ('Scan', 'y', ['y.a', 'y.d']))
Explanation: The single-side condition `x.b > 0` is pushed out of the join condition and applied directly on the left input.
Hints
- Think top-down: every subtree only needs to keep the columns required by its parent, plus any columns needed to evaluate filters or join conditions.
- Split conjunctions into atomic predicates, then compare the referenced columns with the columns available on the left and right side of a join.