Assess SQL joins, unions, windows, dedup, and pandas

Q: Assess SQL joins, unions, windows, dedup, and pandas

This is a Data Manipulation (SQL/Python) interview question from Fannie Mae for Data Scientist roles. View the full question and solution on PracHub.

Q: How do I approach Data Manipulation (SQL/Python) interview questions?

Data Manipulation (SQL/Python) questions require understanding of core concepts and practice. PracHub provides solutions with explanations to help you master data manipulation (sql/python) interviews.

Question

Using the schema and sample data below, answer all sub-questions. Unless stated, assume ANSI SQL and explain any dialect-specific choices.

Schema:

customers(customer_id PK, name)
orders(order_id PK, customer_id FK->customers, order_date DATE, amount DECIMAL(10,2))
order_items(item_id PK, order_id FK->orders, product_id FK->products, qty INT, price DECIMAL(10,2))
products(product_id PK, name, category)
web_orders(order_id, customer_id, amount)
store_orders(order_id, customer_id, amount)
employees(emp_id PK, name, salary INT)

Sample tables (tiny): customers

customer_id	name
1	Alice
2	Bob
3	Carol

orders

order_id	customer_id	order_date	amount
101	1	2025-08-28	120.00
102	1	2025-08-30	80.00
103	2	2025-08-29	50.00
105	3	2025-08-30	NULL

order_items

item_id	order_id	product_id	qty	price
1	101	10	1	120.00
2	102	11	2	40.00
3	103	12	1	50.00

products

product_id	name	category
10	Phone	Electronics
11	Cable	Electronics
12	Book	Media

web_orders

order_id	customer_id	amount
201	1	50.00
202	1	50.00
202	1	50.00

store_orders

order_id	customer_id	amount
301	1	50.00
202	1	50.00
302	2	50.00

employees

emp_id	name	salary
1	Alice	100
2	Bob	200
3	Carol	300
4	Dave	300

Tasks:

Aggregations and NULLs: Write queries that compute per-customer: (a) COUNT( ), COUNT(amount), SUM(amount), AVG(amount), MIN/MAX(order_date). Show the result difference caused by orders.amount being NULL for customer_id=3. Explain COUNT(column) vs COUNT( ), and how AVG ignores NULLs. Also compute total revenue from order_items using SUM(qty*price) and reconcile it to SUM(orders.amount); explain any mismatch and how to detect inconsistencies.
Joins: (a) INNER JOIN customers↔orders to list customers who placed orders. (b) LEFT JOIN customers↔orders to include customers with zero orders; return 0 as totals for those. (c) FULL OUTER JOIN customers↔orders and explain which rows appear only on one side. (d) CROSS JOIN products↔(SELECT DISTINCT category) to illustrate Cartesian output and how to limit it. Provide queries and one or two expected result rows to prove understanding.
UNION vs UNION ALL: Combine web_orders and store_orders into a single set of (order_id, customer_id, amount). (a) Using UNION ALL, count total rows and the number of duplicate order_ids. (b) Using UNION (distinct), show deduplicated rows. (c) Return only the order_ids that appear in both sources without using INTERSECT.
Window functions: (a) For each customer, rank their orders by amount DESC and return the top 1 per customer using ROW_NUMBER. (b) Compute a running total of amount per customer ordered by order_date; show partition boundaries. (c) For August 2025, compute each customer’s rank by monthly spend using SUM(amount) OVER(PARTITION BY customer_id) and then DENSE_RANK across customers. Explain why a window does not collapse rows like GROUP BY.
View vs table: Create a view top_spenders_2025_08(customer_id, total_amount) over orders. Is it updatable? Under what conditions? What happens if the base table adds a NOT NULL column or if the view definition references a column later dropped? Discuss pros/cons of views vs materialized tables for this case and how to refresh a materialized view (name your chosen RDBMS).
Hide duplicated rows (select-time): Return one row per distinct (order_id, customer_id, amount) from web_orders without deleting data. Show two approaches: DISTINCT and ROW_NUMBER() filtering; discuss performance and which preserves deterministic choice.
Remove duplicated rows (data-change): In web_orders, delete true duplicates keeping the lowest ROW_NUMBER by (order_id, customer_id, amount) and explain how you’d do this safely in a transaction with a reproducible tie-breaker. Provide the DELETE…CTE you would run.
LC #177 (Nth highest salary): Using employees(name, salary), write a query that returns the Nth highest distinct salary given a parameter :n. Show outputs for n=1,2,3 on the sample data and explain behavior when n exceeds the number of distinct salaries.
Improve SQL efficiency: Propose an indexing strategy for the above workloads (consider orders(customer_id, order_date), order_items(order_id), and covering indexes). Show how to rewrite one query to be sargable (e.g., avoid functions on indexed columns), and explain when EXISTS outperforms IN and when a window function can replace a self-join. Describe how you’d verify improvements using an execution plan and timing.
Python pandas merge vs join vs concat: Given DataFrames below, specify the exact pandas calls (with how=, on=, axis=, ignore_index=, validate=) to: (a) left-join spend onto cities; (b) inner-join both then right-join to keep ids only in df_b; (c) stack df_a and a new rows frame vertically; (d) align on index instead of column; (e) explain when concat vs merge is appropriate.

DataFrames: df_a

id	city
1	SF
2	NYC
3	LA

df_b

id	spend
2	80
3	20
4	10

df_c

id	tag
3	vip

Assess SQL joins, unions, windows, dedup, and pandas

Comments (0)