Assess SQL joins, unions, windows, dedup, and pandas

Q: How do I practice SQL interview questions?

PracHub provides an interactive SQL console where you can write and test queries against real database schemas. Get instant feedback and compare your solution with the expected output.

Q: What difficulty level is this coding question?

This is a medium difficulty Data Manipulation (SQL/Python) question, commonly asked during Technical Screen rounds at Fannie Mae.

Q: What role is this question designed for?

This question is commonly asked for Data Scientist candidates at Fannie Mae during technical interviews.

Question

Using the schema and sample data below, answer all sub-questions. Unless stated, assume ANSI SQL and explain any dialect-specific choices.

Schema:
- customers(customer_id PK, name)
- orders(order_id PK, customer_id FK->customers, order_date DATE, amount DECIMAL(10,2))
- order_items(item_id PK, order_id FK->orders, product_id FK->products, qty INT, price DECIMAL(10,2))
- products(product_id PK, name, category)
- web_orders(order_id, customer_id, amount)
- store_orders(order_id, customer_id, amount)
- employees(emp_id PK, name, salary INT)

Sample tables (tiny):
customers
| customer_id | name |
|------------:|-------|
| 1 | Alice |
| 2 | Bob |
| 3 | Carol |

orders
| order_id | customer_id | order_date | amount |
|---------:|------------:|-------------|---------|
| 101 | 1 | 2025-08-28 | 120.00 |
| 102 | 1 | 2025-08-30 | 80.00 |
| 103 | 2 | 2025-08-29 | 50.00 |
| 105 | 3 | 2025-08-30 | NULL |

order_items
| item_id | order_id | product_id | qty | price |
|--------:|---------:|-----------:|----:|--------|
| 1 | 101 | 10 | 1 | 120.00 |
| 2 | 102 | 11 | 2 | 40.00 |
| 3 | 103 | 12 | 1 | 50.00 |

products
| product_id | name | category |
|-----------:|-------|--------------|
| 10 | Phone | Electronics |
| 11 | Cable | Electronics |
| 12 | Book | Media |

web_orders
| order_id | customer_id | amount |
|---------:|------------:|-------:|
| 201 | 1 | 50.00 |
| 202 | 1 | 50.00 |
| 202 | 1 | 50.00 |

store_orders
| order_id | customer_id | amount |
|---------:|------------:|-------:|
| 301 | 1 | 50.00 |
| 202 | 1 | 50.00 |
| 302 | 2 | 50.00 |

employees
| emp_id | name | salary |
|------:|-------|-------:|
| 1 | Alice | 100 |
| 2 | Bob | 200 |
| 3 | Carol | 300 |
| 4 | Dave | 300 |

Tasks:
1) Aggregations and NULLs: Write queries that compute per-customer: (a) COUNT(*), COUNT(amount), SUM(amount), AVG(amount), MIN/MAX(order_date). Show the result difference caused by orders.amount being NULL for customer_id=3. Explain COUNT(column) vs COUNT(*), and how AVG ignores NULLs. Also compute total revenue from order_items using SUM(qty*price) and reconcile it to SUM(orders.amount); explain any mismatch and how to detect inconsistencies.
2) Joins: (a) INNER JOIN customers↔orders to list customers who placed orders. (b) LEFT JOIN customers↔orders to include customers with zero orders; return 0 as totals for those. (c) FULL OUTER JOIN customers↔orders and explain which rows appear only on one side. (d) CROSS JOIN products↔(SELECT DISTINCT category) to illustrate Cartesian output and how to limit it. Provide queries and one or two expected result rows to prove understanding.
3) UNION vs UNION ALL: Combine web_orders and store_orders into a single set of (order_id, customer_id, amount). (a) Using UNION ALL, count total rows and the number of duplicate order_ids. (b) Using UNION (distinct), show deduplicated rows. (c) Return only the order_ids that appear in both sources without using INTERSECT.
4) Window functions: (a) For each customer, rank their orders by amount DESC and return the top 1 per customer using ROW_NUMBER. (b) Compute a running total of amount per customer ordered by order_date; show partition boundaries. (c) For August 2025, compute each customer’s rank by monthly spend using SUM(amount) OVER(PARTITION BY customer_id) and then DENSE_RANK across customers. Explain why a window does not collapse rows like GROUP BY.
5) View vs table: Create a view top_spenders_2025_08(customer_id, total_amount) over orders. Is it updatable? Under what conditions? What happens if the base table adds a NOT NULL column or if the view definition references a column later dropped? Discuss pros/cons of views vs materialized tables for this case and how to refresh a materialized view (name your chosen RDBMS).
6) Hide duplicated rows (select-time): Return one row per distinct (order_id, customer_id, amount) from web_orders without deleting data. Show two approaches: DISTINCT and ROW_NUMBER() filtering; discuss performance and which preserves deterministic choice.
7) Remove duplicated rows (data-change): In web_orders, delete true duplicates keeping the lowest ROW_NUMBER by (order_id, customer_id, amount) and explain how you’d do this safely in a transaction with a reproducible tie-breaker. Provide the DELETE…CTE you would run.
8) LC #177 (Nth highest salary): Using employees(name, salary), write a query that returns the Nth highest distinct salary given a parameter :n. Show outputs for n=1,2,3 on the sample data and explain behavior when n exceeds the number of distinct salaries.
9) Improve SQL efficiency: Propose an indexing strategy for the above workloads (consider orders(customer_id, order_date), order_items(order_id), and covering indexes). Show how to rewrite one query to be sargable (e.g., avoid functions on indexed columns), and explain when EXISTS outperforms IN and when a window function can replace a self-join. Describe how you’d verify improvements using an execution plan and timing.
10) Python pandas merge vs join vs concat: Given DataFrames below, specify the exact pandas calls (with how=, on=, axis=, ignore_index=, validate=) to: (a) left-join spend onto cities; (b) inner-join both then right-join to keep ids only in df_b; (c) stack df_a and a new rows frame vertically; (d) align on index instead of column; (e) explain when concat vs merge is appropriate.

DataFrames:
df_a
| id | city |
|---:|------|
| 1 | SF |
| 2 | NYC |
| 3 | LA |

df_b
| id | spend |
|---:|------:|
| 2 | 80 |
| 3 | 20 |
| 4 | 10 |

df_c
| id | tag |
|---:|-----|
| 3 | vip |

PracHub · Accepted Answer

This question evaluates proficiency in SQL and pandas data-manipulation techniques, including aggregations with NULL semantics, various join types, set operations and deduplication, window functions, and reconciling normalized and denormalized sources.

Quick Overview

Quick Overview