Assess SQL joins, unions, windows, dedup, and pandas
Company: Fannie Mae
Role: Data Scientist
Category: Data Manipulation (SQL/Python)
Difficulty: Medium
Interview Round: Technical Screen
Using the schema and sample data below, answer all sub-questions. Unless stated, assume ANSI SQL and explain any dialect-specific choices.
Schema:
- customers(customer_id PK, name)
- orders(order_id PK, customer_id FK->customers, order_date DATE, amount DECIMAL(10,2))
- order_items(item_id PK, order_id FK->orders, product_id FK->products, qty INT, price DECIMAL(10,2))
- products(product_id PK, name, category)
- web_orders(order_id, customer_id, amount)
- store_orders(order_id, customer_id, amount)
- employees(emp_id PK, name, salary INT)
Sample tables (tiny):
customers
| customer_id | name |
|------------:|-------|
| 1 | Alice |
| 2 | Bob |
| 3 | Carol |
orders
| order_id | customer_id | order_date | amount |
|---------:|------------:|-------------|---------|
| 101 | 1 | 2025-08-28 | 120.00 |
| 102 | 1 | 2025-08-30 | 80.00 |
| 103 | 2 | 2025-08-29 | 50.00 |
| 105 | 3 | 2025-08-30 | NULL |
order_items
| item_id | order_id | product_id | qty | price |
|--------:|---------:|-----------:|----:|--------|
| 1 | 101 | 10 | 1 | 120.00 |
| 2 | 102 | 11 | 2 | 40.00 |
| 3 | 103 | 12 | 1 | 50.00 |
products
| product_id | name | category |
|-----------:|-------|--------------|
| 10 | Phone | Electronics |
| 11 | Cable | Electronics |
| 12 | Book | Media |
web_orders
| order_id | customer_id | amount |
|---------:|------------:|-------:|
| 201 | 1 | 50.00 |
| 202 | 1 | 50.00 |
| 202 | 1 | 50.00 |
store_orders
| order_id | customer_id | amount |
|---------:|------------:|-------:|
| 301 | 1 | 50.00 |
| 202 | 1 | 50.00 |
| 302 | 2 | 50.00 |
employees
| emp_id | name | salary |
|------:|-------|-------:|
| 1 | Alice | 100 |
| 2 | Bob | 200 |
| 3 | Carol | 300 |
| 4 | Dave | 300 |
Tasks:
1) Aggregations and NULLs: Write queries that compute per-customer: (a) COUNT(*), COUNT(amount), SUM(amount), AVG(amount), MIN/MAX(order_date). Show the result difference caused by orders.amount being NULL for customer_id=3. Explain COUNT(column) vs COUNT(*), and how AVG ignores NULLs. Also compute total revenue from order_items using SUM(qty*price) and reconcile it to SUM(orders.amount); explain any mismatch and how to detect inconsistencies.
2) Joins: (a) INNER JOIN customers↔orders to list customers who placed orders. (b) LEFT JOIN customers↔orders to include customers with zero orders; return 0 as totals for those. (c) FULL OUTER JOIN customers↔orders and explain which rows appear only on one side. (d) CROSS JOIN products↔(SELECT DISTINCT category) to illustrate Cartesian output and how to limit it. Provide queries and one or two expected result rows to prove understanding.
3) UNION vs UNION ALL: Combine web_orders and store_orders into a single set of (order_id, customer_id, amount). (a) Using UNION ALL, count total rows and the number of duplicate order_ids. (b) Using UNION (distinct), show deduplicated rows. (c) Return only the order_ids that appear in both sources without using INTERSECT.
4) Window functions: (a) For each customer, rank their orders by amount DESC and return the top 1 per customer using ROW_NUMBER. (b) Compute a running total of amount per customer ordered by order_date; show partition boundaries. (c) For August 2025, compute each customer’s rank by monthly spend using SUM(amount) OVER(PARTITION BY customer_id) and then DENSE_RANK across customers. Explain why a window does not collapse rows like GROUP BY.
5) View vs table: Create a view top_spenders_2025_08(customer_id, total_amount) over orders. Is it updatable? Under what conditions? What happens if the base table adds a NOT NULL column or if the view definition references a column later dropped? Discuss pros/cons of views vs materialized tables for this case and how to refresh a materialized view (name your chosen RDBMS).
6) Hide duplicated rows (select-time): Return one row per distinct (order_id, customer_id, amount) from web_orders without deleting data. Show two approaches: DISTINCT and ROW_NUMBER() filtering; discuss performance and which preserves deterministic choice.
7) Remove duplicated rows (data-change): In web_orders, delete true duplicates keeping the lowest ROW_NUMBER by (order_id, customer_id, amount) and explain how you’d do this safely in a transaction with a reproducible tie-breaker. Provide the DELETE…CTE you would run.
8) LC #177 (Nth highest salary): Using employees(name, salary), write a query that returns the Nth highest distinct salary given a parameter :n. Show outputs for n=1,2,3 on the sample data and explain behavior when n exceeds the number of distinct salaries.
9) Improve SQL efficiency: Propose an indexing strategy for the above workloads (consider orders(customer_id, order_date), order_items(order_id), and covering indexes). Show how to rewrite one query to be sargable (e.g., avoid functions on indexed columns), and explain when EXISTS outperforms IN and when a window function can replace a self-join. Describe how you’d verify improvements using an execution plan and timing.
10) Python pandas merge vs join vs concat: Given DataFrames below, specify the exact pandas calls (with how=, on=, axis=, ignore_index=, validate=) to: (a) left-join spend onto cities; (b) inner-join both then right-join to keep ids only in df_b; (c) stack df_a and a new rows frame vertically; (d) align on index instead of column; (e) explain when concat vs merge is appropriate.
DataFrames:
df_a
| id | city |
|---:|------|
| 1 | SF |
| 2 | NYC |
| 3 | LA |
df_b
| id | spend |
|---:|------:|
| 2 | 80 |
| 3 | 20 |
| 4 | 10 |
df_c
| id | tag |
|---:|-----|
| 3 | vip |
Quick Answer: This question evaluates proficiency in SQL and pandas data-manipulation techniques, including aggregations with NULL semantics, various join types, set operations and deduplication, window functions, and reconciling normalized and denormalized sources.