Using the schema and sample data below, answer all sub-questions. Unless stated, assume ANSI SQL and explain any dialect-specific choices.
Schema:
-
customers(customer_id PK, name)
-
orders(order_id PK, customer_id FK->customers, order_date DATE, amount DECIMAL(10,2))
-
order_items(item_id PK, order_id FK->orders, product_id FK->products, qty INT, price DECIMAL(10,2))
-
products(product_id PK, name, category)
-
web_orders(order_id, customer_id, amount)
-
store_orders(order_id, customer_id, amount)
-
employees(emp_id PK, name, salary INT)
Sample tables (tiny):
customers
| customer_id | name |
|---|
| 1 | Alice |
| 2 | Bob |
| 3 | Carol |
orders
| order_id | customer_id | order_date | amount |
|---|
| 101 | 1 | 2025-08-28 | 120.00 |
| 102 | 1 | 2025-08-30 | 80.00 |
| 103 | 2 | 2025-08-29 | 50.00 |
| 105 | 3 | 2025-08-30 | NULL |
order_items
| item_id | order_id | product_id | qty | price |
|---|
| 1 | 101 | 10 | 1 | 120.00 |
| 2 | 102 | 11 | 2 | 40.00 |
| 3 | 103 | 12 | 1 | 50.00 |
products
| product_id | name | category |
|---|
| 10 | Phone | Electronics |
| 11 | Cable | Electronics |
| 12 | Book | Media |
web_orders
| order_id | customer_id | amount |
|---|
| 201 | 1 | 50.00 |
| 202 | 1 | 50.00 |
| 202 | 1 | 50.00 |
store_orders
| order_id | customer_id | amount |
|---|
| 301 | 1 | 50.00 |
| 202 | 1 | 50.00 |
| 302 | 2 | 50.00 |
employees
| emp_id | name | salary |
|---|
| 1 | Alice | 100 |
| 2 | Bob | 200 |
| 3 | Carol | 300 |
| 4 | Dave | 300 |
Tasks:
-
Aggregations and NULLs: Write queries that compute per-customer: (a) COUNT(
), COUNT(amount), SUM(amount), AVG(amount), MIN/MAX(order_date). Show the result difference caused by orders.amount being NULL for customer_id=3. Explain COUNT(column) vs COUNT(
), and how AVG ignores NULLs. Also compute total revenue from order_items using SUM(qty*price) and reconcile it to SUM(orders.amount); explain any mismatch and how to detect inconsistencies.
-
Joins: (a) INNER JOIN customers↔orders to list customers who placed orders. (b) LEFT JOIN customers↔orders to include customers with zero orders; return 0 as totals for those. (c) FULL OUTER JOIN customers↔orders and explain which rows appear only on one side. (d) CROSS JOIN products↔(SELECT DISTINCT category) to illustrate Cartesian output and how to limit it. Provide queries and one or two expected result rows to prove understanding.
-
UNION vs UNION ALL: Combine web_orders and store_orders into a single set of (order_id, customer_id, amount). (a) Using UNION ALL, count total rows and the number of duplicate order_ids. (b) Using UNION (distinct), show deduplicated rows. (c) Return only the order_ids that appear in both sources without using INTERSECT.
-
Window functions: (a) For each customer, rank their orders by amount DESC and return the top 1 per customer using ROW_NUMBER. (b) Compute a running total of amount per customer ordered by order_date; show partition boundaries. (c) For August 2025, compute each customer’s rank by monthly spend using SUM(amount) OVER(PARTITION BY customer_id) and then DENSE_RANK across customers. Explain why a window does not collapse rows like GROUP BY.
-
View vs table: Create a view top_spenders_2025_08(customer_id, total_amount) over orders. Is it updatable? Under what conditions? What happens if the base table adds a NOT NULL column or if the view definition references a column later dropped? Discuss pros/cons of views vs materialized tables for this case and how to refresh a materialized view (name your chosen RDBMS).
-
Hide duplicated rows (select-time): Return one row per distinct (order_id, customer_id, amount) from web_orders without deleting data. Show two approaches: DISTINCT and ROW_NUMBER() filtering; discuss performance and which preserves deterministic choice.
-
Remove duplicated rows (data-change): In web_orders, delete true duplicates keeping the lowest ROW_NUMBER by (order_id, customer_id, amount) and explain how you’d do this safely in a transaction with a reproducible tie-breaker. Provide the DELETE…CTE you would run.
-
LC #177 (Nth highest salary): Using employees(name, salary), write a query that returns the Nth highest distinct salary given a parameter :n. Show outputs for n=1,2,3 on the sample data and explain behavior when n exceeds the number of distinct salaries.
-
Improve SQL efficiency: Propose an indexing strategy for the above workloads (consider orders(customer_id, order_date), order_items(order_id), and covering indexes). Show how to rewrite one query to be sargable (e.g., avoid functions on indexed columns), and explain when EXISTS outperforms IN and when a window function can replace a self-join. Describe how you’d verify improvements using an execution plan and timing.
-
Python pandas merge vs join vs concat: Given DataFrames below, specify the exact pandas calls (with how=, on=, axis=, ignore_index=, validate=) to: (a) left-join spend onto cities; (b) inner-join both then right-join to keep ids only in df_b; (c) stack df_a and a new rows frame vertically; (d) align on index instead of column; (e) explain when concat vs merge is appropriate.
DataFrames:
df_a
df_b
df_c