Design SQL/Pandas aggregations on retail schema
Company: Amazon
Role: Data Scientist
Category: Data Manipulation (SQL/Python)
Difficulty: Medium
Interview Round: Technical Screen
Using the schema and sample data below, answer both parts. Assume today is 2025-09-01. Use standard SQL (e.g., PostgreSQL) and idiomatic pandas without Python for-loops over rows.
Schema and sample rows
customers
+---------+------+------------+
| cust_id | name | signup_date|
+---------+------+------------+
| C1 | Ada | 2025-08-28 |
| C2 | Ben | 2025-08-29 |
| C3 | Cy | 2025-08-25 |
+---------+------+------------+
orders
+----------+---------+------------+-----------+
| order_id | cust_id | order_date | status |
+----------+---------+------------+-----------+
| 101 | C1 | 2025-08-29 | completed |
| 102 | C1 | 2025-08-30 | completed |
| 107 | C1 | 2025-08-30 | completed |
| 103 | C1 | 2025-09-01 | returned |
| 104 | C2 | 2025-08-31 | completed |
| 105 | C2 | 2025-09-01 | completed |
| 106 | C3 | 2025-08-25 | cancelled |
+----------+---------+------------+-----------+
order_items
+----------+------------+-----+--------+
| order_id | product_id | qty | price |
+----------+------------+-----+--------+
| 101 | P1 | 1 | 10.00 |
| 101 | P2 | 2 | 5.00 |
| 102 | P1 | 1 | 10.00 |
| 107 | P2 | 1 | 5.00 |
| 104 | P3 | 1 | 20.00 |
| 105 | P3 | 2 | 20.00 |
+----------+------------+-----+--------+
products
+------------+----------+
| product_id | category |
+------------+----------+
| P1 | A |
| P2 | B |
| P3 | A |
+------------+----------+
Definitions
- Order revenue = sum(qty * price) over items for that order.
- Only orders with status = completed count toward revenue; returned or cancelled do not.
- If a customer has multiple completed orders on the same calendar day, keep only one order for that day: pick the order with the greater order revenue; if there is a tie on revenue, keep the higher order_id.
Part A (SQL)
Write a single SQL query that returns, for each customer with at least two completed orders in the last 7 days (inclusive of today = 2025-09-01), one row per kept order (after the same-day de-duplication rule) with the following columns:
- cust_id
- order_id (after applying the same-day selection rule)
- order_date
- order_revenue
- rolling_3_day_rev: sum of the customer's order_revenue over the window [order_date - 2 days, order_date], considering only completed orders that survived the same-day selection rule
- rank_in_7d_by_revenue: dense rank of this kept order's revenue among the customer's kept orders in the last 7 days, highest revenue gets rank 1
Requirements: implement same-day selection without correlated subqueries; use window functions; do not use temporary tables.
Part B (pandas)
Using pandas DataFrames with the same content, produce a DataFrame with one row per customer for the last 7 days (inclusive, relative to today = 2025-09-01) containing:
- cust_id
- total_7d_revenue
- top_category_7d: the product category with the highest revenue for that customer in the 7-day window (break ties by alphabetical order of category)
- top_category_share_7d: the fraction (0–1) of the customer's 7-day revenue attributable to top_category_7d
Constraints: avoid Python loops; show how you enforce the same-day order selection rule prior to aggregation.