You are given two tables. Write SQL and Python (pandas) to answer the sub-questions precisely, handling edge cases, ties, and missing data.
Schema
-
subscriptions(subscription_id INT, status VARCHAR(20), status_date DATE)
-
orders(order_id INT, cust_id INT, order_date DATE, product VARCHAR(30), order_amount DECIMAL(10,2), shipping_option_id INT)
Sample data (small, intentionally messy)
subscriptions
| subscription_id | status | status_date |
|---|
| 1001 | inactive | 2025-01-10 |
| 1001 | active | 2025-02-01 |
| 1001 | paused | 2025-03-15 |
| 1001 | inactive | 2025-04-01 |
| 1002 | active | 2025-01-05 |
| 1002 | active | 2025-01-05 |
| 1003 | active | 2025-02-10 |
| 1003 | inactive | 2025-02-08 |
orders
| order_id | cust_id | order_date | product | order_amount | shipping_option_id |
|---|
| 1 | 10 | 2025-01-05 | camera | 80.00 | 1 |
| 2 | 10 | 2025-01-20 | shoes | 30.00 | 2 |
| 3 | 11 | 2025-01-25 | laptop | 1200.00 | 1 |
| 4 | 12 | 2025-02-02 | clothes | 40.00 | 3 |
| 5 | 12 | 2025-02-10 | camera | 70.00 | 1 |
| 6 | 13 | 2025-02-15 | shoes | 20.00 | 2 |
| 7 | 13 | 2025-02-16 | shoes | 25.00 | 2 |
| 8 | 14 | 2025-03-01 | laptop | 999.00 | 1 |
| 9 | 10 | 2025-03-05 | clothes | 15.00 | 3 |
| 10 | 15 | 2025-03-08 | camera | 50.00 | 1 |
Tasks
A) SQL data-quality checks on subscriptions
-
Write one or more queries that would rigorously confirm or refute all of the following assumptions about the subscriptions table, returning concrete violating rows if any exist:
-
Allowed statuses are only {'active','inactive'}; surface any unexpected statuses (e.g., 'paused').
-
(subscription_id, status_date) is unique; list duplicates if present.
-
For each subscription_id, status_date values are strictly increasing over time; surface any non-monotonic back-dated rows.
-
For each subscription_id and calendar date, there is at most one status; surface overlapping same-day multi-status cases.
-
Additionally, as of reference_date = '2025-09-01', return for each subscription_id: its latest known status and the timestamp of that status.
B) Pandas on subscriptions
-
Create a pandas DataFrame with columns [subscription_id, first_active_date, last_inactive_date], where:
• first_active_date is the earliest status_date with status = 'active'.
• last_inactive_date is the most recent status_date with status = 'inactive' on or before '2025-09-01'.
-
Requirements: handle ties/duplicates deterministically (pick MIN for first_active_date, MAX for last_inactive_date), ignore unexpected statuses, and allow nulls when a subscription never became active or inactive.
C) Python (pandas) on orders
-
Return the list of cust_id who either (a) placed fewer than 2 total orders, or (b) have total order_amount across all time < 100.00.
-
For each calendar month (YYYY-MM based on order_date), compute two leaderboards:
• Top 5 customers by order count.
• Top 5 customers by total order_amount.
Use tie-breakers: higher total_amount first, then lower cust_id; if fewer than 5 exist, return all available.
-
Compute each customer's total spend per product type and present a wide table with columns exactly: cust_id | camera | shoes | laptop | clothes. Missing combinations should be filled with 0.00.