Verify subscriptions and analyze orders with SQL/Python
Company: Amazon
Role: Data Scientist
Category: Data Manipulation (SQL/Python)
Difficulty: Medium
Interview Round: Technical Screen
You are given two tables. Write SQL and Python (pandas) to answer the sub-questions precisely, handling edge cases, ties, and missing data.
Schema
- subscriptions(subscription_id INT, status VARCHAR(20), status_date DATE)
- orders(order_id INT, cust_id INT, order_date DATE, product VARCHAR(30), order_amount DECIMAL(10,2), shipping_option_id INT)
Sample data (small, intentionally messy)
subscriptions
subscription_id | status | status_date
--------------: | -------- | -----------
1001 | inactive | 2025-01-10
1001 | active | 2025-02-01
1001 | paused | 2025-03-15
1001 | inactive | 2025-04-01
1002 | active | 2025-01-05
1002 | active | 2025-01-05
1003 | active | 2025-02-10
1003 | inactive | 2025-02-08
orders
order_id | cust_id | order_date | product | order_amount | shipping_option_id
-------: | ------: | ---------- | ------- | -----------: | -----------------:
1 | 10 | 2025-01-05 | camera | 80.00 | 1
2 | 10 | 2025-01-20 | shoes | 30.00 | 2
3 | 11 | 2025-01-25 | laptop | 1200.00 | 1
4 | 12 | 2025-02-02 | clothes | 40.00 | 3
5 | 12 | 2025-02-10 | camera | 70.00 | 1
6 | 13 | 2025-02-15 | shoes | 20.00 | 2
7 | 13 | 2025-02-16 | shoes | 25.00 | 2
8 | 14 | 2025-03-01 | laptop | 999.00 | 1
9 | 10 | 2025-03-05 | clothes | 15.00 | 3
10 | 15 | 2025-03-08 | camera | 50.00 | 1
Tasks
A) SQL data-quality checks on subscriptions
- Write one or more queries that would rigorously confirm or refute all of the following assumptions about the subscriptions table, returning concrete violating rows if any exist:
1) Allowed statuses are only {'active','inactive'}; surface any unexpected statuses (e.g., 'paused').
2) (subscription_id, status_date) is unique; list duplicates if present.
3) For each subscription_id, status_date values are strictly increasing over time; surface any non-monotonic back-dated rows.
4) For each subscription_id and calendar date, there is at most one status; surface overlapping same-day multi-status cases.
- Additionally, as of reference_date = '2025-09-01', return for each subscription_id: its latest known status and the timestamp of that status.
B) Pandas on subscriptions
- Create a pandas DataFrame with columns [subscription_id, first_active_date, last_inactive_date], where:
• first_active_date is the earliest status_date with status = 'active'.
• last_inactive_date is the most recent status_date with status = 'inactive' on or before '2025-09-01'.
- Requirements: handle ties/duplicates deterministically (pick MIN for first_active_date, MAX for last_inactive_date), ignore unexpected statuses, and allow nulls when a subscription never became active or inactive.
C) Python (pandas) on orders
1) Return the list of cust_id who either (a) placed fewer than 2 total orders, or (b) have total order_amount across all time < 100.00.
2) For each calendar month (YYYY-MM based on order_date), compute two leaderboards:
• Top 5 customers by order count.
• Top 5 customers by total order_amount.
Use tie-breakers: higher total_amount first, then lower cust_id; if fewer than 5 exist, return all available.
3) Compute each customer's total spend per product type and present a wide table with columns exactly: cust_id | camera | shoes | laptop | clothes. Missing combinations should be filled with 0.00.
Quick Answer: This question evaluates proficiency in SQL and pandas data manipulation, covering data-quality validation, temporal sequence reasoning, deduplication, deterministic aggregation, and tie-breaking logic.