You are given two tables with intentional duplicates. Write SQL to: (a) identify duplicate user_ids and produce a canonical, deduplicated users set; (b) list users with purchase_count >= 2 using a deduplicated view of purchases; (c) produce a distribution table of purchase_count per user (including users with 0 purchases) and the percentage each bucket represents out of all unique users.
Schema
-
users(user_id INT, name VARCHAR, signup_date DATE)
-
purchases(order_id INT, user_id INT, purchased_at DATE, amount_cents INT)
Sample data (small, but representative)
users
user_id | name | signup_date
101 | Alice | 2025-01-01
101 | Alice M | 2025-01-01
102 | Bob | 2025-01-03
103 | Cathy | 2025-01-04
103 | C. Li | 2025-01-04
104 | Dan | 2025-01-05
purchases
order_id | user_id | purchased_at | amount_cents
7001 | 101 | 2025-01-10 | 1200
7001 | 101 | 2025-01-10 | 1200 -- duplicate row
7002 | 102 | 2025-01-11 | 500
7003 | 103 | 2025-01-11 | 700
7003 | 103 | 2025-01-11 | 700 -- duplicate row
7004 | 103 | 2025-01-12 | 400
7005 | 101 | 2025-01-13 | 350
Requirements and hints
-
(a) Output two result sets: one listing duplicate user_id rows (i.e., all users where count(*) over user_id > 1) and one CTE/view deduped_users with exactly one row per user_id. Choose the canonical row as the earliest signup_date; if tied, pick lexicographically smallest name. Show your window-function logic explicitly.
-
(b) Before counting purchases, deduplicate purchases so each order_id contributes at most once (keep the earliest purchased_at per order_id). Use this deduped_orders CTE to compute purchase_count per user and then return all users with purchase_count >= 2.
-
(c) Using deduped_users LEFT JOIN deduped_orders, compute for each exact purchase_count (0,1,2,...) the number of users and percent_of_users = number_in_bucket / total_unique_users, rounded to 2 decimals. Return rows sorted by purchase_count ascending. Ensure users with 0 purchases appear.