Write SQL for dedup and purchase shares

Q: Write SQL for dedup and purchase shares

This is a Data Manipulation (SQL/Python) interview question from CVS Health for Data Scientist roles. View the full question and solution on PracHub.

Q: How do I approach Data Manipulation (SQL/Python) interview questions?

Data Manipulation (SQL/Python) questions require understanding of core concepts and practice. PracHub provides solutions with explanations to help you master data manipulation (sql/python) interviews.

Question

You are given two tables with intentional duplicates. Write SQL to: (a) identify duplicate user_ids and produce a canonical, deduplicated users set; (b) list users with purchase_count >= 2 using a deduplicated view of purchases; (c) produce a distribution table of purchase_count per user (including users with 0 purchases) and the percentage each bucket represents out of all unique users.

Schema

users(user_id INT, name VARCHAR, signup_date DATE)
purchases(order_id INT, user_id INT, purchased_at DATE, amount_cents INT)

Sample data (small, but representative) users user_id | name | signup_date 101 | Alice | 2025-01-01 101 | Alice M | 2025-01-01 102 | Bob | 2025-01-03 103 | Cathy | 2025-01-04 103 | C. Li | 2025-01-04 104 | Dan | 2025-01-05

purchases order_id | user_id | purchased_at | amount_cents 7001 | 101 | 2025-01-10 | 1200 7001 | 101 | 2025-01-10 | 1200 -- duplicate row 7002 | 102 | 2025-01-11 | 500 7003 | 103 | 2025-01-11 | 700 7003 | 103 | 2025-01-11 | 700 -- duplicate row 7004 | 103 | 2025-01-12 | 400 7005 | 101 | 2025-01-13 | 350

Requirements and hints

(a) Output two result sets: one listing duplicate user_id rows (i.e., all users where count(*) over user_id > 1) and one CTE/view deduped_users with exactly one row per user_id. Choose the canonical row as the earliest signup_date; if tied, pick lexicographically smallest name. Show your window-function logic explicitly.
(b) Before counting purchases, deduplicate purchases so each order_id contributes at most once (keep the earliest purchased_at per order_id). Use this deduped_orders CTE to compute purchase_count per user and then return all users with purchase_count >= 2.
(c) Using deduped_users LEFT JOIN deduped_orders, compute for each exact purchase_count (0,1,2,...) the number of users and percent_of_users = number_in_bucket / total_unique_users, rounded to 2 decimals. Return rows sorted by purchase_count ascending. Ensure users with 0 purchases appear.

Write SQL for dedup and purchase shares

Comments (0)