Write SQL for dedup and purchase shares

Q: How do I practice SQL interview questions?

PracHub provides an interactive SQL console where you can write and test queries against real database schemas. Get instant feedback and compare your solution with the expected output.

Q: What difficulty level is this coding question?

This is a medium difficulty Data Manipulation (SQL/Python) question, commonly asked during Technical Screen rounds at CVS Health.

Q: What role is this question designed for?

This question is commonly asked for Data Scientist candidates at CVS Health during technical interviews.

Question

You are given two tables with intentional duplicates. Write SQL to: (a) identify duplicate user_ids and produce a canonical, deduplicated users set; (b) list users with purchase_count >= 2 using a deduplicated view of purchases; (c) produce a distribution table of purchase_count per user (including users with 0 purchases) and the percentage each bucket represents out of all unique users.

Schema
- users(user_id INT, name VARCHAR, signup_date DATE)
- purchases(order_id INT, user_id INT, purchased_at DATE, amount_cents INT)

Sample data (small, but representative)
users
user_id | name | signup_date
101 | Alice | 2025-01-01
101 | Alice M | 2025-01-01
102 | Bob | 2025-01-03
103 | Cathy | 2025-01-04
103 | C. Li | 2025-01-04
104 | Dan | 2025-01-05

purchases
order_id | user_id | purchased_at | amount_cents
7001 | 101 | 2025-01-10 | 1200
7001 | 101 | 2025-01-10 | 1200 -- duplicate row
7002 | 102 | 2025-01-11 | 500
7003 | 103 | 2025-01-11 | 700
7003 | 103 | 2025-01-11 | 700 -- duplicate row
7004 | 103 | 2025-01-12 | 400
7005 | 101 | 2025-01-13 | 350

Requirements and hints
- (a) Output two result sets: one listing duplicate user_id rows (i.e., all users where count(*) over user_id > 1) and one CTE/view deduped_users with exactly one row per user_id. Choose the canonical row as the earliest signup_date; if tied, pick lexicographically smallest name. Show your window-function logic explicitly.
- (b) Before counting purchases, deduplicate purchases so each order_id contributes at most once (keep the earliest purchased_at per order_id). Use this deduped_orders CTE to compute purchase_count per user and then return all users with purchase_count >= 2.
- (c) Using deduped_users LEFT JOIN deduped_orders, compute for each exact purchase_count (0,1,2,...) the number of users and percent_of_users = number_in_bucket / total_unique_users, rounded to 2 decimals. Return rows sorted by purchase_count ascending. Ensure users with 0 purchases appear.

PracHub · Accepted Answer

This question evaluates proficiency in SQL data manipulation skills including entity deduplication, canonicalization rules, window functions, common table expressions, joins, and aggregation to derive per-user purchase counts and distributions.

Quick Overview

Quick Overview