PracHub
QuestionsPremiumLearningGuidesInterview PrepNEWCoaches

Quick Overview

This question evaluates proficiency in SQL data manipulation skills including entity deduplication, canonicalization rules, window functions, common table expressions, joins, and aggregation to derive per-user purchase counts and distributions.

  • Medium
  • CVS Health
  • Data Manipulation (SQL/Python)
  • Data Scientist

Write SQL for dedup and purchase shares

Company: CVS Health

Role: Data Scientist

Category: Data Manipulation (SQL/Python)

Difficulty: Medium

Interview Round: Technical Screen

You are given two tables with intentional duplicates. Write SQL to: (a) identify duplicate user_ids and produce a canonical, deduplicated users set; (b) list users with purchase_count >= 2 using a deduplicated view of purchases; (c) produce a distribution table of purchase_count per user (including users with 0 purchases) and the percentage each bucket represents out of all unique users. Schema - users(user_id INT, name VARCHAR, signup_date DATE) - purchases(order_id INT, user_id INT, purchased_at DATE, amount_cents INT) Sample data (small, but representative) users user_id | name | signup_date 101 | Alice | 2025-01-01 101 | Alice M | 2025-01-01 102 | Bob | 2025-01-03 103 | Cathy | 2025-01-04 103 | C. Li | 2025-01-04 104 | Dan | 2025-01-05 purchases order_id | user_id | purchased_at | amount_cents 7001 | 101 | 2025-01-10 | 1200 7001 | 101 | 2025-01-10 | 1200 -- duplicate row 7002 | 102 | 2025-01-11 | 500 7003 | 103 | 2025-01-11 | 700 7003 | 103 | 2025-01-11 | 700 -- duplicate row 7004 | 103 | 2025-01-12 | 400 7005 | 101 | 2025-01-13 | 350 Requirements and hints - (a) Output two result sets: one listing duplicate user_id rows (i.e., all users where count(*) over user_id > 1) and one CTE/view deduped_users with exactly one row per user_id. Choose the canonical row as the earliest signup_date; if tied, pick lexicographically smallest name. Show your window-function logic explicitly. - (b) Before counting purchases, deduplicate purchases so each order_id contributes at most once (keep the earliest purchased_at per order_id). Use this deduped_orders CTE to compute purchase_count per user and then return all users with purchase_count >= 2. - (c) Using deduped_users LEFT JOIN deduped_orders, compute for each exact purchase_count (0,1,2,...) the number of users and percent_of_users = number_in_bucket / total_unique_users, rounded to 2 decimals. Return rows sorted by purchase_count ascending. Ensure users with 0 purchases appear.

Quick Answer: This question evaluates proficiency in SQL data manipulation skills including entity deduplication, canonicalization rules, window functions, common table expressions, joins, and aggregation to derive per-user purchase counts and distributions.

Last updated: Mar 29, 2026

Loading coding console...

PracHub

Master your tech interviews with 7,500+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.

Related Coding Questions

  • Create and query an e-commerce schema - CVS Health (Medium)
  • Aggregate radiology spend and derive fiscal month - CVS Health (Medium)
  • Compute age-band spend and YoY in Georgia - CVS Health (Medium)
  • Calculate annual percentages and YoY by cohorts - CVS Health (Medium)
  • Use pandas to aggregate, pivot, and label - CVS Health (Medium)