Deduplicate events and rank products with SQL

Q: Deduplicate events and rank products with SQL

This is a Data Manipulation (SQL/Python) interview question from Google for Data Scientist roles. View the full question and solution on PracHub.

Q: How do I approach Data Manipulation (SQL/Python) interview questions?

Data Manipulation (SQL/Python) questions require understanding of core concepts and practice. PracHub provides solutions with explanations to help you master data manipulation (sql/python) interviews.

Question

You are given two tables.

Schema:

events(event_id INT PRIMARY KEY, user_id INT, product_id INT, event_time TIMESTAMP, idempotency_key TEXT, amount_cents INT)
products(product_id INT PRIMARY KEY, product_name TEXT)

Sample data (minimal but representative):

events +----------+---------+------------+---------------------+-----------------+--------------+ | event_id | user_id | product_id | event_time | idempotency_key | amount_cents | +----------+---------+------------+---------------------+-----------------+--------------+ | 101 | 1 | 10 | 2025-08-31 23:59:58 | abc | 1299 | | 102 | 1 | 10 | 2025-08-31 23:59:59 | abc | 1299 | | 103 | 2 | 10 | 2025-09-01 00:00:03 | def | 1299 | | 104 | 2 | 10 | 2025-09-01 00:05:01 | def | 1299 | | 105 | 2 | 20 | 2025-09-01 00:06:00 | ghi | 2599 | | 106 | 3 | 20 | 2025-09-01 12:00:00 | jkl | 2599 | | 107 | 1 | 30 | 2025-09-01 12:05:00 | mno | 3099 | +----------+---------+------------+---------------------+-----------------+--------------+

products +------------+--------------+ | product_id | product_name | +------------+--------------+ | 10 | Basic Tee | | 20 | Hoodie | | 30 | Socks | +------------+--------------+

Task A — De-duplicate retry events: Some payments are retried and share the same (user_id, idempotency_key). Write ANSI SQL that keeps exactly one row per (user_id, idempotency_key), choosing the row with the earliest event_time; break ties by the smallest event_id. Return all columns of the kept rows.

Task B — Rank products by distinct purchasers for a given date: Using the de-duplicated rows from Task A, write a single SQL query that returns, for the calendar date 2025-09-01 (UTC), the top 2 products by distinct purchasing users.

Output columns: product_id, product_name, distinct_buyers, rank.
Use window functions to compute ranks; break ties by product_id ASC.
Do not use temporary tables; a CTE-based solution is acceptable. Explain how your solution scales if events has 10^9 rows and how you would index to support it.

Deduplicate events and rank products with SQL

Comments (0)