Deduplicate events and rank products with SQL

Q: How do I practice SQL interview questions?

PracHub provides an interactive SQL console where you can write and test queries against real database schemas. Get instant feedback and compare your solution with the expected output.

Q: What difficulty level is this coding question?

This is a medium difficulty Data Manipulation (SQL/Python) question, commonly asked during Technical Screen rounds at Google.

Q: What role is this question designed for?

This question is commonly asked for Data Scientist candidates at Google during technical interviews.

Question

You are given two tables.

Schema:
- events(event_id INT PRIMARY KEY, user_id INT, product_id INT, event_time TIMESTAMP, idempotency_key TEXT, amount_cents INT)
- products(product_id INT PRIMARY KEY, product_name TEXT)

Sample data (minimal but representative):

events
+----------+---------+------------+---------------------+-----------------+--------------+
| event_id | user_id | product_id | event_time | idempotency_key | amount_cents |
+----------+---------+------------+---------------------+-----------------+--------------+
| 101 | 1 | 10 | 2025-08-31 23:59:58 | abc | 1299 |
| 102 | 1 | 10 | 2025-08-31 23:59:59 | abc | 1299 |
| 103 | 2 | 10 | 2025-09-01 00:00:03 | def | 1299 |
| 104 | 2 | 10 | 2025-09-01 00:05:01 | def | 1299 |
| 105 | 2 | 20 | 2025-09-01 00:06:00 | ghi | 2599 |
| 106 | 3 | 20 | 2025-09-01 12:00:00 | jkl | 2599 |
| 107 | 1 | 30 | 2025-09-01 12:05:00 | mno | 3099 |
+----------+---------+------------+---------------------+-----------------+--------------+

products
+------------+--------------+
| product_id | product_name |
+------------+--------------+
| 10 | Basic Tee |
| 20 | Hoodie |
| 30 | Socks |
+------------+--------------+

Task A — De-duplicate retry events:
Some payments are retried and share the same (user_id, idempotency_key). Write ANSI SQL that keeps exactly one row per (user_id, idempotency_key), choosing the row with the earliest event_time; break ties by the smallest event_id. Return all columns of the kept rows.

Task B — Rank products by distinct purchasers for a given date:
Using the de-duplicated rows from Task A, write a single SQL query that returns, for the calendar date 2025-09-01 (UTC), the top 2 products by distinct purchasing users.
- Output columns: product_id, product_name, distinct_buyers, rank.
- Use window functions to compute ranks; break ties by product_id ASC.
- Do not use temporary tables; a CTE-based solution is acceptable.
Explain how your solution scales if events has 10^9 rows and how you would index to support it.

PracHub · Accepted Answer

This question evaluates proficiency in SQL data manipulation and analytical querying—specifically deduplication using idempotency keys, window functions/CTEs for ranking, distinct aggregation by date, timestamp tie-breaking, and awareness of indexing and scalability trade-offs in large tables; the domain is Data Manipulation (SQL/Python).

Quick Overview

Quick Overview