Deduplicate events and rank products with SQL
Company: Google
Role: Data Scientist
Category: Data Manipulation (SQL/Python)
Difficulty: Medium
Interview Round: Technical Screen
You are given two tables.
Schema:
- events(event_id INT PRIMARY KEY, user_id INT, product_id INT, event_time TIMESTAMP, idempotency_key TEXT, amount_cents INT)
- products(product_id INT PRIMARY KEY, product_name TEXT)
Sample data (minimal but representative):
events
+----------+---------+------------+---------------------+-----------------+--------------+
| event_id | user_id | product_id | event_time | idempotency_key | amount_cents |
+----------+---------+------------+---------------------+-----------------+--------------+
| 101 | 1 | 10 | 2025-08-31 23:59:58 | abc | 1299 |
| 102 | 1 | 10 | 2025-08-31 23:59:59 | abc | 1299 |
| 103 | 2 | 10 | 2025-09-01 00:00:03 | def | 1299 |
| 104 | 2 | 10 | 2025-09-01 00:05:01 | def | 1299 |
| 105 | 2 | 20 | 2025-09-01 00:06:00 | ghi | 2599 |
| 106 | 3 | 20 | 2025-09-01 12:00:00 | jkl | 2599 |
| 107 | 1 | 30 | 2025-09-01 12:05:00 | mno | 3099 |
+----------+---------+------------+---------------------+-----------------+--------------+
products
+------------+--------------+
| product_id | product_name |
+------------+--------------+
| 10 | Basic Tee |
| 20 | Hoodie |
| 30 | Socks |
+------------+--------------+
Task A — De-duplicate retry events:
Some payments are retried and share the same (user_id, idempotency_key). Write ANSI SQL that keeps exactly one row per (user_id, idempotency_key), choosing the row with the earliest event_time; break ties by the smallest event_id. Return all columns of the kept rows.
Task B — Rank products by distinct purchasers for a given date:
Using the de-duplicated rows from Task A, write a single SQL query that returns, for the calendar date 2025-09-01 (UTC), the top 2 products by distinct purchasing users.
- Output columns: product_id, product_name, distinct_buyers, rank.
- Use window functions to compute ranks; break ties by product_id ASC.
- Do not use temporary tables; a CTE-based solution is acceptable.
Explain how your solution scales if events has 10^9 rows and how you would index to support it.
Quick Answer: This question evaluates proficiency in SQL data manipulation and analytical querying—specifically deduplication using idempotency keys, window functions/CTEs for ranking, distinct aggregation by date, timestamp tie-breaking, and awareness of indexing and scalability trade-offs in large tables; the domain is Data Manipulation (SQL/Python).