Write SQL to compute max-overlap lists
Company: Pinterest
Role: Data Scientist
Category: Data Manipulation (SQL/Python)
Difficulty: Medium
Interview Round: Onsite
Invented schema and sample data below. Assume 'today' is 2025-09-01 and 'last 7 days' means 2025-08-26 through 2025-09-01 inclusive. Only consider lists whose created_at falls in that window.
Schema:
- lists(list_id INT PRIMARY KEY, owner_id INT, created_at DATE)
- list_items(list_id INT, item VARCHAR)
Sample tables:
lists
+---------+----------+------------+
| list_id | owner_id | created_at |
+---------+----------+------------+
| 1 | 10 | 2025-08-30 |
| 2 | 11 | 2025-08-31 |
| 3 | 12 | 2025-09-01 |
+---------+----------+------------+
list_items
+---------+------+
| list_id | item |
+---------+------+
| 1 | A |
| 1 | B |
| 1 | C |
| 2 | A |
| 2 | C |
| 2 | D |
| 3 | B |
| 3 | C |
| 3 | E |
+---------+------+
Task: Write a single SQL query that returns exactly one row: (list_id_small, list_id_large, overlap_count, jaccard) for the unordered pair of lists with the maximum item overlap among eligible lists. Define overlap_count = |items(Li) ∩ items(Lj)| and jaccard = overlap_count / |items(Li) ∪ items(Lj)|. Break ties by higher jaccard, then by (list_id_small, list_id_large) ascending. For the sample data above, the expected row is (1, 2, 2, 0.5). Requirements: 1) Do not count duplicate items within a list more than once. 2) Use standard SQL constructs (CTEs/window functions allowed). 3) Explain indexes you would add to make this fast at scale. 4) Provide a variant that returns the top-5 pairs.
Quick Answer: This question evaluates proficiency in SQL data manipulation and set-based operations, focusing on deduplication, pairwise aggregation, and similarity metric computation (Jaccard) over relational tables.