Write SQL to compute max-overlap lists

Q: Write SQL to compute max-overlap lists

This is a Data Manipulation (SQL/Python) interview question from Pinterest for Data Scientist roles. View the full question and solution on PracHub.

Q: How do I approach Data Manipulation (SQL/Python) interview questions?

Data Manipulation (SQL/Python) questions require understanding of core concepts and practice. PracHub provides solutions with explanations to help you master data manipulation (sql/python) interviews.

Question

Invented schema and sample data below. Assume 'today' is 2025-09-01 and 'last 7 days' means 2025-08-26 through 2025-09-01 inclusive. Only consider lists whose created_at falls in that window.

Schema:

lists(list_id INT PRIMARY KEY, owner_id INT, created_at DATE)
list_items(list_id INT, item VARCHAR)

Sample tables: lists +---------+----------+------------+ | list_id | owner_id | created_at | +---------+----------+------------+ | 1 | 10 | 2025-08-30 | | 2 | 11 | 2025-08-31 | | 3 | 12 | 2025-09-01 | +---------+----------+------------+

list_items +---------+------+ | list_id | item | +---------+------+ | 1 | A | | 1 | B | | 1 | C | | 2 | A | | 2 | C | | 2 | D | | 3 | B | | 3 | C | | 3 | E | +---------+------+

Task: Write a single SQL query that returns exactly one row: (list_id_small, list_id_large, overlap_count, jaccard) for the unordered pair of lists with the maximum item overlap among eligible lists. Define overlap_count = |items(Li) ∩ items(Lj)| and jaccard = overlap_count / |items(Li) ∪ items(Lj)|. Break ties by higher jaccard, then by (list_id_small, list_id_large) ascending. For the sample data above, the expected row is (1, 2, 2, 0.5). Requirements: 1) Do not count duplicate items within a list more than once. 2) Use standard SQL constructs (CTEs/window functions allowed). 3) Explain indexes you would add to make this fast at scale. 4) Provide a variant that returns the top-5 pairs.

Write SQL to compute max-overlap lists

Comments (0)