Calculate Jaccard Similarity Score for Two String Lists
Company: Shopify
Role: Data Scientist
Category: Coding & Algorithms
Difficulty: Medium
Interview Round: Onsite
Quick Answer: This question evaluates understanding of similarity metrics and competency in comparing string collections, reflecting knowledge of set-theoretic concepts and manipulation of collection data structures used in data processing.
Constraints
- 0 <= len(listA), len(listB) <= 10^5
- Each tag is a non-empty string of printable ASCII characters.
- Tags are case-sensitive ("Sale" and "sale" are distinct).
- Duplicate tags within the same list count once.
- If both lists are empty (union is empty), return 0.0.
Examples
Input: (['a', 'b', 'c'], ['b', 'c', 'd'])
Expected Output: 0.5
Explanation: Intersection {b, c} has size 2; union {a, b, c, d} has size 4; 2/4 = 0.5.
Input: (['shoes', 'apparel', 'sale'], ['shoes', 'apparel', 'sale'])
Expected Output: 1.0
Explanation: Identical tag sets: intersection equals union, so the score is 1.0.
Input: (['x', 'y'], ['p', 'q'])
Expected Output: 0.0
Explanation: No shared tags: intersection is empty (size 0), so the score is 0.0.
Input: ([], [])
Expected Output: 0.0
Explanation: Both lists empty: the union is empty, so we return 0.0 to avoid dividing by zero.
Input: (['a', 'a', 'b'], ['a', 'b', 'b'])
Expected Output: 1.0
Explanation: Duplicates collapse: both become {a, b}; intersection equals union, score 1.0.
Input: (['a', 'b', 'c', 'd'], ['c', 'd'])
Expected Output: 0.5
Explanation: Second list is a subset: intersection {c, d} size 2, union {a, b, c, d} size 4; 2/4 = 0.5.
Input: (['tag1'], [])
Expected Output: 0.0
Explanation: One empty list: intersection is empty, union is {tag1} size 1; 0/1 = 0.0.
Hints
- Convert both lists to sets to discard duplicates and enable fast membership tests.
- The intersection size is |A ∩ B| and the union size is |A ∪ B|; the score is their ratio.
- Guard against division by zero: when the union is empty (both lists empty), return 0.0.