Compute Jaccard Similarity for Lists
Company: Shopify
Role: Data Scientist
Category: Coding & Algorithms
Difficulty: medium
Interview Round: Technical Screen
Quick Answer: This question evaluates understanding of Jaccard similarity and set-based operations, assessing competency in deduplicating lists, precise string comparisons (case-sensitivity and whitespace), and handling edge cases.
Constraints
- 0 <= len(a), len(b) <= 100000
- Each element of `a` and `b` is a string
- Duplicate strings within the same list should be ignored when computing similarity
- String comparison is case-sensitive, and whitespace is significant
- Do not use third-party libraries
Examples
Input: (["red", "blue", "blue"], ["blue", "green"])
Expected Output: 0.3333333333333333
Explanation: The unique sets are {"red", "blue"} and {"blue", "green"}. Their intersection has size 1 and their union has size 3, so the result is 1/3.
Input: (["a", "b"], ["a", "b"])
Expected Output: 1.0
Explanation: Both lists contain the same unique strings, so the intersection and union both have size 2.
Input: ([], ["a"])
Expected Output: 0.0
Explanation: Exactly one list is empty, so the similarity is defined as 0.0.
Input: ([], [])
Expected Output: 1.0
Explanation: Both lists are empty, so the similarity is defined as 1.0.
Input: (["A", "a", "x "], ["a", "x"])
Expected Output: 0.25
Explanation: Comparison is case-sensitive and whitespace is significant. The unique sets are {"A", "a", "x "} and {"a", "x"}. Their intersection has size 1 and union has size 4, so the result is 1/4.
Hints
- Convert each list into a set first so duplicates do not affect the answer.
- Be careful with the special case where both sets are empty, because the union size would be 0.