Compute Jaccard similarity between two strings
Company: Moveworks
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: medium
Interview Round: Onsite
Quick Answer: This question evaluates string tokenization, set-based similarity metrics (Jaccard index) and basic set operations, testing competency in text processing and algorithmic reasoning within the Coding & Algorithms domain and requiring practical application of these concepts.
Constraints
- 0 <= len(a), len(b)
- Strings may contain letters, digits, punctuation, and whitespace.
- Tokenization is case-insensitive; split on any non-alphabetic character.
- If both token sets are empty, the result is defined as 1.0.
Examples
Input: ("I like coffee, coffee", "coffee is great")
Expected Output: 0.2
Explanation: A = {i, like, coffee}, B = {coffee, is, great}. Intersection {coffee} = 1, union size = 5 → 1/5 = 0.2.
Input: ("", "")
Expected Output: 1.0
Explanation: Both token sets are empty, so by definition the similarity is 1.0.
Input: ("hello world", "hello world")
Expected Output: 1.0
Explanation: Identical token sets {hello, world}; intersection equals union, so 2/2 = 1.0.
Input: ("apple", "banana")
Expected Output: 0.0
Explanation: Disjoint sets {apple} and {banana}; intersection 0, union 2 → 0.0.
Input: ("The quick brown fox", "the QUICK red fox")
Expected Output: 0.6
Explanation: Case-insensitive: A = {the, quick, brown, fox}, B = {the, quick, red, fox}. Intersection {the, quick, fox} = 3, union size = 5 → 3/5 = 0.6.
Input: ("a.b.c", "b,c,d")
Expected Output: 0.5
Explanation: Punctuation splits tokens: A = {a, b, c}, B = {b, c, d}. Intersection {b, c} = 2, union {a, b, c, d} = 4 → 0.5.
Input: ("only here", "")
Expected Output: 0.0
Explanation: A = {only, here}, B = {}. Intersection 0, union 2 → 0.0. Only one side empty, so not the 1.0 special case.
Input: ("123 456", "hello")
Expected Output: 0.0
Explanation: Digits are non-alphabetic, so "123 456" yields an empty token set. A = {}, B = {hello}; intersection 0, union 1 → 0.0.
Hints
- Lowercase each string, then split on runs of non-alphabetic characters and drop empty pieces to build a set of unique tokens.
- Use set intersection and union: similarity = |A ∩ B| / |A ∪ B|.
- Handle the special case first: if the union is empty (both sets empty), return 1.0 to avoid dividing by zero.