Compute Jaccard similarity between two strings

This question tests string tokenization, the Jaccard index (a set-based similarity metric), and basic set operations. It belongs to the Coding & Algorithms domain and requires applying these text-processing concepts in practice.

Question

Jaccard Similarity of Two Strings

Given two strings a and b, compute their Jaccard similarity based on token sets.

Tokenization rules

Convert to lowercase.
Split on any non-alphabetic character (e.g., spaces, punctuation, digits).
Discard empty tokens.
Treat each string as a set of unique tokens (ignore duplicates).
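
The four rules above can be sketched as a small Python helper. This is one possible reading, not a reference solution: the name `tokenize` is hypothetical, and the sketch assumes "alphabetic" means the ASCII letters a-z, which the problem statement does not spell out.

```python
import re
from typing import Set


def tokenize(text: str) -> Set[str]:
    """Return the set of unique lowercase alphabetic tokens in text."""
    # Assumption: "alphabetic" means ASCII letters a-z only.
    # Collecting maximal runs of letters is equivalent to splitting on
    # every non-letter character and discarding the empty tokens.
    # Wrapping the result in set() drops duplicate tokens.
    return set(re.findall(r"[a-z]+", text.lower()))


print(sorted(tokenize("I like coffee, coffee")))  # ['coffee', 'i', 'like']
```

Matching letter runs with `re.findall` avoids a separate filtering pass for empty tokens, which `re.split` would produce around leading, trailing, or consecutive separators.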

Jaccard similarity

Let A be the token set from a and B from b.

$J(A,B) = \frac{|A \cap B|}{|A \cup B|}$

Output

Return the similarity as a floating-point number.

Edge cases

If both sets are empty, define the similarity as 1.0 (the union is empty, so the formula would otherwise divide by zero).
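
As a minimal sketch of the formula together with this edge case, the function below works directly on two token sets. The name `jaccard` and its parameter names are illustrative assumptions, not part of the problem statement.

```python
from typing import Set


def jaccard(tokens_a: Set[str], tokens_b: Set[str]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two token sets."""
    union = tokens_a | tokens_b
    if not union:
        # Both sets are empty: the problem defines the similarity as 1.0.
        return 1.0
    # True division returns a float, as the Output section requires.
    return len(tokens_a & tokens_b) / len(union)
```

Checking the union rather than each set separately is enough: the union is empty only when both sets are empty. If exactly one set is empty, the intersection is empty and the result is 0.0.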

Example

a = "I like coffee, coffee"
b = "coffee is great"
A = {i, like, coffee}
B = {coffee, is, great}
Intersection size = 1, Union size = 5 → similarity = 0.2
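
The example can be checked end to end with a short, self-contained sketch. The function name `jaccard_similarity` is hypothetical, and the code again assumes ASCII letters only. Instead of building the union set, it uses the inclusion-exclusion identity $|A \cup B| = |A| + |B| - |A \cap B|$, which here gives 3 + 3 - 1 = 5.

```python
import re


def jaccard_similarity(a: str, b: str) -> float:
    """Tokenize both strings and return the Jaccard similarity of their token sets."""
    # Assumption: tokens are maximal runs of ASCII letters after lowercasing.
    tokens_a = set(re.findall(r"[a-z]+", a.lower()))
    tokens_b = set(re.findall(r"[a-z]+", b.lower()))
    shared = len(tokens_a & tokens_b)
    # Inclusion-exclusion: |A ∪ B| = |A| + |B| - |A ∩ B|.
    total = len(tokens_a) + len(tokens_b) - shared
    # Both sets empty: similarity is defined as 1.0.
    return 1.0 if total == 0 else shared / total


print(jaccard_similarity("I like coffee, coffee", "coffee is great"))  # 0.2
```

The running time is linear in the combined length of the two strings on average, since set construction and intersection rely on hashing.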