Implement two functions:

(1) jaccard_similarity(s1, s2): tokenize by splitting on any non-alphanumeric character, lowercase, treat tokens as sets (duplicates ignored), and return |A ∩ B| / |A ∪ B|, with 0.0 when the union is empty.

(2) rank_by_jaccard(reviews, target): return the reviews sorted descending by similarity to target, stable on ties by original index.

Discuss how you would handle very long inputs (streaming/iterators), Unicode and punctuation edge cases, and how to achieve O(T) expected time, where T is the total number of tokens.

Finally, extend rank_by_jaccard to return the top-k results efficiently for k ≪ n without computing full similarities for all n reviews; justify your data structures and complexity.
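A minimal reference sketch of one possible solution (the helper names `tokenize` and `top_k_by_jaccard` are my own, not part of the prompt; the ASCII-only token regex is an assumption that the Unicode discussion asked for above would relax, e.g. via `str.isalnum`):

```python
import heapq
import re

# Assumption: ASCII-only tokenization. For full Unicode support,
# a pattern like r"\W+" or filtering with str.isalnum is preferable.
_NON_ALNUM = re.compile(r"[^0-9a-zA-Z]+")


def tokenize(text):
    """Lowercase, split on runs of non-alphanumeric chars, return a set."""
    return {t for t in _NON_ALNUM.split(text.lower()) if t}


def jaccard_similarity(s1, s2):
    """|A ∩ B| / |A ∪ B| over token sets; 0.0 when the union is empty."""
    a, b = tokenize(s1), tokenize(s2)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_by_jaccard(reviews, target):
    """Reviews sorted descending by similarity to target, stable on ties."""
    t = tokenize(target)  # tokenize the target once, not per comparison

    def sim(tokens):
        union = tokens | t
        return len(tokens & t) / len(union) if union else 0.0

    # sorted() is stable, so negating the key preserves original order on ties.
    return sorted(reviews, key=lambda r: -sim(tokenize(r)))


def top_k_by_jaccard(reviews, target, k):
    """Top-k reviews via a size-k min-heap: O(T + n log k) instead of O(n log n)."""
    t = tokenize(target)
    heap = []  # min-heap of (similarity, -index, review); -index favors earlier reviews on ties
    for i, r in enumerate(reviews):
        tokens = tokenize(r)
        union = tokens | t
        s = len(tokens & t) / len(union) if union else 0.0
        item = (s, -i, r)
        if len(heap) < k:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            heapq.heapreplace(heap, item)  # evict the current k-th best
    # Highest similarity first; ties broken by smaller original index.
    return [r for _, _, r in sorted(heap, reverse=True)]
```

Tokenization and set intersection are O(T) expected via hashing, and the heap bounds memory at O(k). Note that this sketch still *tokenizes* all n reviews; to skip reviews entirely, one would add an inverted index from tokens to review ids so that only candidates sharing at least one token with the target are ever scored (all others have similarity 0).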