PracHub
QuestionsPremiumCoachesLearningGuidesInterview Prep
|Home/Coding & Algorithms/Yelp

Compute and Rank by Jaccard Similarity

Last updated: Mar 29, 2026

Quick Overview

The prompt evaluates set-based similarity metrics (Jaccard), robust tokenization and normalization for text data including Unicode and punctuation edge cases, streaming/iterator processing for very long inputs, and algorithmic efficiency for top-k retrieval and stable ranking.

  • Medium
  • Yelp
  • Coding & Algorithms
  • Data Scientist

Compute and Rank by Jaccard Similarity

Company: Yelp

Role: Data Scientist

Category: Coding & Algorithms

Difficulty: Medium

Interview Round: Onsite

Implement two functions: (1) jaccard_similarity(s1, s2) that tokenizes by splitting on any non-alphanumeric character, lowercases, treats tokens as sets (duplicates ignored), and returns |A ∩ B| / |A ∪ B| with 0.0 when the union is empty; (2) rank_by_jaccard(reviews, target) that returns the reviews sorted descending by similarity to target, stable on ties by original index. Discuss how you would handle very long inputs (streaming/iterators), Unicode and punctuation edge cases, and how to achieve O(T) expected time where T is total tokens. Finally, extend rank_by_jaccard to return the top-k efficiently for k ≪ n without computing full similarities for all n reviews; justify your data structures and complexity.

Quick Answer: The prompt evaluates set-based similarity metrics (Jaccard), robust tokenization and normalization for text data including Unicode and punctuation edge cases, streaming/iterator processing for very long inputs, and algorithmic efficiency for top-k retrieval and stable ranking.

Yelp logo
Yelp
Oct 13, 2025, 9:49 PM
Data Scientist
Onsite
Coding & Algorithms
3
0

Implement two functions: (1) jaccard_similarity(s1, s2) that tokenizes by splitting on any non-alphanumeric character, lowercases, treats tokens as sets (duplicates ignored), and returns |A ∩ B| / |A ∪ B| with 0.0 when the union is empty; (2) rank_by_jaccard(reviews, target) that returns the reviews sorted descending by similarity to target, stable on ties by original index. Discuss how you would handle very long inputs (streaming/iterators), Unicode and punctuation edge cases, and how to achieve O(T) expected time where T is total tokens. Finally, extend rank_by_jaccard to return the top-k efficiently for k ≪ n without computing full similarities for all n reviews; justify your data structures and complexity.

Submit Your Answer to Earn 20XP

Sign in to leave a comment

Loading comments...

Browse More Questions

More Coding & Algorithms•More Yelp•More Data Scientist•Yelp Data Scientist•Yelp Coding & Algorithms•Data Scientist Coding & Algorithms
PracHub

Master your tech interviews with 8,000+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.