PracHub
QuestionsCoachesLearningGuidesInterview Prep

Quick Overview

This question evaluates string tokenization, set-based similarity metrics (Jaccard index) and basic set operations, testing competency in text processing and algorithmic reasoning within the Coding & Algorithms domain and requiring practical application of these concepts.

  • medium
  • Moveworks
  • Coding & Algorithms
  • Software Engineer

Compute Jaccard similarity between two strings

Company: Moveworks

Role: Software Engineer

Category: Coding & Algorithms

Difficulty: medium

Interview Round: Onsite

## Jaccard Similarity of Two Strings Given two strings `a` and `b`, compute their **Jaccard similarity** based on token sets. ### Tokenization rules - Convert to lowercase. - Split on any non-alphabetic character (e.g., spaces, punctuation). - Discard empty tokens. - Treat each string as a **set** of unique tokens (ignore duplicates). ### Jaccard similarity Let `A` be the token set from `a` and `B` from `b`. \[ J(A,B) = \frac{|A \cap B|}{|A \cup B|} \] ### Output Return the similarity as a floating-point number. ### Edge cases - If both sets are empty, define similarity as `1.0`. ### Example - `a = "I like coffee, coffee"` - `b = "coffee is great"` - `A = {i, like, coffee}` - `B = {coffee, is, great}` - Intersection size = 1, Union size = 5 → similarity = `0.2`

Quick Answer: This question evaluates string tokenization, set-based similarity metrics (Jaccard index) and basic set operations, testing competency in text processing and algorithmic reasoning within the Coding & Algorithms domain and requiring practical application of these concepts.

Given two strings `a` and `b`, compute their **Jaccard similarity** based on token sets. ### Tokenization rules - Convert to lowercase. - Split on any non-alphabetic character (spaces, punctuation, digits, etc.). - Discard empty tokens. - Treat each string as a **set** of unique tokens (duplicates are ignored). ### Jaccard similarity Let `A` be the token set from `a` and `B` the token set from `b`. ``` J(A, B) = |A ∩ B| / |A ∪ B| ``` Return the similarity as a floating-point number. ### Edge case - If **both** token sets are empty, define the similarity as `1.0`. ### Example - `a = "I like coffee, coffee"` - `b = "coffee is great"` - `A = {i, like, coffee}`, `B = {coffee, is, great}` - Intersection size = 1, Union size = 5 → similarity = `0.2`

Constraints

  • 0 <= len(a), len(b)
  • Strings may contain letters, digits, punctuation, and whitespace.
  • Tokenization is case-insensitive; split on any non-alphabetic character.
  • If both token sets are empty, the result is defined as 1.0.

Examples

Input: ("I like coffee, coffee", "coffee is great")

Expected Output: 0.2

Explanation: A = {i, like, coffee}, B = {coffee, is, great}. Intersection {coffee} = 1, union size = 5 → 1/5 = 0.2.

Input: ("", "")

Expected Output: 1.0

Explanation: Both token sets are empty, so by definition the similarity is 1.0.

Input: ("hello world", "hello world")

Expected Output: 1.0

Explanation: Identical token sets {hello, world}; intersection equals union, so 2/2 = 1.0.

Input: ("apple", "banana")

Expected Output: 0.0

Explanation: Disjoint sets {apple} and {banana}; intersection 0, union 2 → 0.0.

Input: ("The quick brown fox", "the QUICK red fox")

Expected Output: 0.6

Explanation: Case-insensitive: A = {the, quick, brown, fox}, B = {the, quick, red, fox}. Intersection {the, quick, fox} = 3, union size = 5 → 3/5 = 0.6.

Input: ("a.b.c", "b,c,d")

Expected Output: 0.5

Explanation: Punctuation splits tokens: A = {a, b, c}, B = {b, c, d}. Intersection {b, c} = 2, union {a, b, c, d} = 4 → 0.5.

Input: ("only here", "")

Expected Output: 0.0

Explanation: A = {only, here}, B = {}. Intersection 0, union 2 → 0.0. Only one side empty, so not the 1.0 special case.

Input: ("123 456", "hello")

Expected Output: 0.0

Explanation: Digits are non-alphabetic, so "123 456" yields an empty token set. A = {}, B = {hello}; intersection 0, union 1 → 0.0.

Hints

  1. Lowercase each string, then split on runs of non-alphabetic characters and drop empty pieces to build a set of unique tokens.
  2. Use set intersection and union: similarity = |A ∩ B| / |A ∪ B|.
  3. Handle the special case first: if the union is empty (both sets empty), return 1.0 to avoid dividing by zero.
Last updated: Jun 26, 2026

Related Coding Questions

  • Find a secret word via match feedback - Moveworks (medium)
  • Select next Hangman letter - Moveworks (Medium)

Loading coding console...

PracHub

Master your tech interviews with 8,000+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • AI Coding Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.