How do I practice coding and algorithm questions?

Use PracHub's coding console to write, test, and debug your solutions in Python or JavaScript. View hints, test against sample inputs, and compare with official solutions.

What difficulty level is this coding question?

This is a medium difficulty Coding & Algorithms question, commonly asked during Technical Screen rounds at Shopify.

What role is this question designed for?

This question is commonly asked for Data Scientist candidates at Shopify during technical interviews.

Compute Jaccard Similarity for Lists | Shopify Coding Question

Compute Jaccard Similarity for Lists

Company: Shopify

Role: Data Scientist

Category: Coding & Algorithms

Difficulty: medium

Interview Round: Technical Screen

Implement a Python function to compute the Jaccard similarity between two lists of strings. ```python def jaccard_similarity(a: list[str], b: list[str]) -> float: pass ``` ### Requirements - Treat each input list as a set of unique strings; duplicate strings within the same list should not affect the result. - Jaccard similarity is defined as: `size(intersection of unique strings) / size(union of unique strings)` - String comparison is case-sensitive, and whitespace is significant. - If both lists are empty, return `1.0`. - If exactly one list is empty, return `0.0`. - Do not use third-party libraries. - Aim for `O(n + m)` time complexity, where `n` and `m` are the lengths of the two input lists. ### Examples ```python jaccard_similarity(["red", "blue", "blue"], ["blue", "green"]) == 1 / 3 jaccard_similarity(["a", "b"], ["a", "b"]) == 1.0 jaccard_similarity([], ["a"]) == 0.0 jaccard_similarity([], []) == 1.0 ```

Quick Answer: This question evaluates understanding of Jaccard similarity and set-based operations, assessing competency in deduplicating lists, precise string comparisons (case-sensitivity and whitespace), and handling edge cases.

Write a function that computes the Jaccard similarity between two lists of strings. Treat each list as a set of unique strings, so duplicate values within the same list do not change the result. The Jaccard similarity is defined as the size of the intersection of the unique strings divided by the size of the union of the unique strings. String comparison is case-sensitive, and whitespace is significant. If both lists are empty, return 1.0. If exactly one list is empty, return 0.0. Your goal is to solve this in O(n + m) average time, where n and m are the lengths of the two input lists.

Constraints

0 <= len(a), len(b) <= 100000
Each element of `a` and `b` is a string
Duplicate strings within the same list should be ignored when computing similarity
String comparison is case-sensitive, and whitespace is significant
Do not use third-party libraries

Examples

Input: (["red", "blue", "blue"], ["blue", "green"])

Expected Output: 0.3333333333333333

Explanation: The unique sets are {"red", "blue"} and {"blue", "green"}. Their intersection has size 1 and their union has size 3, so the result is 1/3.

Input: (["a", "b"], ["a", "b"])

Expected Output: 1.0

Explanation: Both lists contain the same unique strings, so the intersection and union both have size 2.

Input: ([], ["a"])

Expected Output: 0.0

Explanation: Exactly one list is empty, so the similarity is defined as 0.0.

Input: ([], [])

Expected Output: 1.0

Explanation: Both lists are empty, so the similarity is defined as 1.0.

Input: (["A", "a", "x "], ["a", "x"])

Expected Output: 0.25

Explanation: Comparison is case-sensitive and whitespace is significant. The unique sets are {"A", "a", "x "} and {"a", "x"}. Their intersection has size 1 and union has size 4, so the result is 1/4.

Hints

Convert each list into a set first so duplicates do not affect the answer.
Be careful with the special case where both sets are empty, because the union size would be 0.

Quick Overview

This question evaluates understanding of Jaccard similarity and set-based operations, assessing competency in deduplicating lists, precise string comparisons (case-sensitivity and whitespace), and handling edge cases.