What difficulty level is this coding question?

This is a Medium difficulty Coding & Algorithms question, commonly asked during Technical Screen rounds at Shopify.

What role is this question designed for?

This question is commonly asked for Data Scientist candidates at Shopify during technical interviews.

Implement Cosine Similarity Function for String Vectors

Company: Shopify

Role: Data Scientist

Category: Coding & Algorithms

Difficulty: Medium

Interview Round: Technical Screen

##### Scenario

Technical phone screen in Python; assess ability to implement similarity metric.

##### Question

Implement a Python function that computes the cosine similarity between two strings (treat each string as a bag-of-words vector).

##### Hints

Tokenize, count word frequencies, form vectors, dot product divided by magnitudes; handle empty inputs.

Quick Answer: This question evaluates understanding of vector-space similarity and text representation by asking for a cosine similarity computation over string-derived vectors, testing coding and algorithmic skills relevant to data science and text processing.

Implement a function that computes the cosine similarity between two strings, treating each string as a bag-of-words vector. Tokenize each string into lowercase alphanumeric words, count word frequencies to form a term-frequency vector, then return the dot product of the two vectors divided by the product of their magnitudes.

Normalize by lowercasing and ignoring punctuation so that "Hello, World!" and "hello world" are identical. If either string contains no words (e.g. empty or punctuation-only), return 0.0. Round the result to 6 decimal places.

Examples:

- solution("the cat", "the dog") -> 0.5 (vectors share only the word "the")
- solution("data science", "data science is fun") -> 0.707107
- solution("", "hello world") -> 0.0

Constraints

0 <= len(a), len(b) <= 10^5
Strings may contain letters, digits, punctuation, and whitespace.
Tokenization is case-insensitive and considers maximal runs of [a-z0-9] as words.
Return a float rounded to 6 decimal places; return 0.0 when either string yields no tokens.
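
The tokenization constraint can be sketched as a single regular expression. This is a minimal sketch, assuming Python 3.9+; the helper name `tokenize` and the module-level `_WORD` pattern are illustrative and not part of the original problem statement.

```python
import re

# Hypothetical helper; the pattern mirrors the constraint
# "maximal runs of [a-z0-9]" applied after lowercasing.
_WORD = re.compile(r"[a-z0-9]+")


def tokenize(text: str) -> list[str]:
    """Return the lowercase alphanumeric words found in text."""
    return _WORD.findall(text.lower())


print(tokenize("Hello, World!"))  # ['hello', 'world']
print(tokenize("?!..."))          # []
```

With this pattern, underscores and any character outside `[a-z0-9]` act as separators, which is stricter than `\w` or `str.isalnum`.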

Examples

Input: ("the cat sat", "the cat sat")

Expected Output: 1.0

Explanation: Identical bag-of-words vectors point in the same direction, so cosine similarity is exactly 1.0.

Input: ("the cat", "the dog")

Expected Output: 0.5

Explanation: Over the union {the,cat,dog} the vectors are [1,1,0] and [1,0,1]; they share only "the". dot=1, magnitudes=sqrt(2) each, so 1/(sqrt(2)*sqrt(2)) = 1/2 = 0.5.

Input: ("apple banana", "cherry date")

Expected Output: 0.0

Explanation: No shared words, so the dot product is 0 and similarity is 0.0.

Input: ("", "hello world")

Expected Output: 0.0

Explanation: The first string yields no tokens, so the function short-circuits to 0.0 (no division by zero).

Input: ("", "")

Expected Output: 0.0

Explanation: Both strings are empty; with no tokens on either side the result is 0.0.

Input: ("the the the cat", "the cat cat cat")

Expected Output: 0.6

Explanation: Counts are {the:3,cat:1} and {the:1,cat:3}. dot=3*1+1*3=6, each magnitude=sqrt(9+1)=sqrt(10), so 6/(sqrt(10)*sqrt(10)) = 6/10 = 0.6. Frequency weighting matters: with binary word presence instead of counts, the similarity would be 1.0.

Input: ("Hello, World!", "hello world")

Expected Output: 1.0

Explanation: Case and punctuation are normalized away, so the two strings tokenize identically and similarity is 1.0.

Input: ("data science", "data science is fun")

Expected Output: 0.707107

Explanation: dot=2 (both share data, science), magnitudes sqrt(2) and sqrt(4)=2, so 2/(sqrt(2)*2)=0.707107 after rounding to 6 dp.
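
To make the arithmetic in these explanations concrete, the sketch below builds explicit count vectors over the union vocabulary of two strings and reproduces the frequency-weighted example. It is only an illustration: the helper name `dense_vectors` is hypothetical, and the vocabulary is sorted purely so that vector positions are deterministic.

```python
import math
import re
from collections import Counter


def dense_vectors(a: str, b: str):
    """Build count vectors for a and b over their sorted union vocabulary."""
    counts_a = Counter(re.findall(r"[a-z0-9]+", a.lower()))
    counts_b = Counter(re.findall(r"[a-z0-9]+", b.lower()))
    vocab = sorted(set(counts_a) | set(counts_b))
    # A Counter returns 0 for missing keys, so absent words become 0 entries.
    u = [counts_a[w] for w in vocab]
    v = [counts_b[w] for w in vocab]
    return vocab, u, v


vocab, u, v = dense_vectors("the the the cat", "the cat cat cat")
# vocab == ['cat', 'the'], u == [1, 3], v == [3, 1]

dot = sum(x * y for x, y in zip(u, v))      # 1*3 + 3*1 = 6
norm_u = math.sqrt(sum(x * x for x in u))   # sqrt(10)
norm_v = math.sqrt(sum(y * y for y in v))   # sqrt(10)
print(round(dot / (norm_u * norm_v), 6))    # 0.6
```

Dense vectors are convenient for checking examples by hand; for long inputs, working directly on the sparse word counts avoids materializing zero entries.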

Hints

Tokenize each string into lowercase words (a simple approach: lowercase the string, split on any character outside [a-z0-9], and drop empties), then build a frequency Counter for each.
Cosine similarity = dot(u, v) / (|u| * |v|). The dot product only needs the words common to both vectors; the magnitude is sqrt(sum of squared counts).
Guard the empty/no-token case explicitly to avoid dividing by a zero magnitude, and round the final value to 6 decimal places.
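
The three hints above can be combined into one possible solution. This is a sketch rather than the official answer: the name `solution` follows the examples in the problem statement, while the single-square-root normalization and the choice to iterate over the smaller counter are design choices made here.

```python
import math
import re
from collections import Counter


def solution(a: str, b: str) -> float:
    """Cosine similarity of two strings treated as bag-of-words vectors."""
    counts_a = Counter(re.findall(r"[a-z0-9]+", a.lower()))
    counts_b = Counter(re.findall(r"[a-z0-9]+", b.lower()))

    # Guard: a string with no tokens has zero magnitude.
    if not counts_a or not counts_b:
        return 0.0

    # Only shared words contribute to the dot product, so iterate
    # over the smaller counter and look words up in the larger one.
    if len(counts_a) > len(counts_b):
        counts_a, counts_b = counts_b, counts_a
    dot = sum(n * counts_b[word] for word, n in counts_a.items())

    # Squared magnitudes are exact integers; taking one square root of
    # their product is equivalent to multiplying the two magnitudes.
    sq_a = sum(n * n for n in counts_a.values())
    sq_b = sum(n * n for n in counts_b.values())
    return round(dot / math.sqrt(sq_a * sq_b), 6)


print(solution("the cat", "the dog"))                      # 0.5
print(solution("data science", "data science is fun"))     # 0.707107
print(solution("", "hello world"))                         # 0.0
```

Tokenizing and counting are linear in the input length, and the dot product touches each distinct word of the smaller vocabulary once, so the whole function runs in O(len(a) + len(b)) time.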
