Implement Cosine Similarity Function for String Vectors
Company: Shopify
Role: Data Scientist
Category: Coding & Algorithms
Difficulty: Medium
Interview Round: Technical Screen
Quick Answer: This question evaluates understanding of vector-space similarity and text representation by asking for a cosine similarity computation over string-derived vectors, testing coding and algorithmic skills relevant to data science and text processing.
Constraints
- 0 <= len(a), len(b) <= 10^5
- Strings may contain letters, digits, punctuation, and whitespace.
- Tokenization is case-insensitive and considers maximal runs of [a-z0-9] as words.
- Return a float rounded to 6 decimal places; return 0.0 when either string yields no tokens.
Examples
Input: ("the cat sat", "the cat sat")
Expected Output: 1.0
Explanation: Identical bag-of-words vectors point in the same direction, so cosine similarity is exactly 1.0.
Input: ("the cat", "the dog")
Expected Output: 0.5
Explanation: Both vectors are [1,1] over their union {the,cat,dog}; they share only "the". dot=1, magnitudes=sqrt(2) each, so 1/2 = 0.5.
Input: ("apple banana", "cherry date")
Expected Output: 0.0
Explanation: No shared words, so the dot product is 0 and similarity is 0.0.
Input: ("", "hello world")
Expected Output: 0.0
Explanation: The first string yields no tokens, so the function short-circuits to 0.0 (no division by zero).
Input: ("", "")
Expected Output: 0.0
Explanation: Both strings are empty; with no tokens on either side the result is 0.0.
Input: ("the the the cat", "the cat cat cat")
Expected Output: 0.6
Explanation: Counts are {the:3,cat:1} and {the:1,cat:3}. dot=3*1+1*3=6, each magnitude=sqrt(9+1)=sqrt(10), so 6/10 = 0.6 — frequency weighting matters.
Input: ("Hello, World!", "hello world")
Expected Output: 1.0
Explanation: Case and punctuation are normalized away, so the two strings tokenize identically and similarity is 1.0.
Input: ("data science", "data science is fun")
Expected Output: 0.707107
Explanation: dot=2 (both share data, science), magnitudes sqrt(2) and sqrt(4)=2, so 2/(sqrt(2)*2)=0.707107 after rounding to 6 dp.
Hints
- Tokenize each string into lowercase words (a simple approach: split on non-alphanumeric characters and drop empties), then build a frequency Counter for each.
- Cosine similarity = dot(u, v) / (|u| * |v|). The dot product only needs the words common to both vectors; the magnitude is sqrt(sum of squared counts).
- Guard the empty/no-token case explicitly to avoid dividing by a zero magnitude, and round the final value to 6 decimal places.