PracHub

Quick Overview

This question evaluates understanding of vector-space similarity and text representation by asking for a cosine similarity computation over string-derived vectors, testing coding and algorithmic skills relevant to data science and text processing.

  • Medium
  • Shopify
  • Coding & Algorithms
  • Data Scientist

Implement Cosine Similarity Function for String Vectors

Company: Shopify

Role: Data Scientist

Category: Coding & Algorithms

Difficulty: Medium

Interview Round: Technical Screen

##### Scenario

Technical phone screen in Python; assess ability to implement similarity metric.

##### Question

Implement a Python function that computes the cosine similarity between two strings (treat each string as a bag-of-words vector).

##### Hints

Tokenize, count word frequencies, form vectors, dot product divided by magnitudes; handle empty inputs.


Implement a function that computes the cosine similarity between two strings, treating each string as a bag-of-words vector. Tokenize each string into lowercase alphanumeric words, count word frequencies to form a term-frequency vector, then return the dot product of the two vectors divided by the product of their magnitudes. Normalize by lowercasing and ignoring punctuation so that "Hello, World!" and "hello world" are identical. If either string contains no words (e.g. empty or punctuation-only), return 0.0. Round the result to 6 decimal places.

Examples:

  • solution("the cat", "the dog") -> 0.5 (vectors share only the word "the")
  • solution("data science", "data science is fun") -> 0.707107
  • solution("", "hello world") -> 0.0
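
Written out as a formula, the computation described above looks like the following. The symbols are chosen here for illustration: u_w and v_w denote the number of times word w occurs in the first and second string, and the sums run over every word that appears in either string.

```latex
\cos(u, v) \;=\; \frac{\sum_{w} u_w \, v_w}{\sqrt{\sum_{w} u_w^{2}} \; \sqrt{\sum_{w} v_w^{2}}}

% Worked instance for ("the cat", "the dog"), vocabulary {the, cat, dog}:
% u = (1, 1, 0), v = (1, 0, 1)
\cos(u, v) \;=\; \frac{1 \cdot 1 + 1 \cdot 0 + 0 \cdot 1}{\sqrt{2}\,\sqrt{2}} \;=\; \frac{1}{2} \;=\; 0.5
```

Words missing from one string contribute zero to the numerator, so only shared words matter for the dot product, while every word in a string contributes to that string's magnitude.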

Constraints

  • 0 <= len(a), len(b) <= 10^5
  • Strings may contain letters, digits, punctuation, and whitespace.
  • Tokenization is case-insensitive and considers maximal runs of [a-z0-9] as words.
  • Return a float rounded to 6 decimal places; return 0.0 when either string yields no tokens.
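
The tokenization rule in the constraints can be sketched with a single regular expression. This is a minimal illustration, not a prescribed implementation: the helper name `tokenize` is hypothetical, and the sketch assumes ASCII-oriented input, so any character outside [a-z0-9] after lowercasing acts as a separator.

```python
import re


def tokenize(text: str) -> list[str]:
    """Return the maximal runs of [a-z0-9] in the lowercased text (hypothetical helper)."""
    # Lowercasing first makes the match case-insensitive; findall returns
    # non-overlapping maximal matches, so punctuation and whitespace split words.
    return re.findall(r"[a-z0-9]+", text.lower())


print(tokenize("Hello, World!"))  # ['hello', 'world']
print(tokenize("?!..."))          # []
```

A punctuation-only string yields an empty list, which is the "no tokens" case that must return 0.0.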

Examples

Input: ("the cat sat", "the cat sat")

Expected Output: 1.0

Explanation: Identical bag-of-words vectors point in the same direction, so cosine similarity is exactly 1.0.

Input: ("the cat", "the dog")

Expected Output: 0.5

Explanation: Over the union vocabulary {the,cat,dog} the vectors are [1,1,0] and [1,0,1]; they share only "the". dot=1, magnitudes=sqrt(2) each, so 1/(sqrt(2)*sqrt(2)) = 1/2 = 0.5.

Input: ("apple banana", "cherry date")

Expected Output: 0.0

Explanation: No shared words, so the dot product is 0 and similarity is 0.0.

Input: ("", "hello world")

Expected Output: 0.0

Explanation: The first string yields no tokens, so the function short-circuits to 0.0 (no division by zero).

Input: ("", "")

Expected Output: 0.0

Explanation: Both strings are empty; with no tokens on either side the result is 0.0.

Input: ("the the the cat", "the cat cat cat")

Expected Output: 0.6

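Explanation: Counts are {the:3,cat:1} and {the:1,cat:3}. dot=3*1+1*3=6, each magnitude=sqrt(9+1)=sqrt(10), so 6/(sqrt(10)*sqrt(10)) = 6/10 = 0.6. The strings contain the same set of words, yet the similarity is below 1.0 because the word frequencies differ.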

Input: ("Hello, World!", "hello world")

Expected Output: 1.0

Explanation: Case and punctuation are normalized away, so the two strings tokenize identically and similarity is 1.0.

Input: ("data science", "data science is fun")

Expected Output: 0.707107

Explanation: dot=2 (both share data, science), magnitudes sqrt(2) and sqrt(4)=2, so 2/(sqrt(2)*2)=0.707107 after rounding to 6 dp.
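
To make the vectors in these explanations concrete, the sketch below builds explicit count vectors over the sorted union vocabulary and reproduces the frequency-weighted example. The helper name `dense_vectors` is hypothetical, and the repeated `list.count` calls are only reasonable for small illustrative inputs, not for strings near the 10^5 limit.

```python
import math
import re


def dense_vectors(a: str, b: str) -> tuple[list[str], list[int], list[int]]:
    """Return (vocab, u, v): aligned count vectors over the sorted union vocabulary."""
    words_a = re.findall(r"[a-z0-9]+", a.lower())
    words_b = re.findall(r"[a-z0-9]+", b.lower())
    vocab = sorted(set(words_a) | set(words_b))
    u = [words_a.count(w) for w in vocab]  # count of each vocab word in a
    v = [words_b.count(w) for w in vocab]  # count of each vocab word in b
    return vocab, u, v


vocab, u, v = dense_vectors("the the the cat", "the cat cat cat")
# vocab = ['cat', 'the'], u = [1, 3], v = [3, 1]
dot = sum(x * y for x, y in zip(u, v))      # 1*3 + 3*1 = 6
norm_u = math.sqrt(sum(x * x for x in u))   # sqrt(10)
norm_v = math.sqrt(sum(y * y for y in v))   # sqrt(10)
print(round(dot / (norm_u * norm_v), 6))    # 0.6
```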

Hints

  1. Tokenize each string into lowercase words (a simple approach: split on non-alphanumeric characters and drop empties), then build a frequency Counter for each.
  2. Cosine similarity = dot(u, v) / (|u| * |v|). The dot product only needs the words common to both vectors; the magnitude is sqrt(sum of squared counts).
  3. Guard the empty/no-token case explicitly to avoid dividing by a zero magnitude, and round the final value to 6 decimal places.
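
Putting the three hints together, one possible solution is sketched below. It is an illustration under stated assumptions rather than a reference answer: it uses the name `solution` from the examples, a regex for the [a-z0-9] tokenization rule, and `collections.Counter` for the frequency vectors.

```python
import math
import re
from collections import Counter


def solution(a: str, b: str) -> float:
    """Cosine similarity of two strings treated as bag-of-words count vectors."""
    counts_a = Counter(re.findall(r"[a-z0-9]+", a.lower()))
    counts_b = Counter(re.findall(r"[a-z0-9]+", b.lower()))

    # Guard: a string with no tokens has zero magnitude, so return 0.0
    # instead of dividing by zero.
    if not counts_a or not counts_b:
        return 0.0

    # Iterate over the smaller vocabulary; a Counter returns 0 for a missing
    # word, so words absent from the other string add nothing to the dot product.
    if len(counts_a) > len(counts_b):
        counts_a, counts_b = counts_b, counts_a
    dot = sum(count * counts_b[word] for word, count in counts_a.items())

    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return round(dot / (norm_a * norm_b), 6)


print(solution("the cat", "the dog"))                      # 0.5
print(solution("data science", "data science is fun"))     # 0.707107
print(solution("", "hello world"))                         # 0.0
```

Tokenizing and counting are linear in the input length, and the dot product touches each distinct word of the smaller vocabulary once, so the whole function runs in O(len(a) + len(b)) time.
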
Last updated: Jun 25, 2026


Related Coding Questions

  • Grid Robot Command Simulator - Shopify (medium)
  • Compute Theme Similarity - Shopify (medium)
  • Compute Jaccard Similarity for Lists - Shopify (medium)
  • Implement URL Shortening Codec - Shopify (medium)
  • Simulate a rover fleet - Shopify (medium)