Identify Pirate-Themed Custom Themes Using Jaccard Similarity
Company: Shopify
Role: Data Scientist
Category: Coding & Algorithms
Difficulty: Medium
Interview Round: Technical Screen
Quick Answer: This question evaluates understanding of similarity metrics (Jaccard similarity), set-based data representations, and the competency to implement and reason about list- or token-level similarity comparisons.
Constraints
- 0 <= len(pirate_themes), len(custom_themes) <= 10^4
- Each theme dict contains keys: 'name' (unique string) and 'tags' (list of strings)
- 0 <= len(tags) <= 10^3 per theme; duplicates in 'tags' are ignored
- 0.0 <= threshold <= 1.0
- Tag comparison is case-sensitive
- Return names in the same order as they appear in custom_themes
Solution
def find_similar_custom_themes(pirate_themes: list[dict], custom_themes: list[dict], threshold: float) -> list[str]:
# Precompute pirate tag sets
pirate_sets = [set(t.get("tags", [])) for t in pirate_themes]
def jaccard(a: set, b: set) -> float:
# Both empty => similarity 1.0
union_size = len(a | b)
if union_size == 0:
return 1.0
inter_size = len(a & b)
return inter_size / union_size
result = []
for c in custom_themes:
cset = set(c.get("tags", []))
for pset in pirate_sets:
if jaccard(pset, cset) >= threshold:
result.append(c.get("name"))
break
return result
Explanation
Time complexity: O(P * C * T), where P is the number of pirate themes, C is the number of custom themes, and T is the average number of unique tags per theme (cost of set operations).. Space complexity: O(P * T) for storing pirate tag sets; O(1) additional space aside from input and output..
Hints
- Convert each theme's tags list to a set to ignore duplicates.
- Jaccard similarity is len(A ∩ B) / len(A ∪ B); if the union is empty, treat similarity as 1.0.
- Precompute sets for pirate themes to avoid repeated conversions.
- Short-circuit once any pirate theme meets the threshold for a custom theme.
- For large-scale data, consider MinHash/LSH or inverted-index filtering to reduce pairwise comparisons.