Identify Pirate-Themed Custom Themes Using Jaccard Similarity
Company: Shopify
Role: Data Scientist
Category: Coding & Algorithms
Difficulty: Medium
Interview Round: Technical Screen
##### Scenario
Engineering receives two lists and must identify custom themes that look like Pirate themes.
##### Question
a) Write a Python function that computes the Jaccard similarity len(intersection)/len(union) between any two lists.
b) Given pirate_themes (list[dict]) and custom_themes (list[dict]), return custom themes whose similarity to at least one pirate theme exceeds a chosen threshold.
c) Describe how you would handle edge cases and scale this computation to millions of themes.
##### Hints
Use set operations; handle empty lists; consider MinHash or locality-sensitive hashing for scale.
Quick Answer: This question evaluates understanding of similarity metrics (Jaccard similarity), set-based data representations, and the competency to implement and reason about list- or token-level similarity comparisons.
You are given two lists: pirate_themes and custom_themes. Each element is a dict with keys 'name' (string) and 'tags' (list of strings). Define the Jaccard similarity between two tag lists as |intersection| / |union| using set semantics (duplicates ignored). If both tag lists are empty, define similarity as 1.0. Return the list of names of custom themes whose similarity to at least one pirate theme is greater than or equal to threshold. Preserve the order of custom_themes in the output. Tag comparison is case-sensitive. If pirate_themes is empty, return an empty list.
Constraints
- 0 <= len(pirate_themes), len(custom_themes) <= 10^4
- Each theme dict contains keys: 'name' (unique string) and 'tags' (list of strings)
- 0 <= len(tags) <= 10^3 per theme; duplicates in 'tags' are ignored
- 0.0 <= threshold <= 1.0
- Tag comparison is case-sensitive
- Return names in the same order as they appear in custom_themes
Hints
- Convert each theme's tags list to a set to ignore duplicates.
- Jaccard similarity is len(A ∩ B) / len(A ∪ B); if the union is empty, treat similarity as 1.0.
- Precompute sets for pirate themes to avoid repeated conversions.
- Short-circuit once any pirate theme meets the threshold for a custom theme.
- For large-scale data, consider MinHash/LSH or inverted-index filtering to reduce pairwise comparisons.