Implement Python functions to compare theme similarity using the Jaccard coefficient.
Part 1: Basic Jaccard similarity
Given two lists of strings, implement a function that returns their Jaccard similarity:
Jaccard(A, B) = size(intersection(A, B)) / size(union(A, B))
Requirements:
- Return a float in [0, 1].
- Treat the input lists as sets, so duplicate strings should not affect the score.
- If both lists are empty, return 1.0.
- If only one list is empty, return 0.0.
Example:
list_1 = ["a", "b", "c", "d"]
list_2 = ["b", "c", "d", "e"]
The expected score is 3 / 5 = 0.6.
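One possible way to satisfy the Part 1 requirements is sketched below. The function name jaccard_similarity and its parameter names are assumptions made for illustration; the problem statement does not prescribe them.

```python
from typing import Iterable


def jaccard_similarity(list_1: Iterable[str], list_2: Iterable[str]) -> float:
    """Return the Jaccard similarity of two collections, treated as sets."""
    # Converting to sets removes duplicates, so they cannot affect the score.
    set_1, set_2 = set(list_1), set(list_2)
    union = set_1 | set_2
    if not union:
        # Both inputs are empty: the requirements define this case as 1.0.
        return 1.0
    # If exactly one input is empty, the intersection is empty and the
    # result is 0.0 without needing a separate branch.
    return len(set_1 & set_2) / len(union)


list_1 = ["a", "b", "c", "d"]
list_2 = ["b", "c", "d", "e"]
print(jaccard_similarity(list_1, list_2))  # 0.6
```

Checking the union first handles both empty-input rules with a single branch, since a non-empty union guarantees a non-zero denominator.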
Part 2: Identify likely pirated custom themes
You are given two lists of dictionaries:
pirated_themes = [
{"theme_id": "p1", "features": ["a", "b", "c"]},
{"theme_id": "p2", "features": ["x", "y", "z"]}
]
custom_themes = [
{"theme_id": "c1", "features": ["a", "b", "d"]},
{"theme_id": "c2", "features": ["x", "y", "z"]}
]
Each theme dictionary contains:
- theme_id: a unique theme identifier.
- features: a list of strings representing extracted theme attributes, assets, file hashes, CSS classes, or other comparable signals.
Implement a function:
def find_likely_pirated_themes(pirated_themes, custom_themes, threshold=0.8):
    ...
For each custom theme:
- Compare it with every known pirated theme using Jaccard similarity over features.
- Find the best matching pirated theme.
- Return custom themes whose best similarity score is greater than or equal to threshold.
The returned result should include, for each flagged custom theme:
- custom_theme_id
- matched_pirated_theme_id
- similarity_score
Handle missing or empty features lists gracefully, do not mutate the input objects, and make the output deterministic.
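A minimal sketch of Part 2 follows. Several choices here are assumptions, not requirements from the statement: a missing or None features value is treated as an empty list; a custom theme with no features is skipped, since two empty sets would otherwise score 1.0; ties between pirated themes are broken by the smallest theme_id; and the output is sorted by custom_theme_id to make it deterministic. The helper jaccard_similarity is a hypothetical name, and theme_id values are assumed to be strings.

```python
def jaccard_similarity(list_1, list_2):
    """Jaccard similarity of two collections, treated as sets."""
    set_1, set_2 = set(list_1), set(list_2)
    union = set_1 | set_2
    if not union:
        return 1.0
    return len(set_1 & set_2) / len(union)


def find_likely_pirated_themes(pirated_themes, custom_themes, threshold=0.8):
    # Build new feature sets instead of touching the input dictionaries.
    # Assumption: missing or None "features" counts as an empty list.
    pirated = [
        (theme["theme_id"], frozenset(theme.get("features") or []))
        for theme in pirated_themes
    ]
    # Sorting by theme_id makes tie-breaking independent of input order.
    pirated.sort(key=lambda item: item[0])

    flagged = []
    for custom in custom_themes:
        features = frozenset(custom.get("features") or [])
        if not features:
            # Assumption: no features means no evidence, so never flag.
            continue
        best_id, best_score = None, -1.0
        for pirated_id, pirated_features in pirated:
            score = jaccard_similarity(features, pirated_features)
            # Strict ">" keeps the smallest theme_id when scores tie.
            if score > best_score:
                best_id, best_score = pirated_id, score
        if best_id is not None and best_score >= threshold:
            flagged.append({
                "custom_theme_id": custom["theme_id"],
                "matched_pirated_theme_id": best_id,
                "similarity_score": best_score,
            })
    # Assumption: order the result by custom_theme_id for determinism.
    flagged.sort(key=lambda row: row["custom_theme_id"])
    return flagged


pirated_themes = [
    {"theme_id": "p1", "features": ["a", "b", "c"]},
    {"theme_id": "p2", "features": ["x", "y", "z"]},
]
custom_themes = [
    {"theme_id": "c1", "features": ["a", "b", "d"]},
    {"theme_id": "c2", "features": ["x", "y", "z"]},
]
print(find_likely_pirated_themes(pirated_themes, custom_themes))
# [{'custom_theme_id': 'c2', 'matched_pirated_theme_id': 'p2', 'similarity_score': 1.0}]
```

With the sample data, c1 shares 2 of 4 distinct features with p1 (score 0.5) and none with p2, so it falls below the default threshold of 0.8. Only c2, which matches p2 exactly, is flagged. The sketch compares every pair, so it runs in time proportional to the number of custom themes times the number of pirated themes.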