Compute Theme Similarity
Company: Shopify
Role: Data Scientist
Category: Coding & Algorithms
Difficulty: medium
Interview Round: Technical Screen
Implement a small set of Python functions to detect likely pirated Shopify themes by comparing their extracted feature sets using the **Jaccard similarity coefficient**.
The Jaccard similarity between two sets $A$ and $B$ is defined as:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
This problem has two parts: Part 1 builds the core similarity primitive, and Part 2 reuses it to scan a catalog of custom themes against a set of known pirated themes.
### Constraints & Assumptions
- Inputs are plain Python lists of strings (Part 1) and lists of dictionaries (Part 2). No external libraries beyond the standard library are required.
- `features` strings are treated as opaque tokens (extracted theme attributes, asset names, file hashes, CSS classes, etc.); equality is exact string equality.
- Lists may contain duplicate strings; duplicates must not affect the score (the inputs are conceptually sets).
- A single comparison runs over modest feature lists (tens to a few hundred tokens). The full pirated-vs-custom scan is $O(P \times C)$ pairwise comparisons, where $P$ and $C$ are the number of pirated and custom themes respectively — assume both fit comfortably in memory.
- Functions must not mutate their input objects, and output ordering must be deterministic across runs.
### Clarifying Questions to Ask
- Are `features` strings case-sensitive, and should they be normalized (trimmed/lowercased) before comparison, or compared exactly as given?
- Is the threshold comparison inclusive (`>= threshold`) or strict (`> threshold`)? Should `threshold` be validated to lie in `[0, 1]`?
- Can the same custom theme appear more than once in the input, and if so should it be flagged once or per occurrence?
- For a flagged theme, do we need only the single best pirated match, or all pirated matches at or above the threshold?
- Are `theme_id` values guaranteed unique and non-null within each list?
### Part 1: Basic Jaccard similarity
Given two lists of strings, implement a function that returns their Jaccard similarity.
Requirements:
- Return a `float` in the closed interval `[0, 1]`.
- Treat the input lists as sets, so duplicate strings must not change the score.
- If **both** lists are empty, return `1.0` (two empty sets are identical).
- If **exactly one** list is empty, return `0.0`.
Example:
```python
list_1 = ["a", "b", "c", "d"]
list_2 = ["b", "c", "d", "e"]
# intersection = {b, c, d} (size 3), union = {a, b, c, d, e} (size 5)
# expected score = 3 / 5 = 0.6
```
```hint Data structure
Converting each list to a `set` gives you O(1) membership and built-in `&` (intersection) and `|` (union) operators, and it deduplicates for free.
```
```hint Edge cases
Decide the empty-input behavior *before* dividing — when the union is empty (both inputs empty) the formula is $0/0$, so you must short-circuit it. Map "both empty → 1.0" and "one empty → 0.0" explicitly rather than relying on the division.
```
#### What This Part Should Cover
- A correct, deduplicating implementation that maps the formula directly onto `set` operations.
- All empty-input edge cases handled explicitly, with the division-by-zero case ($0/0$) short-circuited before any arithmetic.
- A guaranteed `float` result in `[0, 1]`, plus awareness that comparing floats with `>=` near a boundary can invite rounding surprises.
### Part 2: Identify likely pirated custom themes
You are given two lists of theme dictionaries:
```python
pirated_themes = [
{"theme_id": "p1", "features": ["a", "b", "c"]},
{"theme_id": "p2", "features": ["x", "y", "z"]},
]
custom_themes = [
{"theme_id": "c1", "features": ["a", "b", "d"]},
{"theme_id": "c2", "features": ["x", "y", "z"]},
]
```
Each theme dictionary contains:
- `theme_id`: a unique theme identifier (string).
- `features`: a list of strings representing extracted theme signals. This key may be missing or its value may be an empty list.
Implement:
```python
def find_likely_pirated_themes(pirated_themes, custom_themes, threshold=0.8):
...
```
For each custom theme:
1. Compare it against **every** known pirated theme using the Part 1 Jaccard similarity over `features`.
2. Find the single best-matching pirated theme (highest similarity).
3. Flag the custom theme if its best similarity score is **greater than or equal to** `threshold`.
For each flagged custom theme, the returned result must include:
- `custom_theme_id`
- `matched_pirated_theme_id`
- `similarity_score`
Additional requirements:
- Handle missing or empty `features` gracefully (a missing key must not raise).
- Do not mutate any input object.
- Make the output deterministic (a stable, well-defined ordering and a defined tie-break rule when two pirated themes score equally).
```hint Where to start
Reuse your Part 1 function as the inner kernel. For each custom theme, loop over all pirated themes and keep a running "best" (score and the pirated `theme_id` that produced it).
```
```hint Robustness
Use `theme.get("features", [])` so a missing key yields an empty list instead of a `KeyError`. An empty custom feature set should never falsely match — your Part 1 empty-set rules already guarantee it scores `0.0` against any *non-empty* pirated theme.
```
```hint Determinism & tie-breaks
If the input order of `pirated_themes` is not guaranteed stable, two themes can tie for "best". Pick an explicit tie-break — e.g. iterate in a fixed order and only replace the best on a *strictly greater* score, or break ties by `theme_id`. Document whichever rule you choose so the output is reproducible.
```
#### Clarifying Questions for this Part
- When a pirated theme's own `features` is empty, should it be skipped as a candidate entirely? (If not, Part 1's "both empty → 1.0" rule could make an empty custom theme match it and clear any `threshold <= 1.0`.)
- Should a custom theme with missing/empty `features` ever appear in the output, or always be excluded?
#### What This Part Should Cover
- Clean reuse of the Part 1 function as the inner kernel rather than re-deriving the union/intersection math.
- Graceful handling of missing / `None` / empty `features` on both sides (no exceptions) and no mutation of any input.
- An explicit, *stated* tie-break rule together with a deterministic output ordering.
- A return shape matching the spec exactly (`custom_theme_id`, `matched_pirated_theme_id`, `similarity_score`), plus a guard that avoids emitting a bogus flag when `pirated_themes` is empty.
### What a Strong Answer Covers
These dimensions span both parts:
- Modular design where Part 2 is a thin scan layered on top of the single Part 1 primitive, with no duplicated similarity math.
- Awareness of overall correctness boundaries: scores live in `[0, 1]`, float-comparison caution near the `threshold`, and the $O(P \times C)$ cost of the all-pairs scan.
- A few well-chosen test cases that exercise both parts together: identical sets, disjoint sets, partial overlap, empty/missing features, and duplicate tokens.
### Follow-up Questions
- The naive scan is $O(P \times C)$ pairwise comparisons. If there were millions of themes, how would you make near-duplicate detection scalable (e.g. MinHash + Locality-Sensitive Hashing, or inverted indexes on features)?
- Jaccard weights every feature equally. How would you incorporate the fact that some features (a rare custom file hash) are far more discriminative than common ones (a generic CSS class) — e.g. TF-IDF weighting or weighted Jaccard?
- How would you choose and validate the `threshold` in production — what precision/recall trade-off, and how would you measure it against labeled piracy cases?
- The `features` here are exact tokens. How would you extend this to detect *near*-identical assets (minified CSS with renamed classes, re-encoded images) rather than only exact-match signals?
Quick Answer: This question evaluates understanding of set-based similarity metrics (the Jaccard coefficient) and practical coding skills for transforming and comparing feature token sets, targeting algorithmic thinking and data representation competencies within the Coding & Algorithms domain for a Data Scientist role.
Jaccard Similarity of Two Token Lists
Implement a function that computes the **Jaccard similarity coefficient** between two lists of strings, treating each list as a set.
The Jaccard similarity between two sets $A$ and $B$ is:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
**Requirements:**
- Return a `float` in the closed interval `[0, 1]`.
- Treat the input lists as sets, so duplicate strings must not change the score.
- Tokens are compared by exact string equality (case-sensitive).
- If **both** lists are empty, return `1.0` (two empty sets are identical).
- If **exactly one** list is empty, return `0.0`.
- Do not mutate the inputs.
**Example:**
```
list_a = ["a", "b", "c", "d"]
list_b = ["b", "c", "d", "e"]
# intersection = {b, c, d} (size 3), union = {a, b, c, d, e} (size 5)
# expected = 3 / 5 = 0.6
```
Constraints
- Inputs are plain lists of strings (may be empty).
- Lists may contain duplicate strings; duplicates must not affect the score.
- Tokens are compared by exact, case-sensitive string equality.
- Both empty -> 1.0; exactly one empty -> 0.0.
- Result is a float in the closed interval [0, 1].
Examples
Input: (["a", "b", "c", "d"], ["b", "c", "d", "e"])
Expected Output: 0.6
Explanation: intersection {b,c,d}=3, union {a,b,c,d,e}=5, so 3/5 = 0.6.
Input: ([], [])
Expected Output: 1.0
Explanation: Two empty sets are identical by definition.
Input: (["a"], [])
Expected Output: 0.0
Explanation: Exactly one side empty: no possible overlap.
Input: (["a", "a", "b"], ["a", "b"])
Expected Output: 1.0
Explanation: Duplicates are ignored: both reduce to {a, b}, so the score is 1.0.
Input: (["a", "b"], ["c", "d"])
Expected Output: 0.0
Explanation: Disjoint sets: empty intersection over union of size 4 = 0.0.
Input: (["x", "y", "z"], ["x", "y", "z"])
Expected Output: 1.0
Explanation: Identical sets: intersection equals union, so 3/3 = 1.0.
Hints
- Converting each list to a set deduplicates for free and unlocks the & (intersection) and | (union) operators that mirror the formula directly.
- Handle the empty-input cases explicitly *before* dividing: when both inputs are empty the union is empty and the formula is the undefined 0/0. Map 'both empty -> 1.0' and 'one empty -> 0.0' as short-circuits.
- intersection / union is true division in Python 3, so the result is always a float, and since 0 <= |A & B| <= |A | B| it always lands in [0, 1].
Identify Likely Pirated Custom Themes
You are given two lists of theme dictionaries. Each theme dict has a `theme_id` (unique string) and a `features` key whose value is a list of strings (this key may be **missing**, `None`, or an empty list).
Implement `find_likely_pirated_themes(pirated_themes, custom_themes, threshold=0.8)`.
For each custom theme:
1. Compare it against **every** pirated theme using the Jaccard similarity over `features` (same primitive as Part 1).
2. Find the single best-matching pirated theme (highest similarity).
3. Flag the custom theme if its best similarity score is **>= threshold**.
For each flagged custom theme, the result dict must contain exactly:
- `custom_theme_id`
- `matched_pirated_theme_id`
- `similarity_score`
**Additional requirements:**
- Handle missing / `None` / empty `features` gracefully (a missing key must not raise).
- Do not mutate any input object.
- Output is deterministic: emit custom themes in input order; ties for 'best pirated match' are broken toward the **first** pirated theme in input order (use strict `>` so the earlier winner is kept).
- If `pirated_themes` is empty, never emit a bogus flag.
**Example:**
```
pirated = [{"theme_id": "p1", "features": ["a","b","c"]},
{"theme_id": "p2", "features": ["x","y","z"]}]
custom = [{"theme_id": "c1", "features": ["a","b","d"]}, # best vs p1: 0.5
{"theme_id": "c2", "features": ["x","y","z"]}, # best vs p2: 1.0
{"theme_id": "c3"}, # no features
{"theme_id": "c4", "features": []}] # empty features
# threshold=0.8 -> only c2 clears the bar.
```
Constraints
- Each theme dict has a unique 'theme_id' string.
- 'features' may be missing, None, or an empty list — must not raise.
- Inner kernel is the Part 1 Jaccard similarity over the 'features' lists.
- Flag a custom theme when its best score is >= threshold (inclusive).
- No input object may be mutated.
- Deterministic output: custom themes in input order; ties broken toward the first pirated theme (strict '>').
- Empty pirated_themes -> no flags emitted.
Examples
Input: ([{"theme_id": "p1", "features": ["a", "b", "c"]}, {"theme_id": "p2", "features": ["x", "y", "z"]}], [{"theme_id": "c1", "features": ["a", "b", "d"]}, {"theme_id": "c2", "features": ["x", "y", "z"]}, {"theme_id": "c3"}, {"theme_id": "c4", "features": []}], 0.8)
Expected Output: [{"custom_theme_id": "c2", "matched_pirated_theme_id": "p2", "similarity_score": 1.0}]
Explanation: c1 peaks at 0.5 (vs p1), c2 hits 1.0 (vs p2), c3/c4 have no features so score 0.0; only c2 clears 0.8.
Input: ([], [{"theme_id": "c1", "features": ["a", "b"]}], 0.8)
Expected Output: []
Explanation: No pirated themes means no candidate is ever set, so nothing is flagged.
Input: ([{"theme_id": "p1", "features": ["a", "b", "c"]}], [{"theme_id": "c1", "features": ["a", "b", "c"]}], 0.8)
Expected Output: [{"custom_theme_id": "c1", "matched_pirated_theme_id": "p1", "similarity_score": 1.0}]
Explanation: Identical feature sets score 1.0, clearing the threshold.
Input: ([{"theme_id": "p1", "features": ["a", "b", "c", "d"]}, {"theme_id": "p2", "features": ["a", "b", "c", "d"]}], [{"theme_id": "c1", "features": ["a", "b", "c", "d"]}], 0.5)
Expected Output: [{"custom_theme_id": "c1", "matched_pirated_theme_id": "p1", "similarity_score": 1.0}]
Explanation: Both pirated themes tie at 1.0; the strict '>' tie-break keeps p1, the first in input order.
Input: ([{"theme_id": "p1", "features": ["a", "b", "c"]}], [{"theme_id": "c1", "features": ["x", "y", "z"]}], 0.8)
Expected Output: []
Explanation: Disjoint feature sets score 0.0, below the threshold, so nothing is flagged.
Input: ([{"theme_id": "p1", "features": ["a", "b", "c", "d"]}], [{"theme_id": "c1", "features": ["a", "b", "c"]}], 0.75)
Expected Output: [{"custom_theme_id": "c1", "matched_pirated_theme_id": "p1", "similarity_score": 0.75}]
Explanation: intersection {a,b,c}=3 over union {a,b,c,d}=4 = 0.75, which exactly meets the inclusive >= threshold.
Hints
- Reuse the Part 1 Jaccard function as the inner kernel — don't re-derive the union/intersection math.
- Use theme.get('features') or [] so a missing key, None, or [] all resolve to an empty list; an empty custom feature set then scores 0.0 against any non-empty pirated theme and is never falsely flagged.
- Keep a running best (score, pirated_id) per custom theme and only replace it on a *strictly greater* score, which deterministically keeps the first pirated theme on a tie. Guard the final emit with 'best_pirated_id is not None' so an empty pirated list produces no bogus flag.