How do I practice coding and algorithm questions?

Use PracHub's coding console to write, test, and debug your solutions in Python or JavaScript. View hints, test against sample inputs, and compare with official solutions.

What difficulty level is this coding question?

This is a medium difficulty Coding & Algorithms question, commonly asked during Technical Screen rounds at Shopify.

What role is this question designed for?

This question is commonly asked for Data Scientist candidates at Shopify during technical interviews.

Compute Theme Similarity | Shopify Coding Question

Quick Overview

This question evaluates understanding of set-based similarity metrics (the Jaccard coefficient) and practical coding skills for transforming and comparing feature token sets, targeting algorithmic thinking and data representation competencies within the Coding & Algorithms domain for a Data Scientist role.

Compute Theme Similarity

Company: Shopify

Role: Data Scientist

Category: Coding & Algorithms

Difficulty: medium

Interview Round: Technical Screen

Implement a small set of Python functions to detect likely pirated Shopify themes by comparing their extracted feature sets using the **Jaccard similarity coefficient**. The Jaccard similarity between two sets $A$ and $B$ is defined as: $$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$ This problem has two parts: Part 1 builds the core similarity primitive, and Part 2 reuses it to scan a catalog of custom themes against a set of known pirated themes. ### Constraints & Assumptions - Inputs are plain Python lists of strings (Part 1) and lists of dictionaries (Part 2). No external libraries beyond the standard library are required. - `features` strings are treated as opaque tokens (extracted theme attributes, asset names, file hashes, CSS classes, etc.); equality is exact string equality. - Lists may contain duplicate strings; duplicates must not affect the score (the inputs are conceptually sets). - A single comparison runs over modest feature lists (tens to a few hundred tokens). The full pirated-vs-custom scan is $O(P \times C)$ pairwise comparisons, where $P$ and $C$ are the number of pirated and custom themes respectively — assume both fit comfortably in memory. - Functions must not mutate their input objects, and output ordering must be deterministic across runs. ### Clarifying Questions to Ask - Are `features` strings case-sensitive, and should they be normalized (trimmed/lowercased) before comparison, or compared exactly as given? - Is the threshold comparison inclusive (`>= threshold`) or strict (`> threshold`)? Should `threshold` be validated to lie in `[0, 1]`? - Can the same custom theme appear more than once in the input, and if so should it be flagged once or per occurrence? - For a flagged theme, do we need only the single best pirated match, or all pirated matches at or above the threshold? - Are `theme_id` values guaranteed unique and non-null within each list? ### Part 1: Basic Jaccard similarity Given two lists of strings, implement a function that returns their Jaccard similarity. Requirements: - Return a `float` in the closed interval `[0, 1]`. - Treat the input lists as sets, so duplicate strings must not change the score. - If **both** lists are empty, return `1.0` (two empty sets are identical). - If **exactly one** list is empty, return `0.0`. Example: ```python list_1 = ["a", "b", "c", "d"] list_2 = ["b", "c", "d", "e"] # intersection = {b, c, d} (size 3), union = {a, b, c, d, e} (size 5) # expected score = 3 / 5 = 0.6 ``` ```hint Data structure Converting each list to a `set` gives you O(1) membership and built-in `&` (intersection) and `|` (union) operators, and it deduplicates for free. ``` ```hint Edge cases Decide the empty-input behavior *before* dividing — when the union is empty (both inputs empty) the formula is $0/0$, so you must short-circuit it. Map "both empty → 1.0" and "one empty → 0.0" explicitly rather than relying on the division. ``` #### What This Part Should Cover - A correct, deduplicating implementation that maps the formula directly onto `set` operations. - All empty-input edge cases handled explicitly, with the division-by-zero case ($0/0$) short-circuited before any arithmetic. - A guaranteed `float` result in `[0, 1]`, plus awareness that comparing floats with `>=` near a boundary can invite rounding surprises. ### Part 2: Identify likely pirated custom themes You are given two lists of theme dictionaries: ```python pirated_themes = [ {"theme_id": "p1", "features": ["a", "b", "c"]}, {"theme_id": "p2", "features": ["x", "y", "z"]}, ] custom_themes = [ {"theme_id": "c1", "features": ["a", "b", "d"]}, {"theme_id": "c2", "features": ["x", "y", "z"]}, ] ``` Each theme dictionary contains: - `theme_id`: a unique theme identifier (string). - `features`: a list of strings representing extracted theme signals. This key may be missing or its value may be an empty list. Implement: ```python def find_likely_pirated_themes(pirated_themes, custom_themes, threshold=0.8): ... ``` For each custom theme: 1. Compare it against **every** known pirated theme using the Part 1 Jaccard similarity over `features`. 2. Find the single best-matching pirated theme (highest similarity). 3. Flag the custom theme if its best similarity score is **greater than or equal to** `threshold`. For each flagged custom theme, the returned result must include: - `custom_theme_id` - `matched_pirated_theme_id` - `similarity_score` Additional requirements: - Handle missing or empty `features` gracefully (a missing key must not raise). - Do not mutate any input object. - Make the output deterministic (a stable, well-defined ordering and a defined tie-break rule when two pirated themes score equally). ```hint Where to start Reuse your Part 1 function as the inner kernel. For each custom theme, loop over all pirated themes and keep a running "best" (score and the pirated `theme_id` that produced it). ``` ```hint Robustness Use `theme.get("features", [])` so a missing key yields an empty list instead of a `KeyError`. An empty custom feature set should never falsely match — your Part 1 empty-set rules already guarantee it scores `0.0` against any *non-empty* pirated theme. ``` ```hint Determinism & tie-breaks If the input order of `pirated_themes` is not guaranteed stable, two themes can tie for "best". Pick an explicit tie-break — e.g. iterate in a fixed order and only replace the best on a *strictly greater* score, or break ties by `theme_id`. Document whichever rule you choose so the output is reproducible. ``` #### Clarifying Questions for this Part - When a pirated theme's own `features` is empty, should it be skipped as a candidate entirely? (If not, Part 1's "both empty → 1.0" rule could make an empty custom theme match it and clear any `threshold <= 1.0`.) - Should a custom theme with missing/empty `features` ever appear in the output, or always be excluded? #### What This Part Should Cover - Clean reuse of the Part 1 function as the inner kernel rather than re-deriving the union/intersection math. - Graceful handling of missing / `None` / empty `features` on both sides (no exceptions) and no mutation of any input. - An explicit, *stated* tie-break rule together with a deterministic output ordering. - A return shape matching the spec exactly (`custom_theme_id`, `matched_pirated_theme_id`, `similarity_score`), plus a guard that avoids emitting a bogus flag when `pirated_themes` is empty. ### What a Strong Answer Covers These dimensions span both parts: - Modular design where Part 2 is a thin scan layered on top of the single Part 1 primitive, with no duplicated similarity math. - Awareness of overall correctness boundaries: scores live in `[0, 1]`, float-comparison caution near the `threshold`, and the $O(P \times C)$ cost of the all-pairs scan. - A few well-chosen test cases that exercise both parts together: identical sets, disjoint sets, partial overlap, empty/missing features, and duplicate tokens. ### Follow-up Questions - The naive scan is $O(P \times C)$ pairwise comparisons. If there were millions of themes, how would you make near-duplicate detection scalable (e.g. MinHash + Locality-Sensitive Hashing, or inverted indexes on features)? - Jaccard weights every feature equally. How would you incorporate the fact that some features (a rare custom file hash) are far more discriminative than common ones (a generic CSS class) — e.g. TF-IDF weighting or weighted Jaccard? - How would you choose and validate the `threshold` in production — what precision/recall trade-off, and how would you measure it against labeled piracy cases? - The `features` here are exact tokens. How would you extend this to detect *near*-identical assets (minified CSS with renamed classes, re-encoded images) rather than only exact-match signals?

Quick Answer: This question evaluates understanding of set-based similarity metrics (the Jaccard coefficient) and practical coding skills for transforming and comparing feature token sets, targeting algorithmic thinking and data representation competencies within the Coding & Algorithms domain for a Data Scientist role.

Jaccard Similarity of Two Token Lists

Implement a function that computes the **Jaccard similarity coefficient** between two lists of strings, treating each list as a set. The Jaccard similarity between two sets $A$ and $B$ is: $$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$ **Requirements:** - Return a `float` in the closed interval `[0, 1]`. - Treat the input lists as sets, so duplicate strings must not change the score. - Tokens are compared by exact string equality (case-sensitive). - If **both** lists are empty, return `1.0` (two empty sets are identical). - If **exactly one** list is empty, return `0.0`. - Do not mutate the inputs. **Example:** ``` list_a = ["a", "b", "c", "d"] list_b = ["b", "c", "d", "e"] # intersection = {b, c, d} (size 3), union = {a, b, c, d, e} (size 5) # expected = 3 / 5 = 0.6 ```

Constraints

Inputs are plain lists of strings (may be empty).
Lists may contain duplicate strings; duplicates must not affect the score.
Tokens are compared by exact, case-sensitive string equality.
Both empty -> 1.0; exactly one empty -> 0.0.
Result is a float in the closed interval [0, 1].

Examples

Input: (["a", "b", "c", "d"], ["b", "c", "d", "e"])

Expected Output: 0.6

Explanation: intersection {b,c,d}=3, union {a,b,c,d,e}=5, so 3/5 = 0.6.

Input: ([], [])

Expected Output: 1.0

Explanation: Two empty sets are identical by definition.

Input: (["a"], [])

Expected Output: 0.0

Explanation: Exactly one side empty: no possible overlap.

Input: (["a", "a", "b"], ["a", "b"])

Expected Output: 1.0

Explanation: Duplicates are ignored: both reduce to {a, b}, so the score is 1.0.

Input: (["a", "b"], ["c", "d"])

Expected Output: 0.0

Explanation: Disjoint sets: empty intersection over union of size 4 = 0.0.

Input: (["x", "y", "z"], ["x", "y", "z"])

Expected Output: 1.0

Explanation: Identical sets: intersection equals union, so 3/3 = 1.0.

Hints

Converting each list to a set deduplicates for free and unlocks the & (intersection) and | (union) operators that mirror the formula directly.
Handle the empty-input cases explicitly *before* dividing: when both inputs are empty the union is empty and the formula is the undefined 0/0. Map 'both empty -> 1.0' and 'one empty -> 0.0' as short-circuits.
intersection / union is true division in Python 3, so the result is always a float, and since 0 <= |A & B| <= |A | B| it always lands in [0, 1].

Identify Likely Pirated Custom Themes

You are given two lists of theme dictionaries. Each theme dict has a `theme_id` (unique string) and a `features` key whose value is a list of strings (this key may be **missing**, `None`, or an empty list). Implement `find_likely_pirated_themes(pirated_themes, custom_themes, threshold=0.8)`. For each custom theme: 1. Compare it against **every** pirated theme using the Jaccard similarity over `features` (same primitive as Part 1). 2. Find the single best-matching pirated theme (highest similarity). 3. Flag the custom theme if its best similarity score is **>= threshold**. For each flagged custom theme, the result dict must contain exactly: - `custom_theme_id` - `matched_pirated_theme_id` - `similarity_score` **Additional requirements:** - Handle missing / `None` / empty `features` gracefully (a missing key must not raise). - Do not mutate any input object. - Output is deterministic: emit custom themes in input order; ties for 'best pirated match' are broken toward the **first** pirated theme in input order (use strict `>` so the earlier winner is kept). - If `pirated_themes` is empty, never emit a bogus flag. **Example:** ``` pirated = [{"theme_id": "p1", "features": ["a","b","c"]}, {"theme_id": "p2", "features": ["x","y","z"]}] custom = [{"theme_id": "c1", "features": ["a","b","d"]}, # best vs p1: 0.5 {"theme_id": "c2", "features": ["x","y","z"]}, # best vs p2: 1.0 {"theme_id": "c3"}, # no features {"theme_id": "c4", "features": []}] # empty features # threshold=0.8 -> only c2 clears the bar. ```

Constraints

Each theme dict has a unique 'theme_id' string.
'features' may be missing, None, or an empty list — must not raise.
Inner kernel is the Part 1 Jaccard similarity over the 'features' lists.
Flag a custom theme when its best score is >= threshold (inclusive).
No input object may be mutated.
Deterministic output: custom themes in input order; ties broken toward the first pirated theme (strict '>').
Empty pirated_themes -> no flags emitted.

Examples

Input: ([{"theme_id": "p1", "features": ["a", "b", "c"]}, {"theme_id": "p2", "features": ["x", "y", "z"]}], [{"theme_id": "c1", "features": ["a", "b", "d"]}, {"theme_id": "c2", "features": ["x", "y", "z"]}, {"theme_id": "c3"}, {"theme_id": "c4", "features": []}], 0.8)

Expected Output: [{"custom_theme_id": "c2", "matched_pirated_theme_id": "p2", "similarity_score": 1.0}]

Explanation: c1 peaks at 0.5 (vs p1), c2 hits 1.0 (vs p2), c3/c4 have no features so score 0.0; only c2 clears 0.8.

Input: ([], [{"theme_id": "c1", "features": ["a", "b"]}], 0.8)

Expected Output: []

Explanation: No pirated themes means no candidate is ever set, so nothing is flagged.

Input: ([{"theme_id": "p1", "features": ["a", "b", "c"]}], [{"theme_id": "c1", "features": ["a", "b", "c"]}], 0.8)

Expected Output: [{"custom_theme_id": "c1", "matched_pirated_theme_id": "p1", "similarity_score": 1.0}]

Explanation: Identical feature sets score 1.0, clearing the threshold.

Input: ([{"theme_id": "p1", "features": ["a", "b", "c", "d"]}, {"theme_id": "p2", "features": ["a", "b", "c", "d"]}], [{"theme_id": "c1", "features": ["a", "b", "c", "d"]}], 0.5)

Expected Output: [{"custom_theme_id": "c1", "matched_pirated_theme_id": "p1", "similarity_score": 1.0}]

Explanation: Both pirated themes tie at 1.0; the strict '>' tie-break keeps p1, the first in input order.

Input: ([{"theme_id": "p1", "features": ["a", "b", "c"]}], [{"theme_id": "c1", "features": ["x", "y", "z"]}], 0.8)

Expected Output: []

Explanation: Disjoint feature sets score 0.0, below the threshold, so nothing is flagged.

Input: ([{"theme_id": "p1", "features": ["a", "b", "c", "d"]}], [{"theme_id": "c1", "features": ["a", "b", "c"]}], 0.75)

Expected Output: [{"custom_theme_id": "c1", "matched_pirated_theme_id": "p1", "similarity_score": 0.75}]

Explanation: intersection {a,b,c}=3 over union {a,b,c,d}=4 = 0.75, which exactly meets the inclusive >= threshold.

Hints

Reuse the Part 1 Jaccard function as the inner kernel — don't re-derive the union/intersection math.
Use theme.get('features') or [] so a missing key, None, or [] all resolve to an empty list; an empty custom feature set then scores 0.0 against any non-empty pirated theme and is never falsely flagged.
Keep a running best (score, pirated_id) per custom theme and only replace it on a *strictly greater* score, which deterministically keeps the first pirated theme on a tie. Guard the final emit with 'best_pirated_id is not None' so an empty pirated list produces no bogus flag.

Quick Overview

Compute Theme Similarity

Company: Shopify

Role: Data Scientist

Category: Coding & Algorithms

Difficulty: medium

Interview Round: Technical Screen

Jaccard Similarity of Two Token Lists

Constraints

Inputs are plain lists of strings (may be empty).
Lists may contain duplicate strings; duplicates must not affect the score.
Tokens are compared by exact, case-sensitive string equality.
Both empty -> 1.0; exactly one empty -> 0.0.
Result is a float in the closed interval [0, 1].

Examples

Input: (["a", "b", "c", "d"], ["b", "c", "d", "e"])

Expected Output: 0.6

Explanation: intersection {b,c,d}=3, union {a,b,c,d,e}=5, so 3/5 = 0.6.

Input: ([], [])

Expected Output: 1.0

Explanation: Two empty sets are identical by definition.

Input: (["a"], [])

Expected Output: 0.0

Explanation: Exactly one side empty: no possible overlap.

Input: (["a", "a", "b"], ["a", "b"])

Expected Output: 1.0

Explanation: Duplicates are ignored: both reduce to {a, b}, so the score is 1.0.

Input: (["a", "b"], ["c", "d"])

Expected Output: 0.0

Explanation: Disjoint sets: empty intersection over union of size 4 = 0.0.

Input: (["x", "y", "z"], ["x", "y", "z"])

Expected Output: 1.0

Explanation: Identical sets: intersection equals union, so 3/3 = 1.0.

Hints

Converting each list to a set deduplicates for free and unlocks the & (intersection) and | (union) operators that mirror the formula directly.
Handle the empty-input cases explicitly *before* dividing: when both inputs are empty the union is empty and the formula is the undefined 0/0. Map 'both empty -> 1.0' and 'one empty -> 0.0' as short-circuits.
intersection / union is true division in Python 3, so the result is always a float, and since 0 <= |A & B| <= |A | B| it always lands in [0, 1].

Identify Likely Pirated Custom Themes

Constraints

Each theme dict has a unique 'theme_id' string.
'features' may be missing, None, or an empty list — must not raise.
Inner kernel is the Part 1 Jaccard similarity over the 'features' lists.
Flag a custom theme when its best score is >= threshold (inclusive).
No input object may be mutated.
Deterministic output: custom themes in input order; ties broken toward the first pirated theme (strict '>').
Empty pirated_themes -> no flags emitted.