PracHub
QuestionsPremiumCoachesLearningGuidesInterview Prep

Quick Overview

This question evaluates understanding of set-based similarity metrics (the Jaccard coefficient) and practical coding skills for transforming and comparing feature token sets, targeting algorithmic thinking and data representation competencies within the Coding & Algorithms domain for a Data Scientist role.

  • medium
  • Shopify
  • Coding & Algorithms
  • Data Scientist

Compute Theme Similarity

Company: Shopify

Role: Data Scientist

Category: Coding & Algorithms

Difficulty: medium

Interview Round: Technical Screen

Implement a small set of Python functions to detect likely pirated Shopify themes by comparing their extracted feature sets using the **Jaccard similarity coefficient**. The Jaccard similarity between two sets $A$ and $B$ is defined as: $$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$ This problem has two parts: Part 1 builds the core similarity primitive, and Part 2 reuses it to scan a catalog of custom themes against a set of known pirated themes. ### Constraints & Assumptions - Inputs are plain Python lists of strings (Part 1) and lists of dictionaries (Part 2). No external libraries beyond the standard library are required. - `features` strings are treated as opaque tokens (extracted theme attributes, asset names, file hashes, CSS classes, etc.); equality is exact string equality. - Lists may contain duplicate strings; duplicates must not affect the score (the inputs are conceptually sets). - A single comparison runs over modest feature lists (tens to a few hundred tokens). The full pirated-vs-custom scan is $O(P \times C)$ pairwise comparisons, where $P$ and $C$ are the number of pirated and custom themes respectively — assume both fit comfortably in memory. - Functions must not mutate their input objects, and output ordering must be deterministic across runs. ### Clarifying Questions to Ask - Are `features` strings case-sensitive, and should they be normalized (trimmed/lowercased) before comparison, or compared exactly as given? - Is the threshold comparison inclusive (`>= threshold`) or strict (`> threshold`)? Should `threshold` be validated to lie in `[0, 1]`? - Can the same custom theme appear more than once in the input, and if so should it be flagged once or per occurrence? - For a flagged theme, do we need only the single best pirated match, or all pirated matches at or above the threshold? - Are `theme_id` values guaranteed unique and non-null within each list? ### Part 1: Basic Jaccard similarity Given two lists of strings, implement a function that returns their Jaccard similarity. Requirements: - Return a `float` in the closed interval `[0, 1]`. - Treat the input lists as sets, so duplicate strings must not change the score. - If **both** lists are empty, return `1.0` (two empty sets are identical). - If **exactly one** list is empty, return `0.0`. Example: ```python list_1 = ["a", "b", "c", "d"] list_2 = ["b", "c", "d", "e"] # intersection = {b, c, d} (size 3), union = {a, b, c, d, e} (size 5) # expected score = 3 / 5 = 0.6 ``` ```hint Data structure Converting each list to a `set` gives you O(1) membership and built-in `&` (intersection) and `|` (union) operators, and it deduplicates for free. ``` ```hint Edge cases Decide the empty-input behavior *before* dividing — when the union is empty (both inputs empty) the formula is $0/0$, so you must short-circuit it. Map "both empty → 1.0" and "one empty → 0.0" explicitly rather than relying on the division. ``` #### What This Part Should Cover - A correct, deduplicating implementation that maps the formula directly onto `set` operations. - All empty-input edge cases handled explicitly, with the division-by-zero case ($0/0$) short-circuited before any arithmetic. - A guaranteed `float` result in `[0, 1]`, plus awareness that comparing floats with `>=` near a boundary can invite rounding surprises. ### Part 2: Identify likely pirated custom themes You are given two lists of theme dictionaries: ```python pirated_themes = [ {"theme_id": "p1", "features": ["a", "b", "c"]}, {"theme_id": "p2", "features": ["x", "y", "z"]}, ] custom_themes = [ {"theme_id": "c1", "features": ["a", "b", "d"]}, {"theme_id": "c2", "features": ["x", "y", "z"]}, ] ``` Each theme dictionary contains: - `theme_id`: a unique theme identifier (string). - `features`: a list of strings representing extracted theme signals. This key may be missing or its value may be an empty list. Implement: ```python def find_likely_pirated_themes(pirated_themes, custom_themes, threshold=0.8): ... ``` For each custom theme: 1. Compare it against **every** known pirated theme using the Part 1 Jaccard similarity over `features`. 2. Find the single best-matching pirated theme (highest similarity). 3. Flag the custom theme if its best similarity score is **greater than or equal to** `threshold`. For each flagged custom theme, the returned result must include: - `custom_theme_id` - `matched_pirated_theme_id` - `similarity_score` Additional requirements: - Handle missing or empty `features` gracefully (a missing key must not raise). - Do not mutate any input object. - Make the output deterministic (a stable, well-defined ordering and a defined tie-break rule when two pirated themes score equally). ```hint Where to start Reuse your Part 1 function as the inner kernel. For each custom theme, loop over all pirated themes and keep a running "best" (score and the pirated `theme_id` that produced it). ``` ```hint Robustness Use `theme.get("features", [])` so a missing key yields an empty list instead of a `KeyError`. An empty custom feature set should never falsely match — your Part 1 empty-set rules already guarantee it scores `0.0` against any *non-empty* pirated theme. ``` ```hint Determinism & tie-breaks If the input order of `pirated_themes` is not guaranteed stable, two themes can tie for "best". Pick an explicit tie-break — e.g. iterate in a fixed order and only replace the best on a *strictly greater* score, or break ties by `theme_id`. Document whichever rule you choose so the output is reproducible. ``` #### Clarifying Questions for this Part - When a pirated theme's own `features` is empty, should it be skipped as a candidate entirely? (If not, Part 1's "both empty → 1.0" rule could make an empty custom theme match it and clear any `threshold <= 1.0`.) - Should a custom theme with missing/empty `features` ever appear in the output, or always be excluded? #### What This Part Should Cover - Clean reuse of the Part 1 function as the inner kernel rather than re-deriving the union/intersection math. - Graceful handling of missing / `None` / empty `features` on both sides (no exceptions) and no mutation of any input. - An explicit, *stated* tie-break rule together with a deterministic output ordering. - A return shape matching the spec exactly (`custom_theme_id`, `matched_pirated_theme_id`, `similarity_score`), plus a guard that avoids emitting a bogus flag when `pirated_themes` is empty. ### What a Strong Answer Covers These dimensions span both parts: - Modular design where Part 2 is a thin scan layered on top of the single Part 1 primitive, with no duplicated similarity math. - Awareness of overall correctness boundaries: scores live in `[0, 1]`, float-comparison caution near the `threshold`, and the $O(P \times C)$ cost of the all-pairs scan. - A few well-chosen test cases that exercise both parts together: identical sets, disjoint sets, partial overlap, empty/missing features, and duplicate tokens. ### Follow-up Questions - The naive scan is $O(P \times C)$ pairwise comparisons. If there were millions of themes, how would you make near-duplicate detection scalable (e.g. MinHash + Locality-Sensitive Hashing, or inverted indexes on features)? - Jaccard weights every feature equally. How would you incorporate the fact that some features (a rare custom file hash) are far more discriminative than common ones (a generic CSS class) — e.g. TF-IDF weighting or weighted Jaccard? - How would you choose and validate the `threshold` in production — what precision/recall trade-off, and how would you measure it against labeled piracy cases? - The `features` here are exact tokens. How would you extend this to detect *near*-identical assets (minified CSS with renamed classes, re-encoded images) rather than only exact-match signals?

Quick Answer: This question evaluates understanding of set-based similarity metrics (the Jaccard coefficient) and practical coding skills for transforming and comparing feature token sets, targeting algorithmic thinking and data representation competencies within the Coding & Algorithms domain for a Data Scientist role.

Jaccard Similarity of Two Token Lists

Implement a function that computes the **Jaccard similarity coefficient** between two lists of strings, treating each list as a set. The Jaccard similarity between two sets $A$ and $B$ is: $$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$ **Requirements:** - Return a `float` in the closed interval `[0, 1]`. - Treat the input lists as sets, so duplicate strings must not change the score. - Tokens are compared by exact string equality (case-sensitive). - If **both** lists are empty, return `1.0` (two empty sets are identical). - If **exactly one** list is empty, return `0.0`. - Do not mutate the inputs. **Example:** ``` list_a = ["a", "b", "c", "d"] list_b = ["b", "c", "d", "e"] # intersection = {b, c, d} (size 3), union = {a, b, c, d, e} (size 5) # expected = 3 / 5 = 0.6 ```

Constraints

  • Inputs are plain lists of strings (may be empty).
  • Lists may contain duplicate strings; duplicates must not affect the score.
  • Tokens are compared by exact, case-sensitive string equality.
  • Both empty -> 1.0; exactly one empty -> 0.0.
  • Result is a float in the closed interval [0, 1].

Examples

Input: (["a", "b", "c", "d"], ["b", "c", "d", "e"])

Expected Output: 0.6

Explanation: intersection {b,c,d}=3, union {a,b,c,d,e}=5, so 3/5 = 0.6.

Input: ([], [])

Expected Output: 1.0

Explanation: Two empty sets are identical by definition.

Input: (["a"], [])

Expected Output: 0.0

Explanation: Exactly one side empty: no possible overlap.

Input: (["a", "a", "b"], ["a", "b"])

Expected Output: 1.0

Explanation: Duplicates are ignored: both reduce to {a, b}, so the score is 1.0.

Input: (["a", "b"], ["c", "d"])

Expected Output: 0.0

Explanation: Disjoint sets: empty intersection over union of size 4 = 0.0.

Input: (["x", "y", "z"], ["x", "y", "z"])

Expected Output: 1.0

Explanation: Identical sets: intersection equals union, so 3/3 = 1.0.

Hints

  1. Converting each list to a set deduplicates for free and unlocks the & (intersection) and | (union) operators that mirror the formula directly.
  2. Handle the empty-input cases explicitly *before* dividing: when both inputs are empty the union is empty and the formula is the undefined 0/0. Map 'both empty -> 1.0' and 'one empty -> 0.0' as short-circuits.
  3. intersection / union is true division in Python 3, so the result is always a float, and since 0 <= |A & B| <= |A | B| it always lands in [0, 1].

Identify Likely Pirated Custom Themes

You are given two lists of theme dictionaries. Each theme dict has a `theme_id` (unique string) and a `features` key whose value is a list of strings (this key may be **missing**, `None`, or an empty list). Implement `find_likely_pirated_themes(pirated_themes, custom_themes, threshold=0.8)`. For each custom theme: 1. Compare it against **every** pirated theme using the Jaccard similarity over `features` (same primitive as Part 1). 2. Find the single best-matching pirated theme (highest similarity). 3. Flag the custom theme if its best similarity score is **>= threshold**. For each flagged custom theme, the result dict must contain exactly: - `custom_theme_id` - `matched_pirated_theme_id` - `similarity_score` **Additional requirements:** - Handle missing / `None` / empty `features` gracefully (a missing key must not raise). - Do not mutate any input object. - Output is deterministic: emit custom themes in input order; ties for 'best pirated match' are broken toward the **first** pirated theme in input order (use strict `>` so the earlier winner is kept). - If `pirated_themes` is empty, never emit a bogus flag. **Example:** ``` pirated = [{"theme_id": "p1", "features": ["a","b","c"]}, {"theme_id": "p2", "features": ["x","y","z"]}] custom = [{"theme_id": "c1", "features": ["a","b","d"]}, # best vs p1: 0.5 {"theme_id": "c2", "features": ["x","y","z"]}, # best vs p2: 1.0 {"theme_id": "c3"}, # no features {"theme_id": "c4", "features": []}] # empty features # threshold=0.8 -> only c2 clears the bar. ```

Constraints

  • Each theme dict has a unique 'theme_id' string.
  • 'features' may be missing, None, or an empty list — must not raise.
  • Inner kernel is the Part 1 Jaccard similarity over the 'features' lists.
  • Flag a custom theme when its best score is >= threshold (inclusive).
  • No input object may be mutated.
  • Deterministic output: custom themes in input order; ties broken toward the first pirated theme (strict '>').
  • Empty pirated_themes -> no flags emitted.

Examples

Input: ([{"theme_id": "p1", "features": ["a", "b", "c"]}, {"theme_id": "p2", "features": ["x", "y", "z"]}], [{"theme_id": "c1", "features": ["a", "b", "d"]}, {"theme_id": "c2", "features": ["x", "y", "z"]}, {"theme_id": "c3"}, {"theme_id": "c4", "features": []}], 0.8)

Expected Output: [{"custom_theme_id": "c2", "matched_pirated_theme_id": "p2", "similarity_score": 1.0}]

Explanation: c1 peaks at 0.5 (vs p1), c2 hits 1.0 (vs p2), c3/c4 have no features so score 0.0; only c2 clears 0.8.

Input: ([], [{"theme_id": "c1", "features": ["a", "b"]}], 0.8)

Expected Output: []

Explanation: No pirated themes means no candidate is ever set, so nothing is flagged.

Input: ([{"theme_id": "p1", "features": ["a", "b", "c"]}], [{"theme_id": "c1", "features": ["a", "b", "c"]}], 0.8)

Expected Output: [{"custom_theme_id": "c1", "matched_pirated_theme_id": "p1", "similarity_score": 1.0}]

Explanation: Identical feature sets score 1.0, clearing the threshold.

Input: ([{"theme_id": "p1", "features": ["a", "b", "c", "d"]}, {"theme_id": "p2", "features": ["a", "b", "c", "d"]}], [{"theme_id": "c1", "features": ["a", "b", "c", "d"]}], 0.5)

Expected Output: [{"custom_theme_id": "c1", "matched_pirated_theme_id": "p1", "similarity_score": 1.0}]

Explanation: Both pirated themes tie at 1.0; the strict '>' tie-break keeps p1, the first in input order.

Input: ([{"theme_id": "p1", "features": ["a", "b", "c"]}], [{"theme_id": "c1", "features": ["x", "y", "z"]}], 0.8)

Expected Output: []

Explanation: Disjoint feature sets score 0.0, below the threshold, so nothing is flagged.

Input: ([{"theme_id": "p1", "features": ["a", "b", "c", "d"]}], [{"theme_id": "c1", "features": ["a", "b", "c"]}], 0.75)

Expected Output: [{"custom_theme_id": "c1", "matched_pirated_theme_id": "p1", "similarity_score": 0.75}]

Explanation: intersection {a,b,c}=3 over union {a,b,c,d}=4 = 0.75, which exactly meets the inclusive >= threshold.

Hints

  1. Reuse the Part 1 Jaccard function as the inner kernel — don't re-derive the union/intersection math.
  2. Use theme.get('features') or [] so a missing key, None, or [] all resolve to an empty list; an empty custom feature set then scores 0.0 against any non-empty pirated theme and is never falsely flagged.
  3. Keep a running best (score, pirated_id) per custom theme and only replace it on a *strictly greater* score, which deterministically keeps the first pirated theme on a tie. Guard the final emit with 'best_pirated_id is not None' so an empty pirated list produces no bogus flag.
Last updated: Jun 26, 2026

Loading coding console...

PracHub

Master your tech interviews with 8,000+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.

Related Coding Questions

  • Grid Robot Command Simulator - Shopify (medium)
  • Compute Jaccard Similarity for Lists - Shopify (medium)
  • Implement URL Shortening Codec - Shopify (medium)
  • Simulate a rover fleet - Shopify (medium)
  • Simulate robot moves on a grid - Shopify (medium)