PracHub
QuestionsPremiumLearningGuidesCheatsheetNEWCoaches

Quick Overview

This question evaluates understanding of similarity metrics (Jaccard similarity), set-based data representations, and the competency to implement and reason about list- or token-level similarity comparisons.

  • Medium
  • Shopify
  • Coding & Algorithms
  • Data Scientist

Identify Pirate-Themed Custom Themes Using Jaccard Similarity

Company: Shopify

Role: Data Scientist

Category: Coding & Algorithms

Difficulty: Medium

Interview Round: Technical Screen

##### Scenario Engineering receives two lists and must identify custom themes that look like Pirate themes. ##### Question a) Write a Python function that computes the Jaccard similarity len(intersection)/len(union) between any two lists. b) Given pirate_themes (list[dict]) and custom_themes (list[dict]), return custom themes whose similarity to at least one pirate theme exceeds a chosen threshold. c) Describe how you would handle edge cases and scale this computation to millions of themes. ##### Hints Use set operations; handle empty lists; consider MinHash or locality-sensitive hashing for scale.

Quick Answer: This question evaluates understanding of similarity metrics (Jaccard similarity), set-based data representations, and the competency to implement and reason about list- or token-level similarity comparisons.

You are given two lists: pirate_themes and custom_themes. Each element is a dict with keys 'name' (string) and 'tags' (list of strings). Define the Jaccard similarity between two tag lists as |intersection| / |union| using set semantics (duplicates ignored). If both tag lists are empty, define similarity as 1.0. Return the list of names of custom themes whose similarity to at least one pirate theme is greater than or equal to threshold. Preserve the order of custom_themes in the output. Tag comparison is case-sensitive. If pirate_themes is empty, return an empty list.

Constraints

  • 0 <= len(pirate_themes), len(custom_themes) <= 10^4
  • Each theme dict contains keys: 'name' (unique string) and 'tags' (list of strings)
  • 0 <= len(tags) <= 10^3 per theme; duplicates in 'tags' are ignored
  • 0.0 <= threshold <= 1.0
  • Tag comparison is case-sensitive
  • Return names in the same order as they appear in custom_themes

Solution

def find_similar_custom_themes(pirate_themes: list[dict], custom_themes: list[dict], threshold: float) -> list[str]:
    # Precompute pirate tag sets
    pirate_sets = [set(t.get("tags", [])) for t in pirate_themes]

    def jaccard(a: set, b: set) -> float:
        # Both empty => similarity 1.0
        union_size = len(a | b)
        if union_size == 0:
            return 1.0
        inter_size = len(a & b)
        return inter_size / union_size

    result = []
    for c in custom_themes:
        cset = set(c.get("tags", []))
        for pset in pirate_sets:
            if jaccard(pset, cset) >= threshold:
                result.append(c.get("name"))
                break
    return result
Explanation
Convert each theme's tag list to a set to ignore duplicates. For each custom theme, compute the Jaccard similarity against each pirate theme: similarity = |A ∩ B| / |A ∪ B|. If the union is empty (both sets empty), define similarity as 1.0. Add the custom theme's name to the answer as soon as any pirate theme meets or exceeds the threshold. Preserve input order by iterating custom_themes in order and appending when qualified.

Time complexity: O(P * C * T), where P is the number of pirate themes, C is the number of custom themes, and T is the average number of unique tags per theme (cost of set operations).. Space complexity: O(P * T) for storing pirate tag sets; O(1) additional space aside from input and output..

Hints

  1. Convert each theme's tags list to a set to ignore duplicates.
  2. Jaccard similarity is len(A ∩ B) / len(A ∪ B); if the union is empty, treat similarity as 1.0.
  3. Precompute sets for pirate themes to avoid repeated conversions.
  4. Short-circuit once any pirate theme meets the threshold for a custom theme.
  5. For large-scale data, consider MinHash/LSH or inverted-index filtering to reduce pairwise comparisons.
Last updated: Mar 29, 2026

Loading coding console...

PracHub

Master your tech interviews with 7,500+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.

Related Coding Questions

  • Compute Jaccard Similarity for Lists - Shopify (medium)
  • Implement URL Shortening Codec - Shopify (medium)
  • Simulate a rover fleet - Shopify (medium)
  • Simulate robot moves on a grid - Shopify (medium)
  • Implement a Capacity-Bounded Cache - Shopify (medium)