PracHub
QuestionsPremiumCoachesLearningGuidesInterview Prep

Quick Overview

This question evaluates proficiency with set operations and similarity metrics (Jaccard), scalable algorithm design and complexity analysis for large datasets, and the ability to handle streaming updates and deterministic tie-breaking.

  • Medium
  • Pinterest
  • Coding & Algorithms
  • Data Scientist

Find list pair with maximum overlap

Company: Pinterest

Role: Data Scientist

Category: Coding & Algorithms

Difficulty: Medium

Interview Round: Onsite

You are given N labeled lists of items as a Python dict mapping list_name -> iterable of strings. Example input: {'L1': ['A','B','C'], 'L2': ['A','C','D'], 'L3': ['B','C','E'], 'L4': []}. Treat each list as a set (ignore duplicates). Write a function that returns the single unordered pair of list names with the largest overlap count (size of intersection), along with that overlap count and the Jaccard similarity (|A∩B| / |A∪B|). Tie-breakers: 1) higher Jaccard, 2) lexicographically smaller pair (by list name). For the example above, the expected result is ('L1','L2', overlap=2, jaccard=0.5). Constraints: up to 200,000 lists, 5,000,000 total items; items are case-sensitive strings; memory is limited so an O(N^2) all-pairs comparison is not acceptable. 1) Describe your algorithm at a high level (e.g., an inverted index over items) and analyze time/space complexity. 2) Implement it in Python. 3) Extend to return the top-k pairs efficiently. 4) Explain how you would adapt your solution if the input is a stream of (list_name, item) updates where lists can grow over time.

Quick Answer: This question evaluates proficiency with set operations and similarity metrics (Jaccard), scalable algorithm design and complexity analysis for large datasets, and the ability to handle streaming updates and deterministic tie-breaking.

Part 1: Inverted-Index Overlap Statistics

You are given a Python dict mapping list_name -> iterable of strings. Treat each list as a set, so duplicates inside the same list do not matter. Build the core statistics behind an inverted-index solution for overlap detection. Return: (1) membership_count = total number of distinct (list_name, item) memberships after deduplication, (2) unique_item_count = number of distinct items overall, (3) pair_updates = the total number of pair increments an inverted-index algorithm would perform, equal to the sum over items of C(f, 2) where f is the number of lists containing that item, and (4) pair_counts = all unordered list-name pairs with positive overlap, with their overlap counts, sorted lexicographically by pair.

Constraints

  • 1 <= number of lists <= 200000, but your algorithm should also handle empty input.
  • The total number of raw items across all iterables can be up to 5000000.
  • Items are case-sensitive strings.
  • Duplicates within the same list must be ignored.

Examples

Input: ({'L1': ['A', 'B', 'C'], 'L2': ['A', 'C', 'D'], 'L3': ['B', 'C', 'E'], 'L4': []},)

Expected Output: {'membership_count': 9, 'unique_item_count': 5, 'pair_updates': 5, 'pair_counts': [('L1', 'L2', 2), ('L1', 'L3', 2), ('L2', 'L3', 1)]}

Explanation: After deduplication there are 9 distinct memberships. Item frequencies are A:2, B:2, C:3, D:1, E:1, so pair_updates = 1 + 1 + 3 = 5.

Input: ({'A': ['x', 'x'], 'B': ['x', 'y', 'y'], 'C': []},)

Expected Output: {'membership_count': 3, 'unique_item_count': 2, 'pair_updates': 1, 'pair_counts': [('A', 'B', 1)]}

Explanation: Duplicates inside a list do not count. Only item x is shared.

Input: ({'A': ['a'], 'B': ['b'], 'C': ['c']},)

Expected Output: {'membership_count': 3, 'unique_item_count': 3, 'pair_updates': 0, 'pair_counts': []}

Explanation: No item appears in more than one list, so there are no pair updates.

Input: ({},)

Expected Output: {'membership_count': 0, 'unique_item_count': 0, 'pair_updates': 0, 'pair_counts': []}

Explanation: Empty input is valid.

Input: ({'Only': ['z', 'z']},)

Expected Output: {'membership_count': 1, 'unique_item_count': 1, 'pair_updates': 0, 'pair_counts': []}

Explanation: A single list cannot form any pair.

Solution

def solution(lists_by_name):
    from collections import defaultdict

    item_to_lists = defaultdict(list)
    membership_count = 0

    for name, items in lists_by_name.items():
        unique_items = set(items)
        membership_count += len(unique_items)
        for item in unique_items:
            item_to_lists[item].append(name)

    pair_counts = defaultdict(int)
    pair_updates = 0

    for names in item_to_lists.values():
        names = sorted(names)
        m = len(names)
        pair_updates += m * (m - 1) // 2
        for i in range(m):
            for j in range(i + 1, m):
                pair_counts[(names[i], names[j])] += 1

    sorted_pairs = [(a, b, count) for (a, b), count in sorted(pair_counts.items())]
    return {
        'membership_count': membership_count,
        'unique_item_count': len(item_to_lists),
        'pair_updates': pair_updates,
        'pair_counts': sorted_pairs
    }

Time complexity: O(M + sum_i C(f_i, 2)), where M is the number of distinct (list, item) memberships after deduplication and f_i is the number of lists containing item i.. Space complexity: O(M + U + P), where U is the number of unique items and P is the number of pairs with positive overlap..

Hints

  1. First deduplicate each individual list, then think item -> lists that contain it.
  2. If one item appears in f different lists, it contributes C(f, 2) pair updates.

Part 2: Maximum-Overlap List Pair

You are given a Python dict mapping list_name -> iterable of strings. Treat each list as a set, so duplicates inside the same list do not matter. Return the single unordered pair of list names with the largest overlap count, along with the overlap count and the Jaccard similarity. Tie-breakers are: (1) larger overlap count, (2) larger Jaccard similarity, (3) lexicographically smaller pair. If there are fewer than 2 lists, return None. Define the Jaccard similarity of two empty sets as 1.0.

Constraints

  • 1 <= number of lists <= 200000, but your function should also handle 0 or 1 list.
  • The total number of raw items across all iterables can be up to 5000000.
  • Items are case-sensitive strings.
  • An O(N^2) all-pairs comparison is not acceptable for large inputs.

Examples

Input: ({'L1': ['A', 'B', 'C'], 'L2': ['A', 'C', 'D'], 'L3': ['B', 'C', 'E'], 'L4': []},)

Expected Output: ('L1', 'L2', 2, 0.5)

Explanation: L1-L2 and L1-L3 both have overlap 2 and Jaccard 0.5, so lexicographic order picks ('L1', 'L2').

Input: ({'A': ['x', 'y'], 'B': ['x'], 'C': ['y', 'z']},)

Expected Output: ('A', 'B', 1, 0.5)

Explanation: A-B and A-C both overlap by 1, but A-B has the larger Jaccard score.

Input: ({'A': ['x'], 'B': ['x'], 'C': ['y'], 'D': ['y']},)

Expected Output: ('A', 'B', 1, 1.0)

Explanation: A-B and C-D tie on both overlap and Jaccard, so the lexicographically smaller pair wins.

Input: ({'A': [], 'B': [], 'C': ['x']},)

Expected Output: ('A', 'B', 0, 1.0)

Explanation: No pair has positive overlap. Two empty lists have Jaccard 1.0, which beats all other zero-overlap pairs.

Input: ({'Solo': ['a']},)

Expected Output: None

Explanation: At least two lists are required.

Solution

def solution(lists_by_name):
    from collections import defaultdict

    names = sorted(lists_by_name)
    if len(names) < 2:
        return None

    item_to_lists = defaultdict(list)
    sizes = {}
    empty_names = []

    for name in names:
        unique_items = set(lists_by_name[name])
        sizes[name] = len(unique_items)
        if not unique_items:
            empty_names.append(name)
        for item in unique_items:
            item_to_lists[item].append(name)

    pair_counts = defaultdict(int)
    for holders in item_to_lists.values():
        holders.sort()
        for i in range(len(holders)):
            a = holders[i]
            for j in range(i + 1, len(holders)):
                b = holders[j]
                pair_counts[(a, b)] += 1

    best = None  # (overlap, jaccard, a, b)
    for (a, b), inter in pair_counts.items():
        union = sizes[a] + sizes[b] - inter
        jaccard = 1.0 if union == 0 else inter / union
        if (
            best is None or
            inter > best[0] or
            (inter == best[0] and (jaccard > best[1] or (jaccard == best[1] and (a, b) < (best[2], best[3]))))
        ):
            best = (inter, jaccard, a, b)

    if best is not None:
        return (best[2], best[3], best[0], best[1])

    if len(empty_names) >= 2:
        return (empty_names[0], empty_names[1], 0, 1.0)

    return (names[0], names[1], 0, 0.0)

Time complexity: O(M + sum_i C(f_i, 2) + P), where M is the number of distinct memberships after deduplication, f_i is the number of lists containing item i, and P is the number of distinct pairs with positive overlap.. Space complexity: O(M + U + P), where U is the number of unique items..

Hints

  1. Use an inverted index item -> list names to count only pairs that actually share at least one item.
  2. If no pair shares an item, handle the zero-overlap case separately.

Part 3: Top-K Overlapping List Pairs

You are given a Python dict mapping list_name -> iterable of strings and an integer k. Treat each list as a set, so duplicates inside the same list do not matter. Return the top k unordered pairs with positive overlap only, ordered by: (1) larger overlap count, (2) larger Jaccard similarity, (3) lexicographically smaller pair. If fewer than k pairs have positive overlap, return all of them. If k <= 0, return an empty list.

Constraints

  • 1 <= number of lists <= 200000, but your solution should also handle empty input.
  • The total number of raw items across all iterables can be up to 5000000.
  • Items are case-sensitive strings.
  • Only pairs with overlap_count > 0 should appear in the result.

Examples

Input: ({'L1': ['A', 'B', 'C'], 'L2': ['A', 'C', 'D'], 'L3': ['B', 'C', 'E'], 'L4': []}, 2)

Expected Output: [('L1', 'L2', 2, 0.5), ('L1', 'L3', 2, 0.5)]

Explanation: The two best positive-overlap pairs both have overlap 2 and Jaccard 0.5.

Input: ({'A': ['x', 'x', 'y'], 'B': ['x'], 'C': ['x', 'y'], 'D': ['z']}, 3)

Expected Output: [('A', 'C', 2, 1.0), ('A', 'B', 1, 0.5), ('B', 'C', 1, 0.5)]

Explanation: Duplicates are ignored. A-C is the only pair with overlap 2.

Input: ({'A': ['m'], 'B': ['m'], 'C': ['n']}, 5)

Expected Output: [('A', 'B', 1, 1.0)]

Explanation: There is only one positive-overlap pair.

Input: ({'A': ['a'], 'B': ['b']}, 2)

Expected Output: []

Explanation: No pair shares an item.

Input: ({'A': ['a'], 'B': ['a']}, 0)

Expected Output: []

Explanation: k <= 0 always returns an empty list.

Solution

def solution(lists_by_name, k):
    from collections import defaultdict
    import heapq

    if k <= 0:
        return []

    item_to_lists = defaultdict(list)
    sizes = {}

    for name, items in lists_by_name.items():
        unique_items = set(items)
        sizes[name] = len(unique_items)
        for item in unique_items:
            item_to_lists[item].append(name)

    pair_counts = defaultdict(int)
    for holders in item_to_lists.values():
        holders.sort()
        for i in range(len(holders)):
            a = holders[i]
            for j in range(i + 1, len(holders)):
                b = holders[j]
                pair_counts[(a, b)] += 1

    if not pair_counts:
        return []

    def ranked_candidates():
        for (a, b), inter in pair_counts.items():
            union = sizes[a] + sizes[b] - inter
            jaccard = inter / union
            yield (-inter, -jaccard, a, b, inter, jaccard)

    top = heapq.nsmallest(k, ranked_candidates())
    return [(a, b, inter, jaccard) for _, _, a, b, inter, jaccard in top]

Time complexity: O(M + sum_i C(f_i, 2) + P log k), where M is the number of distinct memberships after deduplication, f_i is the number of lists containing item i, and P is the number of distinct pairs with positive overlap.. Space complexity: O(M + U + P + k), where U is the number of unique items..

Hints

  1. First compute positive-overlap pair counts using an inverted index.
  2. A heap of size k can avoid fully sorting every candidate pair.

Part 4: Best Pair Under Streaming List Updates

Initially, no lists exist. You receive a stream of updates of the form (list_name, item), meaning that item is added to that list. Lists only grow over time. If the same (list_name, item) appears again, ignore it. After each update, return the current best unordered pair of existing list names ranked by: (1) larger overlap count, (2) larger Jaccard similarity, (3) lexicographically smaller pair. If fewer than 2 lists exist after an update, return None for that step. Because lists are created only by updates, every existing list is non-empty.

Constraints

  • 1 <= number of updates <= 300000, but your solution should also handle empty input.
  • Each update is a pair (list_name, item).
  • A duplicate update for an already-present (list_name, item) must not change the state.
  • A fully recomputed all-pairs scan after every update is too slow.

Examples

Input: ([('L1', 'A'), ('L2', 'B'), ('L2', 'B'), ('L1', 'B'), ('L2', 'A')],)

Expected Output: [None, ('L1', 'L2', 0, 0.0), ('L1', 'L2', 0, 0.0), ('L1', 'L2', 1, 0.5), ('L1', 'L2', 2, 1.0)]

Explanation: The duplicate update ('L2', 'B') does nothing. The best pair evolves as overlap grows.

Input: ([('L1', 'A'), ('L2', 'A'), ('L3', 'B'), ('L4', 'B'), ('L1', 'X'), ('L2', 'X')],)

Expected Output: [None, ('L1', 'L2', 1, 1.0), ('L1', 'L2', 1, 1.0), ('L1', 'L2', 1, 1.0), ('L3', 'L4', 1, 1.0), ('L1', 'L2', 2, 1.0)]

Explanation: Adding a unique item to L1 lowers the Jaccard score of L1-L2, so L3-L4 temporarily becomes the best pair.

Input: ([('B', 'x'), ('A', 'y'), ('C', 'z')],)

Expected Output: [None, ('A', 'B', 0, 0.0), ('A', 'B', 0, 0.0)]

Explanation: With no overlaps, the lexicographically smallest pair of existing lists wins.

Input: ([('L1', 'A'), ('L1', 'A')],)

Expected Output: [None, None]

Explanation: A duplicate update on the only existing list changes nothing.

Solution

def solution(updates):
    from collections import defaultdict
    import heapq

    list_items = {}
    list_sizes = defaultdict(int)
    item_lists = defaultdict(set)
    pair_overlap = {}
    neighbors = defaultdict(set)
    bucket_heaps = defaultdict(list)
    max_overlap = 0

    seen_lists = set()
    smallest_two = []

    def add_name(name):
        if name in seen_lists:
            return
        seen_lists.add(name)
        if len(smallest_two) < 2:
            smallest_two.append(name)
            smallest_two.sort()
        else:
            if name < smallest_two[0]:
                smallest_two[1] = smallest_two[0]
                smallest_two[0] = name
            elif name < smallest_two[1]:
                smallest_two[1] = name

    def pair_key(a, b):
        return (a, b) if a < b else (b, a)

    def push_current_entry(a, b):
        inter = pair_overlap[(a, b)]
        union = list_sizes[a] + list_sizes[b] - inter
        jaccard = inter / union
        heapq.heappush(bucket_heaps[inter], (-jaccard, a, b, list_sizes[a], list_sizes[b]))

    def current_best():
        nonlocal max_overlap
        while max_overlap > 0:
            heap = bucket_heaps[max_overlap]
            while heap:
                neg_j, a, b, sa, sb = heap[0]
                if pair_overlap.get((a, b), 0) != max_overlap or list_sizes[a] != sa or list_sizes[b] != sb:
                    heapq.heappop(heap)
                else:
                    inter = max_overlap
                    union = list_sizes[a] + list_sizes[b] - inter
                    return (a, b, inter, inter / union)
            max_overlap -= 1

        if len(seen_lists) < 2:
            return None
        return (smallest_two[0], smallest_two[1], 0, 0.0)

    answers = []

    for name, item in updates:
        if name not in list_items:
            list_items[name] = set()
            add_name(name)

        if item in list_items[name]:
            answers.append(current_best())
            continue

        shared_with = set(item_lists[item])

        list_items[name].add(item)
        list_sizes[name] += 1

        for other in list(neighbors[name]):
            if other not in shared_with:
                a, b = pair_key(name, other)
                push_current_entry(a, b)

        for other in shared_with:
            a, b = pair_key(name, other)
            old_overlap = pair_overlap.get((a, b), 0)
            pair_overlap[(a, b)] = old_overlap + 1
            neighbors[name].add(other)
            neighbors[other].add(name)
            if pair_overlap[(a, b)] > max_overlap:
                max_overlap = pair_overlap[(a, b)]
            push_current_entry(a, b)

        item_lists[item].add(name)
        answers.append(current_best())

    return answers

Time complexity: Amortized O((s + d) log H) per distinct update, where s is the number of other lists already containing the updated item, d is the number of current positive-overlap neighbors of the updated list, and H is the number of heap entries currently stored. Duplicate updates are O(1) amortized plus lazy-heap cleanup during queries.. Space complexity: O(M + P + R), where M is the number of distinct (list, item) memberships seen so far, P is the number of pairs with positive overlap, and R is the number of lazy heap entries pushed over time..

Hints

  1. An update only changes overlap counts for lists that already had the same item, and it only changes Jaccard values for positive-overlap neighbors of the updated list.
  2. Use lazy heap entries: push new scores when something changes, and discard stale entries when they reach the top.
Last updated: May 19, 2026

Loading coding console...

PracHub

Master your tech interviews with 8,500+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.

Related Coding Questions

  • Maximize Boxes Stored Through One Entrance - Pinterest (medium)
  • Solve Multiple Coding Interview Problems - Pinterest (medium)
  • Implement a Sparse Matrix Class - Pinterest (medium)
  • Assign Pins to Shortest Columns - Pinterest (medium)
  • Design Hierarchical Permission Checks - Pinterest (medium)