How do I practice coding and algorithm questions?

Use PracHub's coding console to write, test, and debug your solutions in Python or JavaScript. View hints, test against sample inputs, and compare with official solutions.

What difficulty level is this coding question?

This is a Medium difficulty Coding & Algorithms question, commonly asked during Technical Screen rounds at Harvey AI.

What role is this question designed for?

This question is commonly asked for Software Engineer candidates at Harvey AI during technical interviews.

Handle multi-source string matching and tagging

Company: Harvey AI

Role: Software Engineer

Category: Coding & Algorithms

Difficulty: Medium

Interview Round: Technical Screen

##### Question Given an LLM output string and a list of source strings, design an algorithm to count how many times each source appears in the output. Extend the solution to return the output string where every occurrence of a source string is wrapped in <tag></tag> while correctly handling overlapping matches. Modify the algorithm so that each tag also includes annotations of the indices of the sources that matched (e.g., <tag>text</tag>[1][3]).

Quick Answer: This question evaluates skills in string matching, pattern recognition, overlap handling, text annotation, and counting occurrences across multiple source strings.

Given a string output and a list of source strings sources, return two results: (1) counts: a list where counts[i] is the number of occurrences (allowing overlaps) of sources[i] in output; (2) tagged: a new string constructed by wrapping each maximal contiguous region of output that is covered by at least one match with <tag> and </tag>. After each such tag, append bracketed source indices in ascending order (e.g., [0][3]) indicating all distinct i for which at least one occurrence of sources[i] overlaps that region. Indices are 0-based. Overlapping or touching matches are merged into a single tagged region. Matching is exact and case-sensitive.

Constraints

1 <= len(output) <= 200000
1 <= len(sources) <= 5000
1 <= sum(len(s) for s in sources) <= 200000
All sources[i] are non-empty; duplicates allowed
Matching is exact and case-sensitive
Indices in annotations are 0-based, unique per region, and sorted ascending

Solution

from typing import List, Tuple
from collections import deque
import heapq

def multi_source_tagging(output: str, sources: List[str]) -> Tuple[List[int], str]:
    n = len(output)
    m = len(sources)
    # Build Aho-Corasick automaton
    nexts: List[dict] = []
    fail: List[int] = []
    out: List[List[int]] = []

    def new_node() -> int:
        nexts.append({})
        fail.append(0)
        out.append([])
        return len(nexts) - 1

    new_node()  # root = 0
    lens = [len(p) for p in sources]

    for idx, pat in enumerate(sources):
        # Sources are guaranteed non-empty by constraints
        node = 0
        for ch in pat:
            if ch not in nexts[node]:
                nexts[node][ch] = new_node()
            node = nexts[node][ch]
        out[node].append(idx)

    # Build failure links and output links
    q = deque()
    for ch, v in nexts[0].items():
        fail[v] = 0
        q.append(v)

    while q:
        u = q.popleft()
        for ch, v in nexts[u].items():
            f = fail[u]
            while f and ch not in nexts[f]:
                f = fail[f]
            fail[v] = nexts[f][ch] if ch in nexts[f] else 0
            # inherit outputs from failure link
            if out[fail[v]]:
                out[v].extend(out[fail[v]])
            q.append(v)

    # Traverse output string to collect matches and counts
    counts = [0] * m
    coverage_diff = [0] * (n + 1)
    matches: List[Tuple[int, int, int]] = []  # (start, end, source_index)

    state = 0
    for i, ch in enumerate(output):
        while state and ch not in nexts[state]:
            state = fail[state]
        if ch in nexts[state]:
            state = nexts[state][ch]
        else:
            state = 0
        if out[state]:
            for idx in out[state]:
                L = i + 1 - lens[idx]
                R = i + 1
                # L >= 0 always holds; lens[idx] > 0 by constraints
                counts[idx] += 1
                matches.append((L, R, idx))
                coverage_diff[L] += 1
                coverage_diff[R] -= 1

    # Build maximal covered regions (merge overlapping or touching via coverage)
    segments: List[Tuple[int, int]] = []
    cover = 0
    seg_start = None
    for i in range(n):
        cover += coverage_diff[i]
        if cover > 0 and seg_start is None:
            seg_start = i
        elif cover == 0 and seg_start is not None:
            segments.append((seg_start, i))
            seg_start = None
    if seg_start is not None:
        segments.append((seg_start, n))

    if not segments:
        return counts, output

    # Annotate each segment with indices of sources that have at least one match overlapping the segment
    matches.sort(key=lambda x: x[0])  # sort by start
    M = len(matches)
    p = 0
    heap: List[Tuple[int, int]] = []  # (end, match_id)
    active_counts = {}  # source_idx -> count of active overlapping matches

    res_parts: List[str] = []
    prev = 0

    for L, R in segments:
        # Add all matches starting before R
        while p < M and matches[p][0] < R:
            _, end, idx = matches[p]
            heapq.heappush(heap, (end, p))
            active_counts[idx] = active_counts.get(idx, 0) + 1
            p += 1
        # Remove all matches that end at or before L
        while heap and heap[0][0] <= L:
            _, mid = heapq.heappop(heap)
            _, _, idx = matches[mid]
            cnt = active_counts.get(idx, 0)
            if cnt <= 1:
                active_counts.pop(idx, None)
            else:
                active_counts[idx] = cnt - 1

        # Emit segment with tag and annotations
        res_parts.append(output[prev:L])
        res_parts.append("<tag>")
        res_parts.append(output[L:R])
        res_parts.append("</tag>")
        if active_counts:
            for k in sorted(active_counts.keys()):
                res_parts.append(f"[{k}]")
        prev = R

    res_parts.append(output[prev:])
    tagged = ''.join(res_parts)

    return counts, tagged

Explanation

Build an Aho-Corasick automaton over sources to find all matches in a single left-to-right scan of output. For counting, increment counts[i] for each match of sources[i], allowing overlaps. For tagging, record each match interval [L,R) and build a difference array over character indices to compute a coverage count; coverage > 0 at position i means output[i] is covered by at least one match. Scan this coverage to produce maximal covered segments, merging overlapping or touching matches into one region. To annotate each region with source indices, sweep through the sorted matches while iterating over segments, maintaining a min-heap by match end and a multiset (via counts) of active matches. After adding matches with start < segment_end and removing matches with end <= segment_start, the keys of the active map are exactly the indices of sources whose matches overlap the segment. Emit each region as <tag>text</tag> followed by the sorted distinct indices in [i] form.

Time complexity: O(n + sum_len + M log M), where n = len(output), sum_len = sum(len(s) for s in sources), and M is the number of matches. Space complexity: O(sum_len + M + n).

Hints

Use a multi-pattern automaton (Aho-Corasick) to find all matches efficiently and count overlaps.
Build a coverage array via a difference array to merge overlapping or touching matches into maximal regions.
To annotate each region with source indices, sweep matches alongside regions and track active matches that overlap the region.

Quick Overview

This question evaluates skills in string matching, pattern recognition, overlap handling, text annotation, and counting occurrences across multiple source strings.