Implement Longest-Match Text Replacement
Company: Google
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: hard
Interview Round: Technical Screen
You are given a text string and a dictionary of replacement entries. Each dictionary entry has the form `token:id`, where `token` is a string pattern and `id` is the value that should be emitted when that token is matched.
Implement a function that scans the text from left to right and produces an output sequence using greedy longest-prefix matching:
1. At the current text position, find all dictionary tokens that match the substring starting at that position.
2. If at least one token matches, choose the longest matching token, append its corresponding `id` to the output, and advance by the length of that token.
3. If no token matches, append the original character at the current position to the output and advance by one character.
4. Continue until the entire text has been processed.
Example:
```text
text = "abcdxyz"
dictionary = {
"ab": "1",
"abc": "2",
"xy": "3"
}
```
Output:
```text
["2", "d", "3", "z"]
```
Explanation: At index `0`, both `"ab"` and `"abc"` match, so choose the longest token `"abc"` and output `"2"`. Then `"d"` has no match, so output it literally. Then `"xy"` matches and outputs `"3"`. Finally, `"z"` is output literally.
Follow-up: How would you adapt your design if the matching priority changed from longest match to a configurable priority rule, such as dictionary insertion order, highest score, or a custom comparator?
Quick Answer: This question evaluates string processing, pattern matching, greedy algorithm reasoning, and data structure design for efficient longest-prefix token replacement, and is categorized under Coding & Algorithms.
Part 1: Longest-Match Text Replacement
Given a text string and a dictionary that maps token strings to replacement id strings, scan the text from left to right.
At each position:
1. Find every token that matches the substring starting at that position.
2. If at least one token matches, choose the longest matching token, append its replacement id to the output, and advance by the token length.
3. If no token matches, append the original character at that position as a one-character string and advance by 1.
Return the final output as a list of strings.
Constraints
- 0 <= len(text) <= 100000
- 0 <= len(dictionary) <= 10000
- Each token in dictionary is non-empty.
- The sum of all token lengths is at most 100000.
- Matching is case-sensitive.
Examples
Input: ("abcdxyz", {"ab": "1", "abc": "2", "xy": "3"})
Expected Output: ["2", "d", "3", "z"]
Explanation: Both 'ab' and 'abc' match at index 0, so the longer token 'abc' is chosen.
Input: ("aaaaa", {"a": "X", "aa": "Y"})
Expected Output: ["Y", "Y", "X"]
Explanation: Greedy longest matching picks 'aa' twice, then the final 'a'.
Input: ("cat", {})
Expected Output: ["c", "a", "t"]
Explanation: With an empty dictionary, every character is emitted literally.
Input: ("", {"a": "1"})
Expected Output: []
Explanation: An empty text produces an empty output.
Hints
- A trie lets you test all token prefixes starting at one position without checking every token from scratch.
- While walking forward from a position, remember the last trie node that represented a complete token.
Part 2: Configurable-Priority Text Replacement
Implement the same left-to-right text replacement process, but the match-selection rule is configurable.
You are given:
- a text string
- a list of replacement entries in insertion order
- a priority rule
Each replacement entry is a tuple: (token, replacement_id, score).
At each text position, find all entries whose token matches the substring starting there. If one or more entries match, choose exactly one using the selected priority rule:
- 'longest': choose the longest matching token; if tied, choose the earlier entry in the list
- 'insertion': choose the earliest matching entry in the input list
- 'score': choose the matching entry with the highest score; if tied, choose the longer token; if still tied, choose the earlier entry in the list
Append the chosen replacement_id to the output and advance by the chosen token length. If nothing matches, append the original character and advance by 1.
Return the final output as a list of strings.
Constraints
- 0 <= len(text) <= 100000
- 0 <= len(entries) <= 10000
- Each token is non-empty.
- Tokens in entries are unique.
- The sum of all token lengths is at most 100000.
- priority is one of 'longest', 'insertion', or 'score'.
- Matching is case-sensitive.
Examples
Input: ("abcdxyz", [("ab", "1", 5), ("abc", "2", 1), ("xy", "3", 7)], "longest")
Expected Output: ["2", "d", "3", "z"]
Explanation: Under 'longest', 'abc' beats 'ab' at index 0.
Input: ("abcd", [("ab", "X", 0), ("abc", "Y", 10)], "insertion")
Expected Output: ["X", "c", "d"]
Explanation: Under 'insertion', the earlier entry 'ab' wins even though 'abc' is longer.
Input: ("abxyz", [("ab", "X", 10), ("abx", "Y", 3), ("xy", "Q", 5)], "score")
Expected Output: ["X", "Q", "z"]
Explanation: At index 0, 'ab' has a higher score than 'abx', so 'X' is emitted first.
Input: ("abcd", [("ab", "X", 5), ("abc", "Y", 5)], "score")
Expected Output: ["Y", "d"]
Explanation: Scores tie, so the longer token 'abc' is chosen.
Input: ("hi", [], "longest")
Expected Output: ["h", "i"]
Explanation: With no entries, all characters are emitted literally.
Hints
- A trie still works well here: every possible match at one position appears along one path in the trie.
- Write a helper that decides whether one matched entry is better than another under the chosen priority rule.