Implement a longest-match tokenizer
Company: Anthropic
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: medium
Interview Round: Technical Screen
Quick Answer: This question evaluates string-processing and algorithmic implementation skills for greedy longest-match tokenization, including handling runs of unmatched characters and reducing unnecessary comparisons when the vocabulary is small.
Part 1: Implement greedy longest-match tokenizer
Constraints
- 0 <= len(text) <= 10^4
- 0 <= len(vocab) <= 10^3
- All vocabulary keys are unique, non-empty strings
- Token IDs are integers
Examples
Input: ({'a': 1, 'ab': 2, 'abc': 3, 'bc': 4}, 'abcabx')
Expected Output: [3, 2, -1]
Explanation: At index 0, 'abc' is the longest match. Then 'ab' matches at index 3. 'x' is unmatched.
Input: ({'cat': 10, 'c': 1, 'at': 2}, 'catat')
Expected Output: [10, 2]
Explanation: 'cat' is chosen over 'c' at the start because it is longer. Then 'at' matches.
Input: ({'ab': 1, 'aba': 2, 'ba': 3}, 'ababa')
Expected Output: [2, 3]
Explanation: 'aba' matches first, then 'ba' matches the remaining suffix.
Input: ({'a': 1}, '')
Expected Output: []
Explanation: Empty text produces no tokens.
Input: ({}, 'hi')
Expected Output: [-1, -1]
Explanation: With an empty vocabulary, each character is unmatched.
Hints
- If you check vocabulary entries in descending order of length, the first match you find at a position is the greedy longest match.
- When nothing matches, append -1 and move forward by exactly one character.
Part 2: Collapse consecutive unmatched characters into one -1
Constraints
- 0 <= len(text) <= 10^4
- 0 <= len(vocab) <= 10^3
- All vocabulary keys are unique, non-empty strings
- Token IDs are integers
Examples
Input: ({'ab': 1, 'c': 2}, 'abxxczzz')
Expected Output: [1, -1, 2, -1]
Explanation: 'ab' matches, then 'xx' is one unmatched run, then 'c' matches, then 'zzz' is another unmatched run.
Input: ({'a': 1, 'aa': 2}, 'bbaaaac')
Expected Output: [-1, 2, 2, -1]
Explanation: 'bb' becomes one -1, then 'aa' and 'aa' match greedily, and the final 'c' becomes one -1.
Input: ({'xyz': 7}, 'abc')
Expected Output: [-1]
Explanation: All characters are unmatched, so they collapse into one -1.
Input: ({'a': 1}, '')
Expected Output: []
Explanation: Empty text produces no tokens.
Input: ({}, 'aab')
Expected Output: [-1]
Explanation: With an empty vocabulary, the whole text is one unmatched run.
Hints
- Keep a flag that remembers whether you are currently inside an unmatched run.
- Any successful token match should end an unmatched run.
Part 3: Optimize longest-match tokenization for a small vocabulary
Constraints
- 0 <= len(text) <= 10^4
- 0 <= len(vocab) <= 10^3
- All vocabulary keys are unique, non-empty strings
- Do not use a trie
- Token IDs are integers
Examples
Input: ({'the': 1, 'th': 2, 'he': 3, 'a': 4}, 'thea!')
Expected Output: [1, 4, -1]
Explanation: At the start, only 'the' and 'th' need to be checked because text begins with 't'. 'the' is the greedy match.
Input: ({'x': 5, 'xy': 6, 'xyz': 7, 'ab': 1}, 'xyzxaby')
Expected Output: [7, 5, 1, -1]
Explanation: 'xyz' matches first, then 'x', then 'ab', and finally 'y' is unmatched.
Input: ({'aa': 1, 'ab': 2, 'b': 3}, 'abbaa')
Expected Output: [2, 3, 1]
Explanation: 'ab' matches first, then 'b', then 'aa'. Grouping by first character avoids checking impossible candidates.
Input: ({'a': 1}, '')
Expected Output: []
Explanation: Empty text produces no tokens.
Input: ({}, 'ab')
Expected Output: [-1, -1]
Explanation: With no vocabulary entries, every character is unmatched.
Hints
- A vocabulary word can only match at position i if its first character equals text[i].
- Sort each first-character group by descending length so the first successful match is still the greedy one.