Implement a longest-match tokenizer
Company: Anthropic
Role: Software Engineer
Category: Coding & Algorithms
Difficulty: medium
Interview Round: Technical Screen
Quick Answer: This question evaluates string-processing and algorithmic implementation skills: greedy longest-match tokenization, handling runs of unmatched characters, and reducing unnecessary comparisons when the vocabulary is small.
Part 1: Implement a greedy longest-match tokenizer
Constraints
- 0 <= len(text) <= 10^4
- 0 <= len(vocab) <= 10^3
- All vocabulary keys are unique, non-empty strings
- Token IDs are integers
Examples
Input: ({'a': 1, 'ab': 2, 'abc': 3, 'bc': 4}, 'abcabx')
Expected Output: [3, 2, -1]
Explanation: At index 0, 'abc' is the longest match. Then 'ab' matches at index 3. The trailing 'x' has no match, so -1 is emitted.
Input: ({'cat': 10, 'c': 1, 'at': 2}, 'catat')
Expected Output: [10, 2]
Explanation: 'cat' is chosen over 'c' at the start because it is longer. Then 'at' matches.
Input: ({'ab': 1, 'aba': 2, 'ba': 3}, 'ababa')
Expected Output: [2, 3]
Explanation: 'aba' matches first, then 'ba' matches the remaining suffix.
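A minimal Python sketch consistent with the examples above. Two assumptions worth flagging: consecutive unmatched characters collapse into a single -1 (the examples only show a one-character unmatched run, so per-character -1s would also fit), and the scan window is bounded by the longest vocabulary key, which is the "reducing unnecessary comparisons" idea from the quick answer.

```python
def tokenize(vocab: dict[str, int], text: str) -> list[int]:
    """Greedy longest-match tokenizer.

    At each position, try the longest candidate substring first and
    shrink toward length 1. Assumption: a run of consecutive unmatched
    characters emits a single -1.
    """
    # Longest key bounds how far ahead we ever need to look.
    max_len = max(map(len, vocab), default=0)
    tokens: list[int] = []
    i = 0
    in_miss = False  # True while inside an unmatched run
    while i < len(text):
        # Try windows from longest to shortest at position i.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                tokens.append(vocab[piece])
                i += length
                in_miss = False
                break
        else:
            # No key matches at i: start (or continue) an unmatched run.
            if not in_miss:
                tokens.append(-1)
                in_miss = True
            i += 1
    return tokens
```

This does O(max_key_len) substring lookups per position, which is fine at these constraints (text up to 10^4, vocab up to 10^3). For larger vocabularies, a trie walked forward from each position would avoid the repeated slicing and hashing.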