Greedy Longest-Match Tokenizer for an LLM Data Pipeline
Company: xAI
Role: Machine Learning Engineer
Category: Coding & Algorithms
Difficulty: hard
Interview Round: Onsite
You are building a text pre-processing step for a large-language-model training pipeline. Before raw text can be fed into the model, it must be split into tokens drawn from a fixed vocabulary. Your job is to implement the tokenizer.
You are given:
- `vocab`: a list of distinct, non-empty token strings (the vocabulary).
- `text`: the input string to tokenize.
Tokenize `text` using **greedy longest-match** (maximal munch), scanning left to right:
1. At the current position `i`, find the **longest** token in `vocab` that is a prefix of `text[i:]`.
2. If such a token exists, emit it and advance `i` by its length.
3. If **no** vocabulary token matches at position `i`, emit the single character `text[i]` as its own fallback token and advance `i` by 1.
4. Repeat until the entire string is consumed.
Return the resulting list of tokens, in order.
Note that the algorithm is strictly greedy and never backtracks: it always takes the longest match at the current position, even if a shorter match would allow more of the remaining string to be covered by vocabulary tokens later.
**Example 1**
```
vocab = ["low", "lower", "new", "est", "wid", "er"]
text = "lowernewestwidget"
Output: ["lower", "new", "est", "wid", "g", "e", "t"]
```
At position 0 both `"low"` and `"lower"` match; the longer `"lower"` is chosen. After `"wid"`, the remaining text is `"get"`: no vocabulary token is a prefix of `"get"`, `"et"`, or `"t"`, so each character is emitted as a fallback token.
**Example 2**
```
vocab = ["ab", "abc", "cd"]
text = "abcd"
Output: ["abc", "d"]
```
Greedy longest-match takes `"abc"` at position 0, leaving `"d"` unmatched — even though the alternative segmentation `["ab", "cd"]` would have covered the whole string with vocabulary tokens. The tokenizer must not backtrack.
**Example 3**
```
vocab = ["hello"]
text = ""
Output: []
```
**Constraints**
- `0 <= len(text) <= 10^6`
- `1 <= len(vocab) <= 10^5`
- `1 <= len(vocab[i]) <= 100`; the sum of all token lengths does not exceed `10^6`
- All vocabulary tokens are distinct
- `text` and all tokens consist of printable ASCII characters (no whitespace inside vocabulary tokens is guaranteed only if the input provides none; treat characters generically)
Your solution should be efficient enough to handle the largest inputs: a scan that re-checks the whole vocabulary at every position is too slow. Aim for a data structure that lets you find the longest matching token starting at a given position in time proportional to the match length, such that the total running time is roughly `O(len(text) * L + total vocabulary size)`, where `L` is the maximum token length.
Quick Answer: This question evaluates understanding of string-processing and greedy maximal-munch tokenization, as well as competency in designing efficient prefix-matching data structures and handling large-scale text preprocessing for LLM pipelines.
You are building a text pre-processing step for a large-language-model training pipeline. Before raw text can be fed into the model, it must be split into tokens drawn from a fixed vocabulary. Implement the tokenizer.
You are given:
- `vocab`: a list of distinct, non-empty token strings (the vocabulary).
- `text`: the input string to tokenize.
Tokenize `text` using **greedy longest-match** (maximal munch), scanning left to right:
1. At the current position `i`, find the **longest** token in `vocab` that is a prefix of `text[i:]`.
2. If such a token exists, emit it and advance `i` by its length.
3. If **no** vocabulary token matches at position `i`, emit the single character `text[i]` as its own fallback token and advance `i` by 1.
4. Repeat until the entire string is consumed.
Return the resulting list of tokens, in order. The algorithm is strictly greedy and never backtracks: it always takes the longest match at the current position, even if a shorter match would allow more of the remaining string to be covered by vocabulary tokens later.
**Example 1**
```
vocab = ["low", "lower", "new", "est", "wid", "er"]
text = "lowernewestwidget"
Output: ["lower", "new", "est", "wid", "g", "e", "t"]
```
At position 0 both "low" and "lower" match; the longer "lower" is chosen. After "wid", the remaining text is "get": no token is a prefix of "get", "et", or "t", so each character is emitted as a fallback token.
**Example 2**
```
vocab = ["ab", "abc", "cd"]
text = "abcd"
Output: ["abc", "d"]
```
Greedy longest-match takes "abc" at position 0, leaving "d" unmatched — even though ["ab", "cd"] would cover the whole string. The tokenizer must not backtrack.
**Example 3**
```
vocab = ["hello"]
text = ""
Output: []
```
Aim for a data structure (a trie) that finds the longest matching token starting at a given position in time proportional to the match length, for total running time roughly O(len(text) * L + total vocabulary size), where L is the maximum token length.
Constraints
- 0 <= len(text) <= 10^6
- 1 <= len(vocab) <= 10^5
- 1 <= len(vocab[i]) <= 100; the sum of all token lengths does not exceed 10^6
- All vocabulary tokens are distinct
- text and all tokens consist of printable ASCII characters
Examples
Input: (["low", "lower", "new", "est", "wid", "er"], "lowernewestwidget")
Expected Output: ["lower", "new", "est", "wid", "g", "e", "t"]
Explanation: At position 0, both 'low' and 'lower' are prefixes; greedy longest-match picks 'lower'. 'new' and 'est' and 'wid' match next. The remaining 'get' has no matching prefix at any position, so 'g', 'e', 't' are emitted as single-character fallbacks.
Input: (["ab", "abc", "cd"], "abcd")
Expected Output: ["abc", "d"]
Explanation: 'abc' is the longest prefix match at position 0, consuming through index 3. That leaves 'd', which matches no token, so it is a fallback. The tokenizer does not backtrack to the alternative ['ab', 'cd'].
Input: (["hello"], "")
Expected Output: []
Explanation: Empty input yields an empty token list.
Input: (["ab"], "xyz")
Expected Output: ["x", "y", "z"]
Explanation: No vocabulary token is a prefix of 'xyz' at any position, so every character is emitted as its own fallback token.
Input: (["a"], "aaa")
Expected Output: ["a", "a", "a"]
Explanation: The single-character token 'a' matches at each position, producing three 'a' tokens.
Input: (["er"], "lowerer")
Expected Output: ["l", "o", "w", "er", "er"]
Explanation: 'l', 'o', 'w' fall back to single characters since no token matches there; then 'er' matches twice for the trailing 'erer'.
Input: (["aa", "aaa", "aaaa"], "aaaaa")
Expected Output: ["aaaa", "a"]
Explanation: Overlapping-length tokens: the longest prefix at position 0 is 'aaaa', leaving a single 'a' which matches no longer token but note only 'aa','aaa','aaaa' exist — the trailing single 'a' has no vocabulary match, so it is a fallback token.
Hints
- Build a trie over the vocabulary once. From each starting position, walk the trie character by character and remember the end index of the last position that completed a full vocabulary token — that is the longest match.
- If no token completes before the walk stops, emit the single character text[i] as a fallback and advance by 1. Never backtrack: always take the longest match at the current position, then continue from where it ends.
- Re-scanning the entire vocabulary at every position is O(len(text) * |vocab| * L) and too slow; the trie collapses shared prefixes so each position costs only O(L).