Greedy Longest-Match Tokenizer for an LLM Data Pipeline
Company: xAI
Role: Machine Learning Engineer
Category: Coding & Algorithms
Difficulty: hard
Interview Round: Onsite
Company: xAI
Role: Machine Learning Engineer
Category: Coding & Algorithms
Difficulty: hard
Interview Round: Onsite
You are building a text pre-processing step for a large-language-model training pipeline. Before raw text can be fed into the model, it must be split into tokens drawn from a fixed vocabulary. Your job is to implement the tokenizer.
You are given:
vocab
: a list of distinct, non-empty token strings (the vocabulary).
text
: the input string to tokenize.
Tokenize text using greedy longest-match (maximal munch), scanning left to right:
i
, find the
longest
token in
vocab
that is a prefix of
text[i:]
.
i
by its length.
i
, emit the single character
text[i]
as its own fallback token and advance
i
by 1.
Return the resulting list of tokens, in order.
Note that the algorithm is strictly greedy and never backtracks: it always takes the longest match at the current position, even if a shorter match would allow more of the remaining string to be covered by vocabulary tokens later.
Example 1
vocab = ["low", "lower", "new", "est", "wid", "er"]
text = "lowernewestwidget"
Output: ["lower", "new", "est", "wid", "g", "e", "t"]
At position 0 both "low" and "lower" match; the longer "lower" is chosen. After "wid", the remaining text is "get": no vocabulary token is a prefix of "get", "et", or "t", so each character is emitted as a fallback token.
Example 2
vocab = ["ab", "abc", "cd"]
text = "abcd"
Output: ["abc", "d"]
Greedy longest-match takes "abc" at position 0, leaving "d" unmatched — even though the alternative segmentation ["ab", "cd"] would have covered the whole string with vocabulary tokens. The tokenizer must not backtrack.
Example 3
vocab = ["hello"]
text = ""
Output: []
Constraints
0 <= len(text) <= 10^6
1 <= len(vocab) <= 10^5
1 <= len(vocab[i]) <= 100
; the sum of all token lengths does not exceed
10^6
text
and all tokens consist of printable ASCII characters (no whitespace inside vocabulary tokens is guaranteed only if the input provides none; treat characters generically)
Your solution should be efficient enough to handle the largest inputs: a scan that re-checks the whole vocabulary at every position is too slow. Aim for a data structure that lets you find the longest matching token starting at a given position in time proportional to the match length, such that the total running time is roughly O(len(text) * L + total vocabulary size), where L is the maximum token length.