How do I approach Coding & Algorithms interview questions?

Coding & Algorithms questions require understanding of core concepts and practice. PracHub provides solutions with explanations to help you master coding & algorithms interviews.

What difficulty level is this interview question?

This is a hard difficulty Coding & Algorithms question, commonly asked during Onsite rounds at xAI.

What role is this question designed for?

This question is commonly asked for Machine Learning Engineer candidates at xAI during technical interviews.

Greedy Longest-Match Tokenizer for an LLM Data Pipeline

Q: Greedy Longest-Match Tokenizer for an LLM Data Pipeline

This is a Coding & Algorithms interview question from xAI for Machine Learning Engineer roles. View the full question and solution on PracHub.

Company: xAI

Role: Machine Learning Engineer

Category: Coding & Algorithms

Difficulty: hard

Interview Round: Onsite

You are building a text pre-processing step for a large-language-model training pipeline. Before raw text can be fed into the model, it must be split into tokens drawn from a fixed vocabulary. Your job is to implement the tokenizer. You are given: - `vocab`: a list of distinct, non-empty token strings (the vocabulary). - `text`: the input string to tokenize. Tokenize `text` using **greedy longest-match** (maximal munch), scanning left to right: 1. At the current position `i`, find the **longest** token in `vocab` that is a prefix of `text[i:]`. 2. If such a token exists, emit it and advance `i` by its length. 3. If **no** vocabulary token matches at position `i`, emit the single character `text[i]` as its own fallback token and advance `i` by 1. 4. Repeat until the entire string is consumed. Return the resulting list of tokens, in order. Note that the algorithm is strictly greedy and never backtracks: it always takes the longest match at the current position, even if a shorter match would allow more of the remaining string to be covered by vocabulary tokens later. **Example 1** ``` vocab = ["low", "lower", "new", "est", "wid", "er"] text = "lowernewestwidget" Output: ["lower", "new", "est", "wid", "g", "e", "t"] ``` At position 0 both `"low"` and `"lower"` match; the longer `"lower"` is chosen. After `"wid"`, the remaining text is `"get"`: no vocabulary token is a prefix of `"get"`, `"et"`, or `"t"`, so each character is emitted as a fallback token. **Example 2** ``` vocab = ["ab", "abc", "cd"] text = "abcd" Output: ["abc", "d"] ``` Greedy longest-match takes `"abc"` at position 0, leaving `"d"` unmatched — even though the alternative segmentation `["ab", "cd"]` would have covered the whole string with vocabulary tokens. The tokenizer must not backtrack. **Example 3** ``` vocab = ["hello"] text = "" Output: [] ``` **Constraints** - `0 <= len(text) <= 10^6` - `1 <= len(vocab) <= 10^5` - `1 <= len(vocab[i]) <= 100`; the sum of all token lengths does not exceed `10^6` - All vocabulary tokens are distinct - `text` and all tokens consist of printable ASCII characters (no whitespace inside vocabulary tokens is guaranteed only if the input provides none; treat characters generically) Your solution should be efficient enough to handle the largest inputs: a scan that re-checks the whole vocabulary at every position is too slow. Aim for a data structure that lets you find the longest matching token starting at a given position in time proportional to the match length, such that the total running time is roughly `O(len(text) * L + total vocabulary size)`, where `L` is the maximum token length.

You are given:

vocab : a list of distinct, non-empty token strings (the vocabulary).
text : the input string to tokenize.

Tokenize text using greedy longest-match (maximal munch), scanning left to right:

At the current position i , find the longest token in vocab that is a prefix of text[i:] .
If such a token exists, emit it and advance i by its length.
If no vocabulary token matches at position i , emit the single character text[i] as its own fallback token and advance i by 1.
Repeat until the entire string is consumed.

Return the resulting list of tokens, in order.

Note that the algorithm is strictly greedy and never backtracks: it always takes the longest match at the current position, even if a shorter match would allow more of the remaining string to be covered by vocabulary tokens later.

Example 1

vocab = ["low", "lower", "new", "est", "wid", "er"]
text  = "lowernewestwidget"

Output: ["lower", "new", "est", "wid", "g", "e", "t"]

At position 0 both "low" and "lower" match; the longer "lower" is chosen. After "wid", the remaining text is "get": no vocabulary token is a prefix of "get", "et", or "t", so each character is emitted as a fallback token.

Example 2

vocab = ["ab", "abc", "cd"]
text  = "abcd"

Output: ["abc", "d"]

Greedy longest-match takes "abc" at position 0, leaving "d" unmatched — even though the alternative segmentation ["ab", "cd"] would have covered the whole string with vocabulary tokens. The tokenizer must not backtrack.

Example 3

vocab = ["hello"]
text  = ""

Output: []

Constraints

0 <= len(text) <= 10^6
1 <= len(vocab) <= 10^5
1 <= len(vocab[i]) <= 100 ; the sum of all token lengths does not exceed 10^6
All vocabulary tokens are distinct
text and all tokens consist of printable ASCII characters (no whitespace inside vocabulary tokens is guaranteed only if the input provides none; treat characters generically)

Your solution should be efficient enough to handle the largest inputs: a scan that re-checks the whole vocabulary at every position is too slow. Aim for a data structure that lets you find the longest matching token starting at a given position in time proportional to the match length, such that the total running time is roughly O(len(text) * L + total vocabulary size), where L is the maximum token length.

Greedy Longest-Match Tokenizer for an LLM Data Pipeline

Company: xAI

Role: Machine Learning Engineer

Category: Coding & Algorithms

Difficulty: hard

Interview Round: Onsite

You are given:

vocab : a list of distinct, non-empty token strings (the vocabulary).
text : the input string to tokenize.

Tokenize text using greedy longest-match (maximal munch), scanning left to right:

At the current position i , find the longest token in vocab that is a prefix of text[i:] .
If such a token exists, emit it and advance i by its length.
If no vocabulary token matches at position i , emit the single character text[i] as its own fallback token and advance i by 1.
Repeat until the entire string is consumed.

Return the resulting list of tokens, in order.

Example 1

vocab = ["low", "lower", "new", "est", "wid", "er"]
text  = "lowernewestwidget"

Output: ["lower", "new", "est", "wid", "g", "e", "t"]

Example 2

vocab = ["ab", "abc", "cd"]
text  = "abcd"

Output: ["abc", "d"]

Example 3

vocab = ["hello"]
text  = ""

Output: []

Constraints

0 <= len(text) <= 10^6
1 <= len(vocab) <= 10^5
1 <= len(vocab[i]) <= 100 ; the sum of all token lengths does not exceed 10^6
All vocabulary tokens are distinct
text and all tokens consist of printable ASCII characters (no whitespace inside vocabulary tokens is guaranteed only if the input provides none; treat characters generically)

Greedy Longest-Match Tokenizer for an LLM Data Pipeline

Greedy Longest-Match Tokenizer for an LLM Data Pipeline

Submit Your Answer to Earn 20XP

Greedy Longest-Match Tokenizer for an LLM Data Pipeline

Greedy Longest-Match Tokenizer for an LLM Data Pipeline

Submit Your Answer to Earn 20XP