How do I practice coding and algorithm questions?

Use PracHub's coding console to write, test, and debug your solutions in Python or JavaScript. View hints, test against sample inputs, and compare with official solutions.

What difficulty level is this coding question?

This is a hard difficulty Coding & Algorithms question, commonly asked during Onsite rounds at xAI.

What role is this question designed for?

This question is commonly asked for Machine Learning Engineer candidates at xAI during technical interviews.

Greedy Longest-Match Tokenizer for an LLM Data Pipeline

Q: Greedy Longest-Match Tokenizer for an LLM Data Pipeline

This question evaluates understanding of string-processing and greedy maximal-munch tokenization, as well as competency in designing efficient prefix-matching data structures and handling large-scale text preprocessing for LLM pipelines.

Company: xAI

Role: Machine Learning Engineer

Category: Coding & Algorithms

Difficulty: hard

Interview Round: Onsite

You are building a text pre-processing step for a large-language-model training pipeline. Before raw text can be fed into the model, it must be split into tokens drawn from a fixed vocabulary. Your job is to implement the tokenizer. You are given: - `vocab`: a list of distinct, non-empty token strings (the vocabulary). - `text`: the input string to tokenize. Tokenize `text` using **greedy longest-match** (maximal munch), scanning left to right: 1. At the current position `i`, find the **longest** token in `vocab` that is a prefix of `text[i:]`. 2. If such a token exists, emit it and advance `i` by its length. 3. If **no** vocabulary token matches at position `i`, emit the single character `text[i]` as its own fallback token and advance `i` by 1. 4. Repeat until the entire string is consumed. Return the resulting list of tokens, in order. Note that the algorithm is strictly greedy and never backtracks: it always takes the longest match at the current position, even if a shorter match would allow more of the remaining string to be covered by vocabulary tokens later. **Example 1** ``` vocab = ["low", "lower", "new", "est", "wid", "er"] text = "lowernewestwidget" Output: ["lower", "new", "est", "wid", "g", "e", "t"] ``` At position 0 both `"low"` and `"lower"` match; the longer `"lower"` is chosen. After `"wid"`, the remaining text is `"get"`: no vocabulary token is a prefix of `"get"`, `"et"`, or `"t"`, so each character is emitted as a fallback token. **Example 2** ``` vocab = ["ab", "abc", "cd"] text = "abcd" Output: ["abc", "d"] ``` Greedy longest-match takes `"abc"` at position 0, leaving `"d"` unmatched — even though the alternative segmentation `["ab", "cd"]` would have covered the whole string with vocabulary tokens. The tokenizer must not backtrack. **Example 3** ``` vocab = ["hello"] text = "" Output: [] ``` **Constraints** - `0 <= len(text) <= 10^6` - `1 <= len(vocab) <= 10^5` - `1 <= len(vocab[i]) <= 100`; the sum of all token lengths does not exceed `10^6` - All vocabulary tokens are distinct - `text` and all tokens consist of printable ASCII characters (no whitespace inside vocabulary tokens is guaranteed only if the input provides none; treat characters generically) Your solution should be efficient enough to handle the largest inputs: a scan that re-checks the whole vocabulary at every position is too slow. Aim for a data structure that lets you find the longest matching token starting at a given position in time proportional to the match length, such that the total running time is roughly `O(len(text) * L + total vocabulary size)`, where `L` is the maximum token length.

Quick Answer: This question evaluates understanding of string-processing and greedy maximal-munch tokenization, as well as competency in designing efficient prefix-matching data structures and handling large-scale text preprocessing for LLM pipelines.

You are building a text pre-processing step for a large-language-model training pipeline. Before raw text can be fed into the model, it must be split into tokens drawn from a fixed vocabulary. Implement the tokenizer. You are given: - `vocab`: a list of distinct, non-empty token strings (the vocabulary). - `text`: the input string to tokenize. Tokenize `text` using **greedy longest-match** (maximal munch), scanning left to right: 1. At the current position `i`, find the **longest** token in `vocab` that is a prefix of `text[i:]`. 2. If such a token exists, emit it and advance `i` by its length. 3. If **no** vocabulary token matches at position `i`, emit the single character `text[i]` as its own fallback token and advance `i` by 1. 4. Repeat until the entire string is consumed. Return the resulting list of tokens, in order. The algorithm is strictly greedy and never backtracks: it always takes the longest match at the current position, even if a shorter match would allow more of the remaining string to be covered by vocabulary tokens later. **Example 1** ``` vocab = ["low", "lower", "new", "est", "wid", "er"] text = "lowernewestwidget" Output: ["lower", "new", "est", "wid", "g", "e", "t"] ``` At position 0 both "low" and "lower" match; the longer "lower" is chosen. After "wid", the remaining text is "get": no token is a prefix of "get", "et", or "t", so each character is emitted as a fallback token. **Example 2** ``` vocab = ["ab", "abc", "cd"] text = "abcd" Output: ["abc", "d"] ``` Greedy longest-match takes "abc" at position 0, leaving "d" unmatched — even though ["ab", "cd"] would cover the whole string. The tokenizer must not backtrack. **Example 3** ``` vocab = ["hello"] text = "" Output: [] ``` Aim for a data structure (a trie) that finds the longest matching token starting at a given position in time proportional to the match length, for total running time roughly O(len(text) * L + total vocabulary size), where L is the maximum token length.

Constraints

0 <= len(text) <= 10^6
1 <= len(vocab) <= 10^5
1 <= len(vocab[i]) <= 100; the sum of all token lengths does not exceed 10^6
All vocabulary tokens are distinct
text and all tokens consist of printable ASCII characters

Examples

Input: (["low", "lower", "new", "est", "wid", "er"], "lowernewestwidget")

Expected Output: ["lower", "new", "est", "wid", "g", "e", "t"]

Explanation: At position 0, both 'low' and 'lower' are prefixes; greedy longest-match picks 'lower'. 'new' and 'est' and 'wid' match next. The remaining 'get' has no matching prefix at any position, so 'g', 'e', 't' are emitted as single-character fallbacks.

Input: (["ab", "abc", "cd"], "abcd")

Expected Output: ["abc", "d"]

Explanation: 'abc' is the longest prefix match at position 0, consuming through index 3. That leaves 'd', which matches no token, so it is a fallback. The tokenizer does not backtrack to the alternative ['ab', 'cd'].

Input: (["hello"], "")

Expected Output: []

Explanation: Empty input yields an empty token list.

Input: (["ab"], "xyz")

Expected Output: ["x", "y", "z"]

Explanation: No vocabulary token is a prefix of 'xyz' at any position, so every character is emitted as its own fallback token.

Input: (["a"], "aaa")

Expected Output: ["a", "a", "a"]

Explanation: The single-character token 'a' matches at each position, producing three 'a' tokens.

Input: (["er"], "lowerer")

Expected Output: ["l", "o", "w", "er", "er"]

Explanation: 'l', 'o', 'w' fall back to single characters since no token matches there; then 'er' matches twice for the trailing 'erer'.

Input: (["aa", "aaa", "aaaa"], "aaaaa")

Expected Output: ["aaaa", "a"]

Explanation: Overlapping-length tokens: the longest prefix at position 0 is 'aaaa', leaving a single 'a' which matches no longer token but note only 'aa','aaa','aaaa' exist — the trailing single 'a' has no vocabulary match, so it is a fallback token.

Hints

Build a trie over the vocabulary once. From each starting position, walk the trie character by character and remember the end index of the last position that completed a full vocabulary token — that is the longest match.
If no token completes before the walk stops, emit the single character text[i] as a fallback and advance by 1. Never backtrack: always take the longest match at the current position, then continue from where it ends.
Re-scanning the entire vocabulary at every position is O(len(text) * |vocab| * L) and too slow; the trie collapses shared prefixes so each position costs only O(L).

Quick Overview

This question evaluates understanding of string-processing and greedy maximal-munch tokenization, as well as competency in designing efficient prefix-matching data structures and handling large-scale text preprocessing for LLM pipelines.