PracHub
QuestionsCoachesLearningGuidesInterview Prep
|Home/Coding & Algorithms/xAI

Greedy Longest-Match Tokenizer for an LLM Data Pipeline

Last updated: Jul 2, 2026

Greedy Longest-Match Tokenizer for an LLM Data Pipeline

Company: xAI

Role: Machine Learning Engineer

Category: Coding & Algorithms

Difficulty: hard

Interview Round: Onsite

You are building a text pre-processing step for a large-language-model training pipeline. Before raw text can be fed into the model, it must be split into tokens drawn from a fixed vocabulary. Your job is to implement the tokenizer. You are given: - `vocab`: a list of distinct, non-empty token strings (the vocabulary). - `text`: the input string to tokenize. Tokenize `text` using **greedy longest-match** (maximal munch), scanning left to right: 1. At the current position `i`, find the **longest** token in `vocab` that is a prefix of `text[i:]`. 2. If such a token exists, emit it and advance `i` by its length. 3. If **no** vocabulary token matches at position `i`, emit the single character `text[i]` as its own fallback token and advance `i` by 1. 4. Repeat until the entire string is consumed. Return the resulting list of tokens, in order. Note that the algorithm is strictly greedy and never backtracks: it always takes the longest match at the current position, even if a shorter match would allow more of the remaining string to be covered by vocabulary tokens later. **Example 1** ``` vocab = ["low", "lower", "new", "est", "wid", "er"] text = "lowernewestwidget" Output: ["lower", "new", "est", "wid", "g", "e", "t"] ``` At position 0 both `"low"` and `"lower"` match; the longer `"lower"` is chosen. After `"wid"`, the remaining text is `"get"`: no vocabulary token is a prefix of `"get"`, `"et"`, or `"t"`, so each character is emitted as a fallback token. **Example 2** ``` vocab = ["ab", "abc", "cd"] text = "abcd" Output: ["abc", "d"] ``` Greedy longest-match takes `"abc"` at position 0, leaving `"d"` unmatched — even though the alternative segmentation `["ab", "cd"]` would have covered the whole string with vocabulary tokens. The tokenizer must not backtrack. **Example 3** ``` vocab = ["hello"] text = "" Output: [] ``` **Constraints** - `0 <= len(text) <= 10^6` - `1 <= len(vocab) <= 10^5` - `1 <= len(vocab[i]) <= 100`; the sum of all token lengths does not exceed `10^6` - All vocabulary tokens are distinct - `text` and all tokens consist of printable ASCII characters (no whitespace inside vocabulary tokens is guaranteed only if the input provides none; treat characters generically) Your solution should be efficient enough to handle the largest inputs: a scan that re-checks the whole vocabulary at every position is too slow. Aim for a data structure that lets you find the longest matching token starting at a given position in time proportional to the match length, such that the total running time is roughly `O(len(text) * L + total vocabulary size)`, where `L` is the maximum token length.

Related Interview Questions

  • Flatten and unflatten nested Python structures - xAI (nan)
  • Compute dasher pay from order events - xAI (nan)
  • Compute total active time per Twitter Space - xAI (medium)
  • Design a Recoverable Iterator - xAI (medium)
  • Implement Distributed Matrix Multiplication - xAI (hard)
|Home/Coding & Algorithms/xAI

Greedy Longest-Match Tokenizer for an LLM Data Pipeline

xAI logo
xAI
May 30, 2025, 12:00 AM
hardMachine Learning EngineerOnsiteCoding & Algorithms
0
0

You are building a text pre-processing step for a large-language-model training pipeline. Before raw text can be fed into the model, it must be split into tokens drawn from a fixed vocabulary. Your job is to implement the tokenizer.

You are given:

  • vocab : a list of distinct, non-empty token strings (the vocabulary).
  • text : the input string to tokenize.

Tokenize text using greedy longest-match (maximal munch), scanning left to right:

  1. At the current position i , find the longest token in vocab that is a prefix of text[i:] .
  2. If such a token exists, emit it and advance i by its length.
  3. If no vocabulary token matches at position i , emit the single character text[i] as its own fallback token and advance i by 1.
  4. Repeat until the entire string is consumed.

Return the resulting list of tokens, in order.

Note that the algorithm is strictly greedy and never backtracks: it always takes the longest match at the current position, even if a shorter match would allow more of the remaining string to be covered by vocabulary tokens later.

Example 1

vocab = ["low", "lower", "new", "est", "wid", "er"]
text  = "lowernewestwidget"

Output: ["lower", "new", "est", "wid", "g", "e", "t"]

At position 0 both "low" and "lower" match; the longer "lower" is chosen. After "wid", the remaining text is "get": no vocabulary token is a prefix of "get", "et", or "t", so each character is emitted as a fallback token.

Example 2

vocab = ["ab", "abc", "cd"]
text  = "abcd"

Output: ["abc", "d"]

Greedy longest-match takes "abc" at position 0, leaving "d" unmatched — even though the alternative segmentation ["ab", "cd"] would have covered the whole string with vocabulary tokens. The tokenizer must not backtrack.

Example 3

vocab = ["hello"]
text  = ""

Output: []

Constraints

  • 0 <= len(text) <= 10^6
  • 1 <= len(vocab) <= 10^5
  • 1 <= len(vocab[i]) <= 100 ; the sum of all token lengths does not exceed 10^6
  • All vocabulary tokens are distinct
  • text and all tokens consist of printable ASCII characters (no whitespace inside vocabulary tokens is guaranteed only if the input provides none; treat characters generically)

Your solution should be efficient enough to handle the largest inputs: a scan that re-checks the whole vocabulary at every position is too slow. Aim for a data structure that lets you find the longest matching token starting at a given position in time proportional to the match length, such that the total running time is roughly O(len(text) * L + total vocabulary size), where L is the maximum token length.

Submit Your Answer to Earn 20XP

Sign in to leave a comment

Loading comments...

Browse More Questions

More Coding & Algorithms•More xAI•More Machine Learning Engineer•xAI Machine Learning Engineer•xAI Coding & Algorithms•Machine Learning Engineer Coding & Algorithms
PracHub

Master your tech interviews with 8,000+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • AI Coding Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.