PracHub
QuestionsCoachesLearningGuidesInterview Prep

Quick Overview

This question evaluates understanding of string-processing and greedy maximal-munch tokenization, as well as competency in designing efficient prefix-matching data structures and handling large-scale text preprocessing for LLM pipelines.

  • hard
  • xAI
  • Coding & Algorithms
  • Machine Learning Engineer

Greedy Longest-Match Tokenizer for an LLM Data Pipeline

Company: xAI

Role: Machine Learning Engineer

Category: Coding & Algorithms

Difficulty: hard

Interview Round: Onsite

You are building a text pre-processing step for a large-language-model training pipeline. Before raw text can be fed into the model, it must be split into tokens drawn from a fixed vocabulary. Your job is to implement the tokenizer. You are given: - `vocab`: a list of distinct, non-empty token strings (the vocabulary). - `text`: the input string to tokenize. Tokenize `text` using **greedy longest-match** (maximal munch), scanning left to right: 1. At the current position `i`, find the **longest** token in `vocab` that is a prefix of `text[i:]`. 2. If such a token exists, emit it and advance `i` by its length. 3. If **no** vocabulary token matches at position `i`, emit the single character `text[i]` as its own fallback token and advance `i` by 1. 4. Repeat until the entire string is consumed. Return the resulting list of tokens, in order. Note that the algorithm is strictly greedy and never backtracks: it always takes the longest match at the current position, even if a shorter match would allow more of the remaining string to be covered by vocabulary tokens later. **Example 1** ``` vocab = ["low", "lower", "new", "est", "wid", "er"] text = "lowernewestwidget" Output: ["lower", "new", "est", "wid", "g", "e", "t"] ``` At position 0 both `"low"` and `"lower"` match; the longer `"lower"` is chosen. After `"wid"`, the remaining text is `"get"`: no vocabulary token is a prefix of `"get"`, `"et"`, or `"t"`, so each character is emitted as a fallback token. **Example 2** ``` vocab = ["ab", "abc", "cd"] text = "abcd" Output: ["abc", "d"] ``` Greedy longest-match takes `"abc"` at position 0, leaving `"d"` unmatched — even though the alternative segmentation `["ab", "cd"]` would have covered the whole string with vocabulary tokens. The tokenizer must not backtrack. **Example 3** ``` vocab = ["hello"] text = "" Output: [] ``` **Constraints** - `0 <= len(text) <= 10^6` - `1 <= len(vocab) <= 10^5` - `1 <= len(vocab[i]) <= 100`; the sum of all token lengths does not exceed `10^6` - All vocabulary tokens are distinct - `text` and all tokens consist of printable ASCII characters (no whitespace inside vocabulary tokens is guaranteed only if the input provides none; treat characters generically) Your solution should be efficient enough to handle the largest inputs: a scan that re-checks the whole vocabulary at every position is too slow. Aim for a data structure that lets you find the longest matching token starting at a given position in time proportional to the match length, such that the total running time is roughly `O(len(text) * L + total vocabulary size)`, where `L` is the maximum token length.

Quick Answer: This question evaluates understanding of string-processing and greedy maximal-munch tokenization, as well as competency in designing efficient prefix-matching data structures and handling large-scale text preprocessing for LLM pipelines.

You are building a text pre-processing step for a large-language-model training pipeline. Before raw text can be fed into the model, it must be split into tokens drawn from a fixed vocabulary. Implement the tokenizer. You are given: - `vocab`: a list of distinct, non-empty token strings (the vocabulary). - `text`: the input string to tokenize. Tokenize `text` using **greedy longest-match** (maximal munch), scanning left to right: 1. At the current position `i`, find the **longest** token in `vocab` that is a prefix of `text[i:]`. 2. If such a token exists, emit it and advance `i` by its length. 3. If **no** vocabulary token matches at position `i`, emit the single character `text[i]` as its own fallback token and advance `i` by 1. 4. Repeat until the entire string is consumed. Return the resulting list of tokens, in order. The algorithm is strictly greedy and never backtracks: it always takes the longest match at the current position, even if a shorter match would allow more of the remaining string to be covered by vocabulary tokens later. **Example 1** ``` vocab = ["low", "lower", "new", "est", "wid", "er"] text = "lowernewestwidget" Output: ["lower", "new", "est", "wid", "g", "e", "t"] ``` At position 0 both "low" and "lower" match; the longer "lower" is chosen. After "wid", the remaining text is "get": no token is a prefix of "get", "et", or "t", so each character is emitted as a fallback token. **Example 2** ``` vocab = ["ab", "abc", "cd"] text = "abcd" Output: ["abc", "d"] ``` Greedy longest-match takes "abc" at position 0, leaving "d" unmatched — even though ["ab", "cd"] would cover the whole string. The tokenizer must not backtrack. **Example 3** ``` vocab = ["hello"] text = "" Output: [] ``` Aim for a data structure (a trie) that finds the longest matching token starting at a given position in time proportional to the match length, for total running time roughly O(len(text) * L + total vocabulary size), where L is the maximum token length.

Constraints

  • 0 <= len(text) <= 10^6
  • 1 <= len(vocab) <= 10^5
  • 1 <= len(vocab[i]) <= 100; the sum of all token lengths does not exceed 10^6
  • All vocabulary tokens are distinct
  • text and all tokens consist of printable ASCII characters

Examples

Input: (["low", "lower", "new", "est", "wid", "er"], "lowernewestwidget")

Expected Output: ["lower", "new", "est", "wid", "g", "e", "t"]

Explanation: At position 0, both 'low' and 'lower' are prefixes; greedy longest-match picks 'lower'. 'new' and 'est' and 'wid' match next. The remaining 'get' has no matching prefix at any position, so 'g', 'e', 't' are emitted as single-character fallbacks.

Input: (["ab", "abc", "cd"], "abcd")

Expected Output: ["abc", "d"]

Explanation: 'abc' is the longest prefix match at position 0, consuming through index 3. That leaves 'd', which matches no token, so it is a fallback. The tokenizer does not backtrack to the alternative ['ab', 'cd'].

Input: (["hello"], "")

Expected Output: []

Explanation: Empty input yields an empty token list.

Input: (["ab"], "xyz")

Expected Output: ["x", "y", "z"]

Explanation: No vocabulary token is a prefix of 'xyz' at any position, so every character is emitted as its own fallback token.

Input: (["a"], "aaa")

Expected Output: ["a", "a", "a"]

Explanation: The single-character token 'a' matches at each position, producing three 'a' tokens.

Input: (["er"], "lowerer")

Expected Output: ["l", "o", "w", "er", "er"]

Explanation: 'l', 'o', 'w' fall back to single characters since no token matches there; then 'er' matches twice for the trailing 'erer'.

Input: (["aa", "aaa", "aaaa"], "aaaaa")

Expected Output: ["aaaa", "a"]

Explanation: Overlapping-length tokens: the longest prefix at position 0 is 'aaaa', leaving a single 'a' which matches no longer token but note only 'aa','aaa','aaaa' exist — the trailing single 'a' has no vocabulary match, so it is a fallback token.

Hints

  1. Build a trie over the vocabulary once. From each starting position, walk the trie character by character and remember the end index of the last position that completed a full vocabulary token — that is the longest match.
  2. If no token completes before the walk stops, emit the single character text[i] as a fallback and advance by 1. Never backtrack: always take the longest match at the current position, then continue from where it ends.
  3. Re-scanning the entire vocabulary at every position is O(len(text) * |vocab| * L) and too slow; the trie collapses shared prefixes so each position costs only O(L).
Last updated: Jul 2, 2026

Loading coding console...

PracHub

Master your tech interviews with 8,000+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • AI Coding Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.

Related Coding Questions

  • Compute dasher pay from order events - xAI (nan)
  • Flatten and unflatten nested Python structures - xAI (nan)
  • Compute total active time per Twitter Space - xAI (medium)
  • Design a Recoverable Iterator - xAI (medium)
  • Implement Distributed Matrix Multiplication - xAI (hard)