# Design and Implement a Trie-Based Subword Tokenizer for LLM Pretraining

## Context
You are building a subword tokenizer for a large-scale LLM pretraining pipeline. The tokenizer must be deterministic, fast, memory-efficient, Unicode-safe, and production-ready.
## Requirements

### API

- `build(vocab, config) -> Tokenizer`
- `tokenize(text) -> List[int]` (token IDs)
- `detokenize(ids) -> str` (text)
- Optional: `encode(text) -> List[str]` (pieces), `decode(pieces) -> str`
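One plausible shape for this surface, as a minimal sketch; the `TokenizerConfig` fields are illustrative assumptions, not part of the spec:

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass(frozen=True)
class TokenizerConfig:
    # Illustrative fields; the real config set is up to the implementer.
    normalization: str = "NFKC"
    lowercase: bool = False
    byte_fallback: bool = True

class Tokenizer:
    def __init__(self, vocab: Dict[str, int], config: TokenizerConfig):
        self.vocab = vocab                                   # piece -> id
        self.id_to_piece = {i: p for p, i in vocab.items()}  # id -> piece
        self.config = config

    def tokenize(self, text: str) -> List[int]:
        """Text -> token IDs (greedy longest-prefix matching; see below)."""
        raise NotImplementedError

    def detokenize(self, ids: List[int]) -> str:
        """Token IDs -> text (inverse of tokenize, up to normalization)."""
        return "".join(self.id_to_piece[i] for i in ids)

def build(vocab: Dict[str, int], config: TokenizerConfig) -> Tokenizer:
    """Construct a tokenizer from a fixed piece->id vocabulary."""
    return Tokenizer(vocab, config)
```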
### Core Algorithm

- Implement greedy longest-prefix matching using a prefix trie over a fixed vocabulary of subword strings (see the sketch after this list).
- Match over Unicode code points (not bytes) to correctly handle multi-byte UTF-8, emojis, and CJK.
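A minimal sketch of the trie and the greedy matcher, iterating over Python `str` (i.e., code points); the `<unk>` fallback shown here is a placeholder for the fallback section below:

```python
from typing import Dict, List, Optional

class TrieNode:
    __slots__ = ("children", "token_id")

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}  # keyed by one code point
        self.token_id: Optional[int] = None        # set iff a vocab piece ends here

def build_trie(vocab: Dict[str, int]) -> TrieNode:
    root = TrieNode()
    for piece, tid in vocab.items():
        node = root
        for cp in piece:                  # Python str iterates code points
            node = node.children.setdefault(cp, TrieNode())
        node.token_id = tid
    return root

def greedy_tokenize(text: str, root: TrieNode, unk_id: int) -> List[int]:
    ids: List[int] = []
    i, n = 0, len(text)
    while i < n:
        node, j = root, i
        best_id, best_end = None, i
        # Walk as deep as the text allows, remembering the longest
        # prefix that ends on an actual vocabulary piece.
        while j < n and text[j] in node.children:
            node = node.children[text[j]]
            j += 1
            if node.token_id is not None:
                best_id, best_end = node.token_id, j
        if best_id is None:
            ids.append(unk_id)   # no match: emit <unk>, advance one code point
            i += 1
        else:
            ids.append(best_id)
            i = best_end
    return ids
```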
### Unicode Handling

- Correctly process multi-byte UTF-8, emojis (including ZWJ sequences and variation selectors), CJK, RTL scripts, and combining marks.
- Specify whether matching operates on bytes, code points, or grapheme clusters, and why (the snippet below shows how the three units differ).
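To make the choice of unit concrete: the same family emoji is one grapheme cluster, five code points, and eighteen UTF-8 bytes.

```python
family = "\U0001F469\u200D\U0001F469\u200D\U0001F467"  # 👩‍👩‍👧 (ZWJ sequence)

print(len(family.encode("utf-8")))  # 18 bytes
print(len(family))                  # 5 code points (3 emoji + 2 ZWJ)
# Grapheme clusters need a segmentation library, e.g. the third-party
# `regex` package: len(regex.findall(r"\X", family)) == 1
```

Matching on code points keeps the trie small and the matcher simple; grapheme-cluster segmentation can still be applied as a pre-tokenization step if pieces must never split a user-perceived character.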
### Normalization and Casing

- Define the normalization choice (e.g., NFKC) and discuss its impact on reversibility.
- Define the casing policy (e.g., preserve case vs. lowercasing). Call out the implications for detokenization fidelity.
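A sketch of the normalization step using the standard library, with two examples of why NFKC costs exact reversibility (`normalize` is an illustrative helper name):

```python
import unicodedata

def normalize(text: str, form: str = "NFKC", lowercase: bool = False) -> str:
    out = unicodedata.normalize(form, text)
    return out.lower() if lowercase else out

# NFKC folds compatibility characters, so detokenize(tokenize(x)) can
# differ from x even with perfect vocabulary coverage:
print(normalize("\uFB01le"))  # "file" - the U+FB01 "fi" ligature is folded away
print(normalize("\u2460"))    # "1"    - circled digit one is folded away
```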
### Whitespace and Punctuation

- Define the rules (e.g., SentencePiece-style "▁" space marking or GPT-style leading-space tokens); one variant is sketched after this list.
- Clarify how runs of multiple spaces, tabs, newlines, and punctuation are tokenized.
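One reversible variant of SentencePiece-style space marking, as a sketch (helper names are illustrative; real SentencePiece differs in details such as dummy-prefix handling):

```python
SPACE_MARK = "\u2581"  # "▁", the SentencePiece space marker

def mark_spaces(text: str) -> str:
    # Prepend one marker and make every space visible, so runs of
    # spaces keep their exact length: "a  b" -> "▁a▁▁b".
    return SPACE_MARK + text.replace(" ", SPACE_MARK)

def unmark_spaces(marked: str) -> str:
    restored = marked.replace(SPACE_MARK, " ")
    return restored[1:]  # drop exactly the one marker we prepended

assert unmark_spaces(mark_spaces("a  b\t\nc")) == "a  b\t\nc"
```

Tabs and newlines pass through unchanged in this sketch; the spec asks you to decide whether they get their own markers or dedicated tokens.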
### Unknown-Token Fallback

- Provide a robust fallback when no subword matches at a position.
- Options: a dedicated `<unk>` token vs. byte-level fallback to ensure total coverage and reversibility (one sketch follows).
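A sketch of the byte-level option, assuming the vocabulary reserves 256 pieces named like `<0xE2>` (the SentencePiece convention); it replaces the `<unk>` branch in the matcher above:

```python
from typing import Dict, List

def byte_fallback_ids(ch: str, vocab: Dict[str, int]) -> List[int]:
    """Emit one token per UTF-8 byte of an unmatched code point.
    Every input becomes representable and detokenization stays exact."""
    return [vocab[f"<0x{b:02X}>"] for b in ch.encode("utf-8")]

# In greedy_tokenize, instead of ids.append(unk_id):
#   ids.extend(byte_fallback_ids(text[i], vocab)); i += 1
```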
### Complexity Analysis

- Give time and space complexity for `build`, `tokenize`, and `detokenize`.
- Give practical memory estimates and throughput considerations (a back-of-envelope estimate follows this list).
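For the practical side, a back-of-envelope memory estimate under stated assumptions; the per-node and per-edge costs are rough CPython figures, not measurements:

```python
def estimate_trie_bytes(num_nodes: int, num_edges: int,
                        node_bytes: int = 64, edge_bytes: int = 100) -> int:
    """Rough upper bound for a dict-of-children trie in CPython.
    Assumes ~64 B per node object and ~100 B of dict overhead per edge;
    real costs vary with Python version and node fan-out."""
    return num_nodes * node_bytes + num_edges * edge_bytes

# A 64k-piece vocab averaging ~6 code points per piece has at most
# ~384k nodes (and as many edges), so roughly:
print(estimate_trie_bytes(384_000, 384_000) / 1e6)  # ~63 MB upper bound
```

For time, the greedy matcher above is O(n · L_max) in the worst case, since each position scans at most the longest vocabulary piece before committing.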
### Incremental Vocabulary Updates

- Explain how to add tokens incrementally without breaking determinism (one strategy is sketched below).
- Cover versioning, hot-swap strategies, and the implications for previously tokenized data.
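One determinism-preserving strategy, as a sketch: append-only IDs plus an explicit version stamp, so shards tokenized earlier can always be decoded with the exact vocabulary that produced them (names are illustrative):

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass(frozen=True)
class VocabVersion:
    version: int            # stamped into every tokenized shard
    pieces: Dict[str, int]  # piece -> id; IDs are appended, never reassigned

def extend(base: VocabVersion, new_pieces: List[str]) -> VocabVersion:
    pieces = dict(base.pieces)
    next_id = max(pieces.values(), default=-1) + 1
    for p in new_pieces:
        if p not in pieces:  # existing pieces keep their IDs
            pieces[p] = next_id
            next_id += 1
    return VocabVersion(base.version + 1, pieces)
```

Note that new, longer pieces can change greedy matching for old text, so previously tokenized data is only reproducible if each shard records the vocab version it was produced with.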
### Testing Strategy

- Propose comprehensive tests for tricky inputs: emojis (ZWJ sequences, skin tones), CJK without spaces, combining marks, RTL scripts, invalid UTF-8, whitespace runs, URLs, code, etc. A parametrized round-trip test is sketched below.
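A sketch of a parametrized round-trip test; it assumes a `pytest` fixture `tok` that returns a built tokenizer. Invalid UTF-8 cannot appear in a Python `str`, so it needs a separate bytes-level entry point:

```python
import pytest

TRICKY = [
    "\U0001F44D\U0001F3FD \U0001F469\u200D\U0001F469\u200D\U0001F467",  # skin tone + ZWJ family
    "日本語テキストと中文没有空格",      # CJK without spaces
    "e\u0301le\u0300ve",                 # combining marks (decomposed accents)
    "\u05E9\u05DC\u05D5\u05DD world",    # RTL (Hebrew) mixed with LTR
    "a  b\t\tc\n\nd",                    # whitespace runs
    "https://example.com/a?b=c&d=%20",   # URL
    "def f(x):\n    return x * 2",       # code
]

@pytest.mark.parametrize("text", TRICKY)
def test_round_trip(tok, text):
    # With byte fallback and a reversible normalization form, detokenize
    # should reproduce the input exactly; with NFKC, assert the documented
    # canonical form instead.
    assert tok.detokenize(tok.tokenize(text)) == text
```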
### Comparison to Alternatives

- Compare this trie-based approach to BPE, WordPiece, and Unigram.
- Discuss the throughput and memory trade-offs in production.