PracHub
QuestionsPremiumCoachesLearningGuidesInterview Prep

Quick Overview

This question evaluates string-processing and algorithmic implementation skills for greedy longest-match tokenization, including handling runs of unmatched characters and reducing unnecessary comparisons when the vocabulary is small.

  • medium
  • Anthropic
  • Coding & Algorithms
  • Software Engineer

Implement a longest-match tokenizer

Company: Anthropic

Role: Software Engineer

Category: Coding & Algorithms

Difficulty: medium

Interview Round: Technical Screen

You are given: - A vocabulary that maps strings to integer token IDs. - An input string `text`. Tokenize `text` from left to right using greedy longest-match rules: 1. At each position, choose the longest vocabulary entry that matches the substring starting at that position. 2. Output its token ID and advance by the matched length. 3. If no vocabulary entry matches at the current position, output `-1` and advance by one character. Follow-up questions: 1. Modify the output so that consecutive unmatched characters are represented by a single `-1` instead of repeated `-1` values. 2. If the vocabulary is very small, how can you reduce unnecessary comparisons in a simple two-loop solution without using a trie? For example, if unmatched characters appear in a row, the follow-up behavior should emit one `-1` for the whole unmatched run.

Quick Answer: This question evaluates string-processing and algorithmic implementation skills for greedy longest-match tokenization, including handling runs of unmatched characters and reducing unnecessary comparisons when the vocabulary is small.

Part 1: Implement greedy longest-match tokenizer

You are given a vocabulary that maps non-empty strings to integer token IDs, and a string text. Tokenize text from left to right using greedy longest-match rules. At each position, choose the longest vocabulary entry that matches the substring starting there. Output its token ID and advance by the matched length. If no vocabulary entry matches at the current position, output -1 and advance by one character. Return the full list of token IDs.

Constraints

  • 0 <= len(text) <= 10^4
  • 0 <= len(vocab) <= 10^3
  • All vocabulary keys are unique, non-empty strings
  • Token IDs are integers

Examples

Input: ({'a': 1, 'ab': 2, 'abc': 3, 'bc': 4}, 'abcabx')

Expected Output: [3, 2, -1]

Explanation: At index 0, 'abc' is the longest match. Then 'ab' matches at index 3. 'x' is unmatched.

Input: ({'cat': 10, 'c': 1, 'at': 2}, 'catat')

Expected Output: [10, 2]

Explanation: 'cat' is chosen over 'c' at the start because it is longer. Then 'at' matches.

Input: ({'ab': 1, 'aba': 2, 'ba': 3}, 'ababa')

Expected Output: [2, 3]

Explanation: 'aba' matches first, then 'ba' matches the remaining suffix.

Input: ({'a': 1}, '')

Expected Output: []

Explanation: Empty text produces no tokens.

Input: ({}, 'hi')

Expected Output: [-1, -1]

Explanation: With an empty vocabulary, each character is unmatched.

Hints

  1. If you check vocabulary entries in descending order of length, the first match you find at a position is the greedy longest match.
  2. When nothing matches, append -1 and move forward by exactly one character.

Part 2: Collapse consecutive unmatched characters into one -1

You are given a vocabulary that maps non-empty strings to integer token IDs, and a string text. Tokenize text from left to right using greedy longest-match rules. At each position, choose the longest vocabulary entry that matches the substring starting there. Output its token ID and advance by the matched length. If no vocabulary entry matches, treat that character as unmatched and advance by one character. However, if multiple unmatched characters occur consecutively, output only one -1 for the entire unmatched run.

Constraints

  • 0 <= len(text) <= 10^4
  • 0 <= len(vocab) <= 10^3
  • All vocabulary keys are unique, non-empty strings
  • Token IDs are integers

Examples

Input: ({'ab': 1, 'c': 2}, 'abxxczzz')

Expected Output: [1, -1, 2, -1]

Explanation: 'ab' matches, then 'xx' is one unmatched run, then 'c' matches, then 'zzz' is another unmatched run.

Input: ({'a': 1, 'aa': 2}, 'bbaaaac')

Expected Output: [-1, 2, 2, -1]

Explanation: 'bb' becomes one -1, then 'aa' and 'aa' match greedily, and the final 'c' becomes one -1.

Input: ({'xyz': 7}, 'abc')

Expected Output: [-1]

Explanation: All characters are unmatched, so they collapse into one -1.

Input: ({'a': 1}, '')

Expected Output: []

Explanation: Empty text produces no tokens.

Input: ({}, 'aab')

Expected Output: [-1]

Explanation: With an empty vocabulary, the whole text is one unmatched run.

Hints

  1. Keep a flag that remembers whether you are currently inside an unmatched run.
  2. Any successful token match should end an unmatched run.

Part 3: Optimize longest-match tokenization for a small vocabulary

Implement the same greedy longest-match tokenizer as in Part 1, but reduce unnecessary comparisons without using a trie. Preprocess the vocabulary by grouping entries by their first character. Then, at each position in text, only compare against vocabulary entries whose first character matches text[i]. Within each group, preserve greedy longest-match behavior by trying longer strings first. If nothing matches at the current position, output -1 and advance by one character.

Constraints

  • 0 <= len(text) <= 10^4
  • 0 <= len(vocab) <= 10^3
  • All vocabulary keys are unique, non-empty strings
  • Do not use a trie
  • Token IDs are integers

Examples

Input: ({'the': 1, 'th': 2, 'he': 3, 'a': 4}, 'thea!')

Expected Output: [1, 4, -1]

Explanation: At the start, only 'the' and 'th' need to be checked because text begins with 't'. 'the' is the greedy match.

Input: ({'x': 5, 'xy': 6, 'xyz': 7, 'ab': 1}, 'xyzxaby')

Expected Output: [7, 5, 1, -1]

Explanation: 'xyz' matches first, then 'x', then 'ab', and finally 'y' is unmatched.

Input: ({'aa': 1, 'ab': 2, 'b': 3}, 'abbaa')

Expected Output: [2, 3, 1]

Explanation: 'ab' matches first, then 'b', then 'aa'. Grouping by first character avoids checking impossible candidates.

Input: ({'a': 1}, '')

Expected Output: []

Explanation: Empty text produces no tokens.

Input: ({}, 'ab')

Expected Output: [-1, -1]

Explanation: With no vocabulary entries, every character is unmatched.

Hints

  1. A vocabulary word can only match at position i if its first character equals text[i].
  2. Sort each first-character group by descending length so the first successful match is still the greedy one.
Last updated: Apr 26, 2026

Loading coding console...

PracHub

Master your tech interviews with 8,000+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.

Related Coding Questions

  • Implement a Banking System - Anthropic (medium)
  • Implement Persistent Memoization LRU Cache - Anthropic (hard)
  • Fix a Corrupted Bootloader Instruction - Anthropic (medium)
  • Implement a Time-Aware Task Manager - Anthropic (hard)
  • Implement a Simplified DNS Resolver - Anthropic (hard)