How do I practice coding and algorithm questions?

Use PracHub's coding console to write, test, and debug your solutions in Python or JavaScript. View hints, test against sample inputs, and compare with official solutions.

What difficulty level is this coding question?

This is a medium difficulty Coding & Algorithms question, commonly asked during Technical Screen rounds at Glean.

What role is this question designed for?

This question is commonly asked for Software Engineer candidates at Glean during technical interviews.

Implement a Byte Pair Encoding (BPE) Tokenizer

Quick Overview

This question evaluates practical understanding of tokenization algorithms and string processing, specifically the iterative merge-based approach used in modern NLP systems. It tests algorithm design, frequency counting, and the ability to implement encode/decode symmetry — skills commonly assessed in coding interviews for roles involving language model infrastructure or search systems.

Implement a Byte Pair Encoding (BPE) Tokenizer

Company: Glean

Role: Software Engineer

Category: Coding & Algorithms

Difficulty: medium

Interview Round: Technical Screen

## Implement a Byte Pair Encoding (BPE) Tokenizer Byte Pair Encoding (BPE) is a tokenization algorithm widely used to build the vocabularies of modern language models. It starts from a base vocabulary of single characters and repeatedly merges the most frequent adjacent pair of tokens into a new, single token. Your task is to implement a small BPE tokenizer that learns merge rules from a training string and can then encode and decode text using those rules. Implement a class `BPETokenizer` exposing three methods: **1. `train(text: str, threshold: int) -> None`** Learn merge rules from `text`. - Begin by treating `text` as a sequence of its individual characters. Each distinct character is a base token. - Repeat the following **merge round** until no eligible pair remains: 1. Count how many times each adjacent pair of tokens occurs in the current sequence. Counts are over **non-overlapping left-to-right occurrences**: scan the sequence from left to right, and once a position is consumed as the right element of a counted pair, the next pair starts after it (i.e., in a run like `aaaa`, the pair `(a, a)` is counted twice, not three times). 2. Find the pair with the **highest occurrence count**. If two or more pairs tie for the highest count, choose the pair that is **smallest in lexicographical order**, comparing by the pair's left token string first, then its right token string (where each token's string is the sequence of original characters it expands to). 3. If the highest count is **strictly less than `threshold`**, stop — no more merges are performed. 4. Otherwise, create a new token representing the concatenation of the chosen pair, record this merge rule (in the order it was learned), and replace every non-overlapping left-to-right occurrence of that pair in the current sequence with the new token. Then begin the next round. A new merge token's "string value" is the concatenation of the string values of its two parts. For example, merging `(a, b)` yields a token whose string value is `"ab"`; later merging `(ab, ab)` yields a token whose string value is `"abab"`. **2. `encode(text: str) -> List[int]`** Tokenize `text` into a list of integer token ids using the merge rules learned during `train`, then return the list of ids. - Start from the individual characters of `text`. - Apply the learned merge rules **in the order they were learned during training**: for each merge rule in that order, replace every non-overlapping left-to-right occurrence of that pair in the current sequence with its merged token. (Applying rules in learned order reproduces the same greedy tokenization that training would produce.) - Assign each distinct token a unique, stable integer id (e.g., in the order first encountered). Return the resulting list of token ids. **3. `decode(ids: List[int]) -> str`** Given a list of token ids produced by `encode` (using the same trained tokenizer), reconstruct and return the original string by expanding each token id back into its string value and concatenating the results. For any character in `encode`'s input that was never seen during `train`, treat it as a base token (assign it an id) so it can still be encoded and decoded; it simply participates in no merges. ### Example Training on `text = "abababa"` with `threshold = 2`: - **Round 1:** the sequence is `a b a b a b a`. Adjacent non-overlapping pairs: scanning left-to-right, positions (0,1), (2,3), (4,5) each contribute `(a, b)` — count 3; position (6) is unpaired. No other pair reaches count 1. The max-count pair is `(a, b)` with count 3 (`>= 2`). Merge it. The sequence becomes `ab ab ab a`. - **Round 2:** the sequence is `ab ab ab a`. Scanning left-to-right non-overlapping: positions (0,1) give `(ab, ab)` — count 1; position (2) is now the third `ab`, which pairs with position (3) `a` as `(ab, a)` — count 1. Both pairs tie at count 1 (`>= 2`? No — count 1 < threshold 2). The highest count is 1, which is strictly less than `threshold = 2`, so training stops. The only learned merge rule is `(a, b) -> ab`. After training, `encode("abab")` applies the single learned merge `(a, b) -> ab`, yielding `[ab, ab]`, and returns the corresponding ids; `decode` of that list returns `"abab"`. ### Requirements - `decode(encode(text))` must return `text` exactly, for any `text` (including characters unseen during training). - Tie-breaking during training is by lexicographical order of the pair's `(left_token_string, right_token_string)`, where a token's string is the original characters it expands to. Standard Python string comparison on these expanded strings gives the required order. - The merge counting and replacement use **non-overlapping, left-to-right** scanning, both in `train` and in `encode`. ### Constraints - `1 <= len(text) <= 10^4` for both `train` and `encode` inputs. - `1 <= threshold`. - `text` consists of printable ASCII characters. - The number of merge rules learned is at most a few thousand; an $O(\text{rounds} \times \text{sequence length})$ training loop is acceptable.

Quick Answer: This question evaluates practical understanding of tokenization algorithms and string processing, specifically the iterative merge-based approach used in modern NLP systems. It tests algorithm design, frequency counting, and the ability to implement encode/decode symmetry — skills commonly assessed in coding interviews for roles involving language model infrastructure or search systems.

Byte Pair Encoding (BPE) is the tokenization algorithm behind the vocabularies of modern language models. It starts from a base vocabulary of single characters and repeatedly merges the most frequent adjacent pair of tokens into a new single token. Implement a small BPE tokenizer with `train`, `encode`, and `decode`. **`train(text, threshold)`** — Treat `text` as a sequence of its individual characters (each distinct character is a base token), then repeat a **merge round** until no eligible pair remains: 1. Count each adjacent pair of tokens using **non-overlapping, left-to-right** scanning — once a position is consumed as the right element of a counted pair, the next pair starts after it (so in `aaaa`, `(a,a)` is counted twice, not three times). 2. Take the pair with the **highest count**. On ties, pick the one **smallest in lexicographical order**, comparing the pair's left token string then its right token string (a token's string is the original characters it expands to). 3. If the highest count is **strictly less than `threshold`**, stop. 4. Otherwise create a new token = concatenation of the chosen pair, record the merge rule in learned order, and replace every non-overlapping left-to-right occurrence of that pair with the new token. A merged token's string value is the concatenation of its two parts' string values (e.g. merging `(a,b)` gives `"ab"`; later merging `(ab,ab)` gives `"abab"`). **`encode(text)`** — Start from the individual characters of `text`, then apply the learned merge rules **in learned order** (each replacing every non-overlapping left-to-right occurrence of its pair). Assign each distinct token a stable integer id (in first-encountered order) and return the list of ids. Any character unseen during `train` becomes a fresh base token so it can still be encoded. **`decode(ids)`** — Expand each token id back into its string value and concatenate, reconstructing the original text. `decode(encode(text)) == text` must hold for any text. **Harness entrypoint (`solution`)** — For grading, `solution(train_text, threshold, encode_text)` trains a tokenizer on `train_text` with the given `threshold`, then returns `[merge_rules, decoded]` where `merge_rules` is the list of learned merges as `[left_string, right_string]` pairs (in learned order) and `decoded` is the round-trip `decode(encode(encode_text))`. **Example.** Training on `"abababa"` with `threshold = 2`: Round 1 the sequence is `a b a b a b a`; non-overlapping pairs give `(a,b)` count 3 (>= 2), so merge to `ab`, sequence becomes `ab ab ab a`. Round 2 the best pair count is 1 (< 2), so stop. The only learned rule is `(a,b) -> ab`, and `encode("abab")` yields `[ab, ab]` whose `decode` is `"abab"`.

Constraints

1 <= len(text) <= 10^4 for both train and encode inputs.
1 <= threshold.
text consists of printable ASCII characters.
The number of learned merge rules is at most a few thousand; an O(rounds * sequence length) training loop is acceptable.
Pair counting and replacement are non-overlapping and left-to-right in both train and encode.
Count ties are broken by the smallest (left_token_string, right_token_string) in lexicographical order.
decode(encode(text)) must equal text exactly, including characters unseen during train.

Examples

Input: ("abababa", 2, "abab")

Expected Output: [[["a", "b"]], "abab"]

Explanation: Round 1 counts (a,b)=3 (non-overlapping over a b a b a b a) which is >= 2, so merge (a,b)->ab; sequence becomes ab ab ab a. Round 2's best count is 1 < 2, so training stops with the single rule (a,b)->ab. encode('abab') -> [ab, ab], decode -> 'abab'.

Input: ("aaaa", 2, "aaaa")

Expected Output: [[["a", "a"]], "aaaa"]

Explanation: Non-overlapping counting of a a a a gives (a,a)=2 (positions (0,1) and (2,3)), >= 2, so merge (a,a)->aa; sequence becomes aa aa. The next round counts (aa,aa)=1 < 2, so it stops with one rule. Round-trip of 'aaaa' is 'aaaa'.

Input: ("abc", 1, "abc")

Expected Output: [[["a", "b"], ["ab", "c"]], "abc"]

Explanation: With threshold 1 every count-1 pair is eligible. Round 1 merges (a,b)->ab giving ab c; Round 2 merges (ab,c)->abc giving the single token abc; then no pairs remain. Two rules are learned and the round-trip is exact.

Input: ("abababa", 4, "xyz")

Expected Output: [[], "xyz"]

Explanation: The highest pair count during training is 3, which is < threshold 4, so no merges are ever recorded. encode('xyz') treats x, y, z as fresh base tokens; decode round-trips to 'xyz'. Confirms unseen characters still encode/decode.

Input: ("a", 1, "a")

Expected Output: [[], "a"]

Explanation: A single-character training string has no adjacent pairs, so no merge rule is learned regardless of threshold. The single character round-trips to itself.

Hints

Represent every token by an integer id and keep a table id -> the original-character string it expands to. A merged token's string is just the concatenation of its two parts' strings, so lexicographic tie-breaking is plain string comparison.
Counting must be non-overlapping left-to-right: walk the sequence with an index, and when you count a pair at positions (i, i+1), jump to i+2. The run 'aaaa' therefore yields count 2 for (a,a), not 3.
In each training round, scan all pair counts, keep the max count, and break ties by the smallest (left_string, right_string). Stop the moment the best count is strictly less than threshold.
encode reproduces training's greedy result by replaying the learned merges in order — for each rule, do the same non-overlapping left-to-right replacement. Any character never seen in training just becomes a fresh base token, so decode still round-trips.

Quick Overview

Implement a Byte Pair Encoding (BPE) Tokenizer

Company: Glean

Role: Software Engineer

Category: Coding & Algorithms

Difficulty: medium

Interview Round: Technical Screen

Constraints

1 <= len(text) <= 10^4 for both train and encode inputs.
1 <= threshold.
text consists of printable ASCII characters.
The number of learned merge rules is at most a few thousand; an O(rounds * sequence length) training loop is acceptable.
Pair counting and replacement are non-overlapping and left-to-right in both train and encode.
Count ties are broken by the smallest (left_token_string, right_token_string) in lexicographical order.
decode(encode(text)) must equal text exactly, including characters unseen during train.

Examples

Input: ("abababa", 2, "abab")

Expected Output: [[["a", "b"]], "abab"]

Input: ("aaaa", 2, "aaaa")

Expected Output: [[["a", "a"]], "aaaa"]

Input: ("abc", 1, "abc")

Expected Output: [[["a", "b"], ["ab", "c"]], "abc"]

Input: ("abababa", 4, "xyz")

Expected Output: [[], "xyz"]

Input: ("a", 1, "a")

Expected Output: [[], "a"]

Explanation: A single-character training string has no adjacent pairs, so no merge rule is learned regardless of threshold. The single character round-trips to itself.

Hints

Represent every token by an integer id and keep a table id -> the original-character string it expands to. A merged token's string is just the concatenation of its two parts' strings, so lexicographic tie-breaking is plain string comparison.
Counting must be non-overlapping left-to-right: walk the sequence with an index, and when you count a pair at positions (i, i+1), jump to i+2. The run 'aaaa' therefore yields count 2 for (a,a), not 3.
In each training round, scan all pair counts, keep the max count, and break ties by the smallest (left_string, right_string). Stop the moment the best count is strictly less than threshold.
encode reproduces training's greedy result by replaying the learned merges in order — for each rule, do the same non-overlapping left-to-right replacement. Any character never seen in training just becomes a fresh base token, so decode still round-trips.