LLM Fundamentals: Tokenization Design and KL-Regularized SFT
Company: Amazon
Role: Machine Learning Engineer
Category: Machine Learning
Difficulty: medium
Interview Round: Onsite
# LLM Fundamentals: Tokenization Design and KL-Regularized SFT
You are interviewing for a research/ML role on a team that trains and fine-tunes large language models. The interviewer probes your understanding of two foundational areas: **tokenization** and the **objective used during supervised fine-tuning (SFT)**. These are "why does it work this way" questions — the interviewer wants to see whether you understand the design choices behind standard LLM tooling, not just that you can use it.
### Constraints & Assumptions
- "LLM" here means a decoder-only autoregressive Transformer trained with a next-token prediction objective.
- "Tokenization" refers to the step that converts raw text into the discrete integer ids the model consumes (e.g., subword schemes such as BPE / WordPiece / unigram / byte-level BPE).
- "SFT" refers to supervised fine-tuning: continuing to train a pretrained base model on curated (prompt, response) pairs with a token-level cross-entropy loss on the response tokens.
- Treat vocabulary size, context length, and compute budget as finite and worth economizing.
### Clarifying Questions to Ask
- What **languages and scripts** must the tokenizer cover — English-only, multilingual, code, math, emoji — and does it need to round-trip arbitrary bytes losslessly (no UNK)?
- Are we **training a tokenizer from scratch** on our own corpus, or inheriting a fixed vocabulary from the pretrained base model we are fine-tuning?
- What is the **target vocabulary size and context window**, and which is the binding constraint — embedding/softmax parameters and memory, or sequence length and attention cost?
- For SFT, is the goal pure **instruction-following capability**, or are we also worried about the model **drifting away from the base policy** (catastrophic forgetting, capability or safety regression)?
- Do we have a **frozen reference model** available at fine-tune time (the base or a prior checkpoint) to regularize against, and is the extra forward pass through it within our compute budget?
- How do we **measure success** — held-out cross-entropy / perplexity, downstream task accuracy, or human/preference evaluations — and is staying close to the reference distribution itself a requirement?
### Part 1 — Why subword tokenization
**Why do we tokenize at all, and why use subword tokenization instead of either whole words or raw characters?** Explain what problem tokenization solves for a Transformer LM and the trade-offs that lead practitioners to subword schemes (BPE, byte-level BPE, unigram).
```hint Frame it as two competing costs
A Transformer has a fixed embedding/softmax over a finite vocabulary and pays attention cost that grows with sequence length. Word-level and character-level tokenization sit at opposite ends of one trade-off — think about what each does to vocabulary size, the out-of-vocabulary problem, and sequence length, and why subword lands in the middle.
```
```hint Name the algorithms and how they differ
BPE greedily merges the most frequent adjacent pairs bottom-up; the unigram language-model tokenizer (one of the algorithms implemented in SentencePiece) starts from a large candidate set and prunes it under a probabilistic language-model objective; byte-level BPE runs BPE over raw UTF-8 bytes, so the base alphabet is 256 symbols. Be ready to say what each buys you (determinism, lossless coverage, principled segmentation).
```
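To make the "greedily merges the most frequent adjacent pairs" step concrete, here is a minimal sketch of BPE training on a toy word-frequency table. It assumes a character-level base alphabet and no pre-tokenization or end-of-word marker. The function name `train_bpe` and the toy corpus are illustrative, not taken from any particular library.

```python
from collections import Counter


def train_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` BPE merges from a {word: frequency} table (toy sketch)."""
    # Start with every word split into single characters.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count each adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Greedy step: take the most frequent pair (ties go to the pair seen first).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word with that pair fused into one symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges, vocab


toy_corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}  # illustrative counts
merges, segmented = train_bpe(toy_corpus, num_merges=3)
print(merges)
print(segmented)
```

On this toy table the first two merges fuse `e`+`s` and then `es`+`t`, so the frequent suffix `est` becomes a single symbol while rarer words stay split into smaller pieces. That is the middle ground between word-level and character-level vocabularies that the hint describes.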
### Part 2 — The 26-letter tokenizer
The interviewer asks a deliberately odd question: **"Could you tokenize using just the 26 letters of the alphabet — i.e., a tiny vocabulary of single characters? Would that work, and what breaks?"** Answer whether it *can* work in principle, then explain the practical costs and what additional issues a literal 26-letter scheme has versus a fully general character-level or byte-level tokenizer.
```hint Separate "works in principle" from "is it a good idea"
A character-level model is a valid, trainable LM — so in principle yes. The interesting answer is the cost: what does shrinking the vocabulary to ~26 symbols do to sequence length, and therefore to attention cost and the effective context you can fit?
```
```hint What does a literal 26-letter set silently drop?
Twenty-six letters cannot represent spaces, digits, punctuation, casing, newlines, non-Latin scripts, emoji, or code symbols. Contrast that with a byte-level tokenizer (256 base symbols) that round-trips any UTF-8 input losslessly with no UNK.
```
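The two hints above can be checked with a small sketch that contrasts a literal 26-letter scheme with a byte-level base vocabulary, then estimates the sequence-length penalty of single-character tokens. The helper names are hypothetical, and the figure of 4 characters per subword token is an illustrative assumption rather than a measured value.

```python
import string

LETTERS = string.ascii_lowercase  # the hypothetical 26-symbol vocabulary


def encode_26(text):
    """Map text to ids 0..25, silently dropping everything that is not a-z after lowercasing."""
    return [LETTERS.index(ch) for ch in text.lower() if ch in LETTERS]


def decode_26(ids):
    return "".join(LETTERS[i] for i in ids)


def encode_bytes(text):
    """Byte-level base vocabulary: each UTF-8 byte is an id in 0..255."""
    return list(text.encode("utf-8"))


def decode_bytes(ids):
    return bytes(ids).decode("utf-8")


def attention_cost_ratio(chars_per_subword_token):
    """Full self-attention scores all token pairs, so cost grows with length squared."""
    return chars_per_subword_token ** 2


sample = "Total: 42 caf\u00e9 \u2615"  # mixed case, digits, punctuation, accent, emoji

print(decode_26(encode_26(sample)))              # totalcaf  (lossy)
print(decode_bytes(encode_bytes(sample)) == sample)  # True  (lossless round trip)
print(len(sample), len(encode_bytes(sample)))    # 16 characters become 19 byte tokens
print(attention_cost_ratio(4))                   # 16, under the assumed 4 chars/token
```

The 26-letter round trip loses casing, spacing, digits, punctuation, the accent, and the emoji, while the byte-level round trip is exact. The last line shows why even a lossless character-level scheme is expensive: under the stated assumption, sequences are about 4 times longer and pairwise attention cost is about 16 times higher for the same text.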
### Part 3 — Adding KL to the SFT objective
Switching to fine-tuning: **In SFT, can you add a KL-divergence term to the objective? If so, against what reference, how would you add it, and why might you want to?** Be precise about which distributions the KL is between, what the combined objective looks like, and what behavior the term controls.
```hint Name the two distributions and the reference
The KL is between the *current* policy you are training and a *frozen reference* policy — typically the base/pretrained model (or the checkpoint you started SFT from). You keep a frozen copy, run both forward, and penalize divergence of the next-token distributions.
```
```hint Write the combined objective and say what β does
The loss is the usual token-level cross-entropy on the response tokens *plus* a coefficient times a per-token KL to the reference: roughly L = CE + β·KL(π_θ ‖ π_ref). Be ready to explain what raising or lowering β trades off.
```
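The combined objective in the hint can be sketched without any framework, using plain lists of logits. This is a minimal illustration under stated assumptions: both the cross-entropy and the KL term are averaged over response tokens only, the KL direction is KL(π_θ ‖ π_ref) as written above, and the function and argument names are hypothetical. In real training the reference logits would come from a frozen copy of the model with gradients disabled.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one position's vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_regularized_sft_loss(policy_logits, ref_logits, targets, response_mask, beta):
    """Return (loss, ce, kl) where loss = CE + beta * KL(pi_theta || pi_ref).

    Each argument is a per-position sequence; only positions whose mask entry
    is truthy (response tokens) contribute. This masking is an assumption.
    """
    total_ce, total_kl, count = 0.0, 0.0, 0
    for p_logits, r_logits, target, is_response in zip(
        policy_logits, ref_logits, targets, response_mask
    ):
        if not is_response:
            continue  # prompt tokens are excluded from the loss
        log_p = log_softmax(p_logits)   # current policy pi_theta
        log_q = log_softmax(r_logits)   # frozen reference pi_ref
        total_ce += -log_p[target]      # token-level cross-entropy
        # Exact per-token KL: sum over the vocabulary of p * (log p - log q).
        total_kl += sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
        count += 1
    count = max(count, 1)  # avoid division by zero on an all-prompt batch
    ce, kl = total_ce / count, total_kl / count
    return ce + beta * kl, ce, kl


# One prompt position (masked out) and one response position over a 2-token vocabulary.
loss, ce, kl = kl_regularized_sft_loss(
    policy_logits=[[0.0, 0.0], [0.0, 0.0]],
    ref_logits=[[5.0, 1.0], [0.0, math.log(3)]],
    targets=[1, 0],
    response_mask=[False, True],
    beta=0.1,
)
print(loss, ce, kl)
```

With β = 0 this reduces to ordinary SFT. Raising β penalizes next-token distributions that move away from the reference, trading some fit to the SFT data for staying closer to the base model's behavior.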
### Follow-up Questions
- Tokenization can split a number like `12345` or an identifier like `getUserId` into several pieces whose boundaries make arithmetic and code reasoning harder for the model. Why does this happen, and what tokenizer-side choices (digit-splitting, special handling of whitespace/code) help?
- Subword regularization (e.g. sampling multiple segmentations from a unigram model) is sometimes used at training time. What problem is it trying to solve, and what is the cost at inference?
- The KL term needs the reference model's next-token distribution at every step, which is a second forward pass. How would you make KL-regularized SFT cheaper or approximate the KL, and what do you lose by approximating it?
- Suppose after SFT the model follows instructions well but has clearly forgotten some pretraining knowledge. How would you use the KL coefficient β, data mixing, or the choice of reference checkpoint to diagnose and fix this, and how would you tell over-regularization (β too high) from under-regularization?
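
For the first follow-up, one tokenizer-side option is a pre-tokenization rule that runs before subword merges are learned or applied. The sketch below splits numbers into single digits and breaks camelCase identifiers at case boundaries. The regular expression is a hypothetical illustration, not the pattern used by any specific tokenizer.

```python
import re

# Hypothetical pre-tokenization pattern, tried left to right:
#   \d              one digit at a time, so numbers always split consistently
#   [A-Z]?[a-z]+    a lowercase run with an optional leading capital (camelCase pieces)
#   [A-Z]+          a run of capitals
#   \s+             a whitespace run, kept so the split is lossless
#   [^\w\s]         a single punctuation or symbol character
#   \w              fallback for any other word character (underscore, non-ASCII letters)
PRETOKEN = re.compile(r"\d|[A-Z]?[a-z]+|[A-Z]+|\s+|[^\w\s]|\w")


def pre_tokenize(text):
    """Split text into chunks; subword merges would then never cross chunk boundaries."""
    return PRETOKEN.findall(text)


pieces = pre_tokenize("x = getUserId(12345)")
print(pieces)
# ['x', ' ', '=', ' ', 'get', 'User', 'Id', '(', '1', '2', '3', '4', '5', ')']
print("".join(pieces) == "x = getUserId(12345)")  # True: no characters are dropped
```

Because every digit is its own chunk, `12345` and `12346` share the same segmentation up to the last digit instead of depending on which multi-digit strings happened to be frequent in the tokenizer's training corpus. The cost is longer sequences for digit-heavy text.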
Quick Answer: This question evaluates depth of knowledge in large language model fundamentals, specifically subword tokenization design and KL-regularized supervised fine-tuning objectives. It tests conceptual understanding of why these techniques are used in modern LLM training pipelines, and it is commonly asked in machine learning engineering interviews to assess architectural reasoning beyond surface-level API familiarity.