
LLM Fundamentals: Tokenization Design and KL-Regularized SFT

Last updated: Jun 24, 2026

Quick Overview

This question evaluates depth of knowledge in large language model fundamentals, specifically subword tokenization design and KL-regularized supervised fine-tuning objectives. It tests conceptual understanding of why these techniques are used in modern LLM training pipelines, and it is commonly asked in machine learning engineering interviews to assess architectural reasoning beyond surface-level API familiarity.


Company: Amazon

Role: Machine Learning Engineer

Category: Machine Learning

Difficulty: medium

Interview Round: Onsite

Hints

Part 1 (why subword tokenization)

  • Frame it as two competing costs. A Transformer has a fixed embedding/softmax over a finite vocabulary and pays attention cost that grows with sequence length. Word-level and character-level tokenization sit at opposite ends of one trade-off: think about what each does to vocabulary size, the out-of-vocabulary problem, and sequence length, and why subword lands in the middle.
  • Name the algorithms and how they differ. BPE greedily merges the most frequent adjacent pairs bottom-up; unigram (SentencePiece) starts from a large candidate set and prunes via a probabilistic language-model objective; byte-level BPE runs BPE over raw UTF-8 bytes so the base alphabet is 256 symbols. Be ready to say what each buys you (determinism, lossless coverage, principled segmentation).

Part 2 (the 26-letter tokenizer)

  • Separate "works in principle" from "is it a good idea". A character-level model is a valid, trainable LM, so in principle yes. The interesting answer is the cost: what does shrinking the vocabulary to ~26 symbols do to sequence length, and therefore to attention cost and the effective context you can fit?
  • What does a literal 26-letter set silently drop? Twenty-six letters cannot represent spaces, digits, punctuation, casing, newlines, non-Latin scripts, emoji, or code symbols. Contrast that with a byte-level tokenizer (256 base symbols) that round-trips any UTF-8 input losslessly with no UNK.

Part 3 (adding KL to the SFT objective)

  • Name the two distributions and the reference. The KL is between the current policy you are training and a frozen reference policy, typically the base/pretrained model (or the checkpoint you started SFT from). You keep a frozen copy, run both forward, and penalize divergence of the next-token distributions.
  • Write the combined objective and say what β does. The loss is the usual token-level cross-entropy on the response tokens plus a coefficient times a per-token KL to the reference: roughly L = CE + β·KL(π_θ ‖ π_ref). Be ready to explain what raising or lowering β trades off.

Jun 18, 2026, 12:00 AM

You are interviewing for a research/ML role on a team that trains and fine-tunes large language models. The interviewer probes your understanding of two foundational areas: tokenization and the objective used during supervised fine-tuning (SFT). These are "why does it work this way" questions — the interviewer wants to see whether you understand the design choices behind standard LLM tooling, not just that you can use it.

Constraints & Assumptions

  • "LLM" here means a decoder-only autoregressive Transformer trained with a next-token prediction objective.
  • "Tokenization" refers to the step that converts raw text into the discrete integer ids the model consumes (e.g., subword schemes such as BPE / WordPiece / unigram / byte-level BPE).
  • "SFT" refers to supervised fine-tuning: continuing to train a pretrained base model on curated (prompt, response) pairs with a token-level cross-entropy loss on the response tokens.
  • Treat vocabulary size, context length, and compute budget as finite and worth economizing.
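
The SFT definition above (token-level cross-entropy on the response tokens only) can be made concrete with a small sketch. This is plain Python rather than any framework's API; the function names and the `loss_mask` convention (1 for response positions, 0 for prompt positions) are assumptions made for illustration.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def response_only_cross_entropy(logits, targets, loss_mask):
    """Mean next-token cross-entropy over positions where loss_mask is 1.

    logits:    one list of vocabulary logits per position
    targets:   the gold next-token id at each position
    loss_mask: 1 for response tokens, 0 for prompt tokens (assumed convention)
    """
    total, count = 0.0, 0
    for pos_logits, target, keep in zip(logits, targets, loss_mask):
        if keep:
            total -= log_softmax(pos_logits)[target]
            count += 1
    return total / max(count, 1)  # avoid dividing by zero if nothing is kept

# Toy example: vocabulary of 3 ids, 4 positions; the first two are prompt tokens.
logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
targets = [0, 1, 2, 1]
loss_mask = [0, 0, 1, 1]
loss = response_only_cross_entropy(logits, targets, loss_mask)
print(f"response-only CE: {loss:.4f}")
```

Masking the prompt means the model is only graded on what it should produce, not on reproducing the instruction it was given.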

Clarifying Questions to Ask

  • What languages and scripts must the tokenizer cover — English-only, multilingual, code, math, emoji — and does it need to round-trip arbitrary bytes losslessly (no UNK)?
  • Are we training a tokenizer from scratch on our own corpus, or inheriting a fixed vocabulary from the pretrained base model we are fine-tuning?
  • What is the target vocabulary size and context window, and which is the binding constraint: embedding/softmax parameters and memory, or sequence length and attention cost?
  • For SFT, is the goal pure instruction-following capability, or are we also worried about the model drifting away from the base policy (catastrophic forgetting, capability or safety regression)?
  • Do we have a frozen reference model available at fine-tune time (the base or a prior checkpoint) to regularize against, and is the extra forward pass through it within our compute budget?
  • How do we measure success — held-out cross-entropy / perplexity, downstream task accuracy, or human/preference evaluations — and is staying close to the reference distribution itself a requirement?
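
The "binding constraint" question above can be explored with back-of-the-envelope arithmetic. The sketch below counts embedding/softmax parameters against the number of query-key pairs a causal attention layer scores. The model width and the two scenarios (a document that is 4,000 bytes long and is assumed to compress to about 1,000 subword tokens) are hypothetical numbers chosen only for illustration.

```python
def embedding_params(vocab_size, d_model, tied=True):
    """Parameters in the input embedding plus the output softmax projection."""
    table = vocab_size * d_model
    return table if tied else 2 * table

def causal_attention_pairs(seq_len):
    """Number of (query, key) pairs one causal self-attention layer scores."""
    return seq_len * (seq_len + 1) // 2

d_model = 1024  # hypothetical model width

# name -> (vocabulary size, tokens needed for the same document); both hypothetical
scenarios = {
    "byte-level, V=256": (256, 4000),
    "subword, V=32000": (32_000, 1000),
}

for name, (vocab_size, seq_len) in scenarios.items():
    params = embedding_params(vocab_size, d_model)
    pairs = causal_attention_pairs(seq_len)
    print(f"{name}: {params:,} embedding params, {pairs:,} attention pairs per layer")
```

Under these assumed numbers the small vocabulary saves embedding parameters once, while the longer sequence is paid for in every attention layer on every document.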

Part 1 — Why subword tokenization

Why do we tokenize at all, and why use subword tokenization instead of either whole words or raw characters? Explain what problem tokenization solves for a Transformer LM and the trade-offs that lead practitioners to subword schemes (BPE, byte-level BPE, unigram).
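
To ground the discussion of BPE, here is a minimal sketch of how greedy pair merging learns a subword vocabulary from word counts. It is a simplification under stated assumptions: no end-of-word marker, no pre-tokenization, and ties broken by pair order purely to keep the toy deterministic. The function names and the toy corpus are illustrative, not taken from any library.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe_merges(word_counts, num_merges):
    """Learn an ordered list of merges from a {word: count} dictionary."""
    # Start from single characters.
    corpus = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        # Greedy step: most frequent adjacent pair (ties broken by pair order).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        corpus = {merge_pair(s, best): c for s, c in corpus.items()}
    return merges

def encode(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

# Toy corpus (illustrative counts).
word_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_bpe_merges(word_counts, num_merges=4)
print(merges)
print(encode("lowest", merges))
```

The word `lowest` never appears in the toy corpus, yet it is segmented into pieces learned from other words, which is how subword schemes avoid an out-of-vocabulary token while keeping sequences shorter than raw characters.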

Part 2 — The 26-letter tokenizer

The interviewer asks a deliberately odd question: "Could you tokenize using just the 26 letters of the alphabet — i.e., a tiny vocabulary of single characters? Would that work, and what breaks?" Answer whether it can work in principle, then explain the practical costs and what additional issues a literal 26-letter scheme has versus a fully general character-level or byte-level tokenizer.
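
A short sketch can make the contrast between a literal 26-letter scheme and a byte-level scheme concrete. The encoders below are illustrative assumptions, not a real tokenizer: the 26-letter one lowercases and drops anything outside a-z, while the byte-level one emits one id per UTF-8 byte.

```python
import string

ALPHABET = string.ascii_lowercase  # the literal 26-symbol vocabulary

def encode_26(text):
    """Lossy: lowercases, then silently drops every character outside a-z."""
    return [ALPHABET.index(ch) for ch in text.lower() if ch in ALPHABET]

def decode_26(ids):
    return "".join(ALPHABET[i] for i in ids)

def encode_bytes(text):
    """Lossless: one id in 0..255 per UTF-8 byte."""
    return list(text.encode("utf-8"))

def decode_bytes(ids):
    return bytes(ids).decode("utf-8")

text = "Pay $12.50 to Jos\u00e9 by 5pm!"
ids_26 = encode_26(text)
ids_bytes = encode_bytes(text)

print("26-letter round trip:", repr(decode_26(ids_26)))
print("byte-level round trip:", repr(decode_bytes(ids_bytes)))
print("lengths:", len(text.split()), "words vs", len(ids_bytes), "byte tokens")
```

The 26-letter round trip loses the amount, the time, the spacing, the casing, and the accented letter, whereas the byte-level round trip returns the input unchanged at the price of a much longer sequence than word or subword units.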

Part 3 — Adding KL to the SFT objective

Switching to fine-tuning: In SFT, can you add a KL-divergence term to the objective? If so, against what reference, how would you add it, and why might you want to? Be precise about which distributions the KL is between, what the combined objective looks like, and what behavior the term controls.
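
One common way to write the combined objective is cross-entropy on the response tokens plus a coefficient times a per-token KL to a frozen reference, roughly L = CE + β·KL(π_θ ‖ π_ref). The sketch below only evaluates that quantity on toy logits in plain Python; in a real training loop an autodiff framework would supply gradients and the reference logits would come from a frozen copy of the model. Function names and the toy numbers are assumptions for illustration.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def kl_divergence(policy_logits, ref_logits):
    """KL(pi_theta || pi_ref) at one position: sum_v p(v) * (log p(v) - log q(v))."""
    log_p = log_softmax(policy_logits)
    log_q = log_softmax(ref_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def kl_regularized_sft_loss(policy_logits, ref_logits, targets, loss_mask, beta):
    """Return (CE + beta * KL, CE, KL), each averaged over masked-in positions."""
    ce_total, kl_total, count = 0.0, 0.0, 0
    for p, r, target, keep in zip(policy_logits, ref_logits, targets, loss_mask):
        if keep:
            ce_total -= log_softmax(p)[target]
            kl_total += kl_divergence(p, r)  # reference is treated as a constant
            count += 1
    n = max(count, 1)
    ce, kl = ce_total / n, kl_total / n
    return ce + beta * kl, ce, kl

# Toy example: 2 response positions, vocabulary of 2 ids.
policy_logits = [[0.0, 0.0], [2.0, 0.0]]
ref_logits = [[0.0, 0.0], [0.0, 0.0]]
targets = [0, 0]
loss_mask = [1, 1]
loss, ce, kl = kl_regularized_sft_loss(policy_logits, ref_logits, targets, loss_mask, beta=0.1)
print(f"CE={ce:.4f}  KL={kl:.4f}  total={loss:.4f}")
```

With β = 0 this reduces to plain SFT; raising β pulls the policy's next-token distributions back toward the reference, trading fit to the fine-tuning data for less drift from the base model.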

Follow-up Questions

  • Tokenization can split a number like 12345 or an identifier like getUserId into several pieces in ways that are hard for the model to do arithmetic or code over. Why does this happen, and what tokenizer-side choices (digit-splitting, special handling of whitespace/code) help?
  • Subword regularization (e.g. sampling multiple segmentations from a unigram model) is sometimes used at training time. What problem is it trying to solve, and what is the cost at inference?
  • The KL term needs the reference model's next-token distribution at every step, which is a second forward pass. How would you make KL-regularized SFT cheaper or approximate the KL, and what do you lose by approximating it?
  • Suppose after SFT the model follows instructions well but has clearly forgotten some pretraining knowledge. How would you use the KL coefficient β, data mixing, or the choice of reference checkpoint to diagnose and fix this, and how would you tell over-regularization (β too high) from under-regularization?
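
For the first follow-up, one tokenizer-side choice is a pre-tokenization pass that isolates every digit and splits identifiers at camelCase boundaries before any subword merges are learned. The regular expression below is a hypothetical rule written for illustration, not the rule used by any particular tokenizer.

```python
import re

# Hypothetical pre-tokenization rule, tried left to right:
#   one digit | optional capital + lowercase run | capital run (acronym)
#   | whitespace run | any single other character
_PIECE = re.compile(r"\d|[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\s+|[^\sA-Za-z\d]")

def pre_tokenize(text):
    """Split text into pieces; subword merges would then be learned inside each piece."""
    return _PIECE.findall(text)

pieces = pre_tokenize("getUserId 12345")
print(pieces)

# Every character falls into exactly one piece, so joining restores the input.
assert "".join(pieces) == "getUserId 12345"
```

Splitting digits one by one gives the model a consistent view of numbers regardless of how often a particular digit string appeared in the tokenizer's training corpus, at the cost of longer sequences for numeric text.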
