What difficulty level is this interview question?

This is a medium difficulty Machine Learning question, commonly asked during Onsite rounds at Amazon.

What role is this question designed for?

This question is commonly asked for Machine Learning Engineer candidates at Amazon during technical interviews.

LLM Fundamentals: Tokenization Design and KL-Regularized SFT

This question evaluates depth of knowledge in large language model fundamentals, specifically subword tokenization design and KL-regularized supervised fine-tuning objectives. It tests conceptual understanding of why these techniques are used in modern LLM training pipelines, and it is commonly asked in machine learning engineering interviews to assess architectural reasoning beyond surface-level API familiarity.

You are interviewing for a research/ML role on a team that trains and fine-tunes large language models. The interviewer probes your understanding of two foundational areas: tokenization and the objective used during supervised fine-tuning (SFT). These are "why does it work this way" questions — the interviewer wants to see whether you understand the design choices behind standard LLM tooling, not just that you can use it.

Constraints & Assumptions

"LLM" here means a decoder-only autoregressive Transformer trained with a next-token prediction objective.
"Tokenization" refers to the step that converts raw text into the discrete integer ids the model consumes (e.g., subword schemes such as BPE / WordPiece / unigram / byte-level BPE).
"SFT" refers to supervised fine-tuning: continuing to train a pretrained base model on curated (prompt, response) pairs with a token-level cross-entropy loss on the response tokens.
Treat vocabulary size, context length, and compute budget as finite and worth economizing.
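
The SFT definition above (cross-entropy on response tokens only) can be sketched in a few lines of plain Python. This is a minimal illustration, not a training-ready implementation: the function names `log_softmax` and `sft_loss`, the list-of-lists logits layout, and the 0/1 `response_mask` convention are all assumptions made for the example.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def sft_loss(logits_per_step, target_ids, response_mask):
    """Mean next-token cross-entropy, counted only where response_mask is 1.

    Prompt positions (mask 0) contribute nothing, so the model is trained to
    produce the response, not to reproduce the prompt.
    """
    total, count = 0.0, 0
    for logits, target, is_response in zip(logits_per_step, target_ids, response_mask):
        if not is_response:
            continue
        total += -log_softmax(logits)[target]
        count += 1
    return total / max(count, 1)

# Hypothetical toy input: vocabulary of 4, one prompt position, two response positions.
logits = [[9.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
targets = [1, 2, 3]
mask = [0, 1, 1]
print(round(sft_loss(logits, targets, mask), 4))  # uniform logits over 4 tokens give log(4)
```

The prompt position has deliberately extreme logits to show that masked positions do not affect the loss.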

Clarifying Questions to Ask

What languages and scripts must the tokenizer cover — English-only, multilingual, code, math, emoji — and does it need to round-trip arbitrary bytes losslessly (no UNK)?
Are we training a tokenizer from scratch on our own corpus, or inheriting a fixed vocabulary from the pretrained base model we are fine-tuning?
What is the target vocabulary size and context window, and which is the binding constraint: embedding/softmax parameters and memory, or sequence length and attention cost?
For SFT, is the goal pure instruction-following capability, or are we also worried about the model drifting away from the base policy (catastrophic forgetting, capability or safety regression)?
Do we have a frozen reference model available at fine-tune time (the base or a prior checkpoint) to regularize against, and is the extra forward pass through it within our compute budget?
How do we measure success — held-out cross-entropy / perplexity, downstream task accuracy, or human/preference evaluations — and is staying close to the reference distribution itself a requirement?

Part 1 — Why subword tokenization

Why do we tokenize at all, and why use subword tokenization instead of either whole words or raw characters? Explain what problem tokenization solves for a Transformer LM and the trade-offs that lead practitioners to subword schemes (BPE, byte-level BPE, unigram).
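
To make the trade-off concrete, here is a toy sketch of BPE-style merge learning. It is an illustration under simplifying assumptions (whitespace-split words, an end-of-word marker `</w>`, a deterministic tie-break), and the helper names `learn_bpe`, `merge_pair`, and `encode` are hypothetical rather than any library's API. It shows the two properties usually cited for subword schemes: frequent strings collapse into few tokens, while unseen words still decompose into known pieces instead of an UNK.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the fused symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(words, num_merges):
    """Greedily learn up to `num_merges` merges from a list of words."""
    # Each word starts as its characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges

def encode(word, merges):
    """Apply the learned merges, in order, to a single word."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = learn_bpe(corpus, 10)
print(encode("newest", merges))  # a word seen in training: few tokens
print(encode("lowest", merges))  # an unseen word: still encodable from known pieces
```

With zero merges the scheme degenerates to character-level tokenization; with enough merges it approaches word-level. The merge budget is the knob that trades vocabulary size against sequence length.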

Part 2 — The 26-letter tokenizer

The interviewer asks a deliberately odd question: "Could you tokenize using just the 26 letters of the alphabet — i.e., a tiny vocabulary of single characters? Would that work, and what breaks?" Answer whether it can work in principle, then explain the practical costs and what additional issues a literal 26-letter scheme has versus a fully general character-level or byte-level tokenizer.
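
A quick way to see what a literal 26-letter scheme loses, compared with a byte-level one, is to round-trip the same string through both. The sketch below is illustrative only: the function names are hypothetical, and lowercasing before lookup is an assumption about how such a tokenizer would treat capitals.

```python
import string

ALPHABET = string.ascii_lowercase
TOKEN_ID = {ch: i for i, ch in enumerate(ALPHABET)}

def encode_26(text):
    """Literal 26-letter tokenizer: lowercase, then keep only a-z. Everything else is dropped."""
    return [TOKEN_ID[ch] for ch in text.lower() if ch in TOKEN_ID]

def decode_26(ids):
    return "".join(ALPHABET[i] for i in ids)

def encode_bytes(text):
    """Byte-level tokenizer: 256 possible ids, covers any UTF-8 text."""
    return list(text.encode("utf-8"))

def decode_bytes(ids):
    return bytes(ids).decode("utf-8")

sample = "Hello, World 42!"
print(decode_26(encode_26(sample)))        # helloworld
print(decode_bytes(encode_bytes(sample)))  # Hello, World 42!
print(encode_26("The cat") == encode_26("thecat"))  # True
```

Case, spaces, digits, and punctuation vanish in the 26-letter version, and distinct inputs collide on the same id sequence, so the mapping is not invertible. The byte-level version round-trips exactly but, like any character-level scheme, pays with one token per byte and therefore much longer sequences than a subword vocabulary.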

Part 3 — Adding KL to the SFT objective

Switching to fine-tuning: In SFT, can you add a KL-divergence term to the objective? If so, against what reference, how would you add it, and why might you want to? Be precise about which distributions the KL is between, what the combined objective looks like, and what behavior the term controls.
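
One common way to write such a combined objective is cross-entropy on the response tokens plus β times a per-position KL between the fine-tuned policy and a frozen reference model. The sketch below assumes the direction KL(policy || reference), computed over the full vocabulary at each response position; the reverse direction is also a legitimate choice with different behavior. The function names, the `beta` parameter name, and the list-based logits layout are assumptions for illustration.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def kl_divergence(logp, logq):
    """KL(p || q) for two log-probability vectors over the same vocabulary."""
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(logp, logq))

def kl_regularized_sft_loss(policy_logits, ref_logits, target_ids, response_mask, beta):
    """Mean over response positions of: CE(target) + beta * KL(policy || reference).

    ref_logits come from a frozen reference model (assumed here to be the
    pretrained base); no gradient would flow through them in a real trainer.
    """
    total, count = 0.0, 0
    for pol, ref, target, is_response in zip(
        policy_logits, ref_logits, target_ids, response_mask
    ):
        if not is_response:
            continue
        logp = log_softmax(pol)   # fine-tuned policy
        logq = log_softmax(ref)   # frozen reference
        ce = -logp[target]
        kl = kl_divergence(logp, logq)
        total += ce + beta * kl
        count += 1
    return total / max(count, 1)

# Hypothetical toy input: vocabulary of 3, two response positions.
policy = [[2.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
reference = [[0.0, 0.0, 0.0], [0.5, 0.5, 0.5]]
targets, mask = [0, 1], [1, 1]
print(kl_regularized_sft_loss(policy, reference, targets, mask, beta=0.0))  # plain SFT
print(kl_regularized_sft_loss(policy, reference, targets, mask, beta=0.1))  # with KL penalty
```

Setting `beta` to 0 recovers plain SFT. Larger values pull the policy's next-token distribution toward the reference, which is the mechanism for limiting drift from the base model.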

Follow-up Questions

Tokenization can split a number like 12345 or an identifier like getUserId into several pieces, in ways that make arithmetic and code reasoning harder for the model. Why does this happen, and what tokenizer-side choices (digit-splitting, special handling of whitespace/code) help?
Subword regularization (e.g. sampling multiple segmentations from a unigram model) is sometimes used at training time. What problem is it trying to solve, and what is the cost at inference?
The KL term needs the reference model's next-token distribution at every step, which is a second forward pass. How would you make KL-regularized SFT cheaper or approximate the KL, and what do you lose by approximating it?
Suppose after SFT the model follows instructions well but has clearly forgotten some pretraining knowledge. How would you use the KL coefficient β, data mixing, or the choice of reference checkpoint to diagnose and fix this, and how would you tell over-regularization (β too high) from under-regularization?
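
As a small illustration of the digit-splitting idea raised in the first follow-up, the sketch below shows a pre-tokenization step that isolates every digit before any subword merges are applied, so a number is always segmented the same way regardless of corpus frequency. The regular expression and the function name `pre_tokenize` are assumptions for the example, not a specific tokenizer's rule.

```python
import re

# Order matters: single digits first, then runs of ASCII letters,
# then runs of whitespace, then any other single character.
_PATTERN = re.compile(r"\d|[A-Za-z]+|\s+|.")

def pre_tokenize(text):
    """Split text into chunks; each digit becomes its own chunk. Lossless."""
    return _PATTERN.findall(text)

pieces = pre_tokenize("id 12345")
print(pieces)                          # ['id', ' ', '1', '2', '3', '4', '5']
print("".join(pieces) == "id 12345")   # True
```

Subword merges would then be learned within each chunk only, so no merge can ever fuse two digits into one token. The cost is longer sequences for numeric text.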
