Explain Top-k and Top-p Decoding
Company: Cohere
Role: Machine Learning Engineer
Category: Machine Learning
Difficulty: medium
Interview Round: Onsite
In autoregressive text generation, a language model produces a vector of logits for the next token at each step. These logits are converted to a probability distribution over the vocabulary (typically via a temperature-scaled softmax), and a *decoding strategy* chooses which token to emit. **Greedy decoding** always selects the single most likely token. **Stochastic** strategies instead sample from a *restricted* subset of likely tokens, trading off coherence against diversity.
Two of the most common truncation-based sampling strategies are **top-k decoding** (keep the $k$ highest-probability tokens, renormalize, then sample) and **top-p / nucleus decoding** (keep the smallest set of tokens whose cumulative probability reaches a threshold $p$, renormalize, then sample).
You are interviewing for an ML role. Answer the following about these two strategies, with concrete reasoning about *why* each behavior occurs.
### Constraints & Assumptions
- Focus on the **next-token sampling step** of an already-trained autoregressive LM; you do not need to discuss training, beam search, or model architecture.
- Assume softmax probabilities $p_1 \ge p_2 \ge \dots \ge p_V$ over a vocabulary of size $V$, optionally after temperature scaling.
- Where helpful, reason about two regimes: a **peaked** (low-entropy / confident) distribution and a **flat** (high-entropy / uncertain) distribution.
### Clarifying Questions to Ask
- Is temperature applied before or after the truncation step, and is it held fixed across the comparison? (It changes the effective sharpness of the distribution and interacts with both $k$ and $p$.)
- Are we evaluating for a single task/style (e.g., factual QA vs. open-ended creative writing), or comparing across tasks? The "right" decoding choice is task-dependent.
- Should the answer assume well-calibrated probabilities, or account for the fact that the model's tail probabilities may be unreliable?
- Are top-k and top-p meant to be used in isolation, or can they be combined (and with temperature)?
### Part 1
What are the main flaws of **top-k decoding**?
```hint Where to start
The key property of top-k is that $k$ is a **fixed, distribution-independent** count. Ask what goes wrong when the *shape* of the distribution changes from step to step but $k$ does not.
```
```hint Two failure regimes
Contrast a sharply **peaked** distribution (one or two dominant tokens) with a nearly **flat** one. In which regime does a fixed $k$ admit junk tail tokens, and in which does it cut off plausible ones?
```
#### What This Part Should Cover
- That a fixed $k$ is insensitive to the entropy/shape of the per-step distribution.
- The peaked-distribution failure: $k$ too large admits low-probability tail tokens that can still be sampled after renormalization.
- The flat-distribution failure: $k$ too small excludes many near-equally-plausible tokens, reducing diversity.
- The arbitrary, hard cutoff at the boundary (the $k$-th vs. $(k{+}1)$-th token may be nearly tied) and the need to tune $k$ per model/task/temperature.
### Part 2
What is **top-p (nucleus) decoding**? Describe the algorithm precisely.
```hint Mechanism
Instead of a fixed *count*, top-p fixes a cumulative *probability mass*. Sort tokens by probability and walk down until the running sum first reaches $p$ — that prefix is the "nucleus."
```
#### What This Part Should Cover
- The precise per-step procedure: softmax → sort descending → take the smallest prefix whose cumulative mass $\ge p$ → renormalize over that nucleus → sample.
- That the nucleus size is **dynamic** and varies step to step.
- The role of $p$ (e.g., $p = 0.9$) and how the renormalization step works.
- A correct worked intuition (small example) showing which tokens fall inside vs. outside the nucleus.
### Part 3
How does top-p decoding **address some of the weaknesses of top-k** decoding?
```hint Link back to Part 1
Map each fixed-$k$ failure from Part 1 onto what top-p does in the same regime. What happens to the nucleus size when the distribution is peaked vs. flat?
```
#### What This Part Should Cover
- That top-p adapts the candidate-set size to the model's per-step uncertainty (small nucleus when peaked, large when flat).
- How this fixes the peaked-distribution junk-token problem and the flat-distribution truncation problem from Part 1.
- An honest statement of what top-p does **not** fix (still a hyperparameter; still sensitive to miscalibrated tail probabilities; a hard cutoff at the mass boundary remains).
### What a Strong Answer Covers
Across all parts, a strong candidate connects the mechanics to behavior rather than reciting definitions. Specifically:
- Frames both methods as **truncation + renormalize + sample**, differing only in *how* the truncation set is chosen (fixed count vs. cumulative mass).
- Reasons explicitly about the **peaked vs. flat** regimes and uses them consistently across all three parts.
- Distinguishes "probabilities are arbitrary cut" (both methods) from "the cut location adapts to the distribution" (only top-p).
- Notes the interaction with **temperature** and **calibration**, and avoids overclaiming that top-p is strictly superior.
### Follow-up Questions
- How do **temperature** and top-p interact? If you raise temperature, what should you do to $p$ to keep generation quality roughly stable?
- Many systems combine top-k and top-p (apply both filters). Why might you want both, and in what order would you apply them?
- Top-p still has a hard cutoff at the cumulative-mass boundary. What alternative (e.g., min-p or typical/locally-typical sampling, or temperature-only) might address the residual "arbitrary boundary" problem, and what is its tradeoff?
- How would you empirically choose $p$ for a factual question-answering system versus an open-ended story generator, and what failure modes would you watch for at each extreme ($p \to 0$ and $p \to 1$)?
Quick Answer: This question evaluates understanding of autoregressive decoding strategies in probabilistic language models, assessing competency in reasoning about sampling-based token selection, probability distribution behavior, and trade-offs between diversity and likelihood.