PracHub
QuestionsPremiumCoachesLearningGuidesInterview Prep
|Home/Machine Learning/Cohere

Explain Top-k and Top-p Decoding

Last updated: Jun 24, 2026

Quick Overview

This question evaluates understanding of autoregressive decoding strategies in probabilistic language models, assessing competency in reasoning about sampling-based token selection, probability distribution behavior, and trade-offs between diversity and likelihood.

  • medium
  • Cohere
  • Machine Learning
  • Machine Learning Engineer

Explain Top-k and Top-p Decoding

Company: Cohere

Role: Machine Learning Engineer

Category: Machine Learning

Difficulty: medium

Interview Round: Onsite

In autoregressive text generation, a language model produces a vector of logits for the next token at each step. These logits are converted to a probability distribution over the vocabulary (typically via a temperature-scaled softmax), and a *decoding strategy* chooses which token to emit. **Greedy decoding** always selects the single most likely token. **Stochastic** strategies instead sample from a *restricted* subset of likely tokens, trading off coherence against diversity. Two of the most common truncation-based sampling strategies are **top-k decoding** (keep the $k$ highest-probability tokens, renormalize, then sample) and **top-p / nucleus decoding** (keep the smallest set of tokens whose cumulative probability reaches a threshold $p$, renormalize, then sample). You are interviewing for an ML role. Answer the following about these two strategies, with concrete reasoning about *why* each behavior occurs. ### Constraints & Assumptions - Focus on the **next-token sampling step** of an already-trained autoregressive LM; you do not need to discuss training, beam search, or model architecture. - Assume softmax probabilities $p_1 \ge p_2 \ge \dots \ge p_V$ over a vocabulary of size $V$, optionally after temperature scaling. - Where helpful, reason about two regimes: a **peaked** (low-entropy / confident) distribution and a **flat** (high-entropy / uncertain) distribution. ### Clarifying Questions to Ask - Is temperature applied before or after the truncation step, and is it held fixed across the comparison? (It changes the effective sharpness of the distribution and interacts with both $k$ and $p$.) - Are we evaluating for a single task/style (e.g., factual QA vs. open-ended creative writing), or comparing across tasks? The "right" decoding choice is task-dependent. - Should the answer assume well-calibrated probabilities, or account for the fact that the model's tail probabilities may be unreliable? - Are top-k and top-p meant to be used in isolation, or can they be combined (and with temperature)? ### Part 1 What are the main flaws of **top-k decoding**? ```hint Where to start The key property of top-k is that $k$ is a **fixed, distribution-independent** count. Ask what goes wrong when the *shape* of the distribution changes from step to step but $k$ does not. ``` ```hint Two failure regimes Contrast a sharply **peaked** distribution (one or two dominant tokens) with a nearly **flat** one. In which regime does a fixed $k$ admit junk tail tokens, and in which does it cut off plausible ones? ``` #### What This Part Should Cover - That a fixed $k$ is insensitive to the entropy/shape of the per-step distribution. - The peaked-distribution failure: $k$ too large admits low-probability tail tokens that can still be sampled after renormalization. - The flat-distribution failure: $k$ too small excludes many near-equally-plausible tokens, reducing diversity. - The arbitrary, hard cutoff at the boundary (the $k$-th vs. $(k{+}1)$-th token may be nearly tied) and the need to tune $k$ per model/task/temperature. ### Part 2 What is **top-p (nucleus) decoding**? Describe the algorithm precisely. ```hint Mechanism Instead of a fixed *count*, top-p fixes a cumulative *probability mass*. Sort tokens by probability and walk down until the running sum first reaches $p$ — that prefix is the "nucleus." ``` #### What This Part Should Cover - The precise per-step procedure: softmax → sort descending → take the smallest prefix whose cumulative mass $\ge p$ → renormalize over that nucleus → sample. - That the nucleus size is **dynamic** and varies step to step. - The role of $p$ (e.g., $p = 0.9$) and how the renormalization step works. - A correct worked intuition (small example) showing which tokens fall inside vs. outside the nucleus. ### Part 3 How does top-p decoding **address some of the weaknesses of top-k** decoding? ```hint Link back to Part 1 Map each fixed-$k$ failure from Part 1 onto what top-p does in the same regime. What happens to the nucleus size when the distribution is peaked vs. flat? ``` #### What This Part Should Cover - That top-p adapts the candidate-set size to the model's per-step uncertainty (small nucleus when peaked, large when flat). - How this fixes the peaked-distribution junk-token problem and the flat-distribution truncation problem from Part 1. - An honest statement of what top-p does **not** fix (still a hyperparameter; still sensitive to miscalibrated tail probabilities; a hard cutoff at the mass boundary remains). ### What a Strong Answer Covers Across all parts, a strong candidate connects the mechanics to behavior rather than reciting definitions. Specifically: - Frames both methods as **truncation + renormalize + sample**, differing only in *how* the truncation set is chosen (fixed count vs. cumulative mass). - Reasons explicitly about the **peaked vs. flat** regimes and uses them consistently across all three parts. - Distinguishes "probabilities are arbitrary cut" (both methods) from "the cut location adapts to the distribution" (only top-p). - Notes the interaction with **temperature** and **calibration**, and avoids overclaiming that top-p is strictly superior. ### Follow-up Questions - How do **temperature** and top-p interact? If you raise temperature, what should you do to $p$ to keep generation quality roughly stable? - Many systems combine top-k and top-p (apply both filters). Why might you want both, and in what order would you apply them? - Top-p still has a hard cutoff at the cumulative-mass boundary. What alternative (e.g., min-p or typical/locally-typical sampling, or temperature-only) might address the residual "arbitrary boundary" problem, and what is its tradeoff? - How would you empirically choose $p$ for a factual question-answering system versus an open-ended story generator, and what failure modes would you watch for at each extreme ($p \to 0$ and $p \to 1$)?

Quick Answer: This question evaluates understanding of autoregressive decoding strategies in probabilistic language models, assessing competency in reasoning about sampling-based token selection, probability distribution behavior, and trade-offs between diversity and likelihood.

Related Interview Questions

  • Explain LLM training, RL, and evaluation - Cohere (medium)
Cohere logo
Cohere
May 23, 2026, 12:00 AM
Machine Learning Engineer
Onsite
Machine Learning
1
0

In autoregressive text generation, a language model produces a vector of logits for the next token at each step. These logits are converted to a probability distribution over the vocabulary (typically via a temperature-scaled softmax), and a decoding strategy chooses which token to emit. Greedy decoding always selects the single most likely token. Stochastic strategies instead sample from a restricted subset of likely tokens, trading off coherence against diversity.

Two of the most common truncation-based sampling strategies are top-k decoding (keep the kkk highest-probability tokens, renormalize, then sample) and top-p / nucleus decoding (keep the smallest set of tokens whose cumulative probability reaches a threshold ppp, renormalize, then sample).

You are interviewing for an ML role. Answer the following about these two strategies, with concrete reasoning about why each behavior occurs.

Constraints & Assumptions

  • Focus on the next-token sampling step of an already-trained autoregressive LM; you do not need to discuss training, beam search, or model architecture.
  • Assume softmax probabilities p1≥p2≥⋯≥pVp_1 \ge p_2 \ge \dots \ge p_Vp1​≥p2​≥⋯≥pV​ over a vocabulary of size VVV , optionally after temperature scaling.
  • Where helpful, reason about two regimes: a peaked (low-entropy / confident) distribution and a flat (high-entropy / uncertain) distribution.

Clarifying Questions to Ask

  • Is temperature applied before or after the truncation step, and is it held fixed across the comparison? (It changes the effective sharpness of the distribution and interacts with both kkk and ppp .)
  • Are we evaluating for a single task/style (e.g., factual QA vs. open-ended creative writing), or comparing across tasks? The "right" decoding choice is task-dependent.
  • Should the answer assume well-calibrated probabilities, or account for the fact that the model's tail probabilities may be unreliable?
  • Are top-k and top-p meant to be used in isolation, or can they be combined (and with temperature)?

Part 1

What are the main flaws of top-k decoding?

What This Part Should Cover

  • That a fixed kkk is insensitive to the entropy/shape of the per-step distribution.
  • The peaked-distribution failure: kkk too large admits low-probability tail tokens that can still be sampled after renormalization.
  • The flat-distribution failure: kkk too small excludes many near-equally-plausible tokens, reducing diversity.
  • The arbitrary, hard cutoff at the boundary (the kkk -th vs. (k+1)(k{+}1)(k+1) -th token may be nearly tied) and the need to tune kkk per model/task/temperature.

Part 2

What is top-p (nucleus) decoding? Describe the algorithm precisely.

What This Part Should Cover

  • The precise per-step procedure: softmax → sort descending → take the smallest prefix whose cumulative mass ≥p\ge p≥p → renormalize over that nucleus → sample.
  • That the nucleus size is dynamic and varies step to step.
  • The role of ppp (e.g., p=0.9p = 0.9p=0.9 ) and how the renormalization step works.
  • A correct worked intuition (small example) showing which tokens fall inside vs. outside the nucleus.

Part 3

How does top-p decoding address some of the weaknesses of top-k decoding?

What This Part Should Cover

  • That top-p adapts the candidate-set size to the model's per-step uncertainty (small nucleus when peaked, large when flat).
  • How this fixes the peaked-distribution junk-token problem and the flat-distribution truncation problem from Part 1.
  • An honest statement of what top-p does not fix (still a hyperparameter; still sensitive to miscalibrated tail probabilities; a hard cutoff at the mass boundary remains).

What a Strong Answer Covers

Across all parts, a strong candidate connects the mechanics to behavior rather than reciting definitions. Specifically:

  • Frames both methods as truncation + renormalize + sample , differing only in how the truncation set is chosen (fixed count vs. cumulative mass).
  • Reasons explicitly about the peaked vs. flat regimes and uses them consistently across all three parts.
  • Distinguishes "probabilities are arbitrary cut" (both methods) from "the cut location adapts to the distribution" (only top-p).
  • Notes the interaction with temperature and calibration , and avoids overclaiming that top-p is strictly superior.

Follow-up Questions

  • How do temperature and top-p interact? If you raise temperature, what should you do to ppp to keep generation quality roughly stable?
  • Many systems combine top-k and top-p (apply both filters). Why might you want both, and in what order would you apply them?
  • Top-p still has a hard cutoff at the cumulative-mass boundary. What alternative (e.g., min-p or typical/locally-typical sampling, or temperature-only) might address the residual "arbitrary boundary" problem, and what is its tradeoff?
  • How would you empirically choose ppp for a factual question-answering system versus an open-ended story generator, and what failure modes would you watch for at each extreme ( p→0p \to 0p→0 and p→1p \to 1p→1 )?

Solution

Show

Submit Your Answer to Earn 20XP

Sign in to leave a comment

Loading comments...

Browse More Questions

More Machine Learning•More Cohere•More Machine Learning Engineer•Cohere Machine Learning Engineer•Cohere Machine Learning•Machine Learning Engineer Machine Learning
PracHub

Master your tech interviews with 8,000+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.