How do I approach Machine Learning interview questions?

Machine Learning questions require understanding of core concepts and practice. PracHub provides solutions with explanations to help you master machine learning interviews.

What difficulty level is this interview question?

This is a medium difficulty Machine Learning question, commonly asked during Onsite rounds at Cohere.

What role is this question designed for?

This question is commonly asked for Machine Learning Engineer candidates at Cohere during technical interviews.

Explain Top-k and Top-p Decoding | Cohere Interview Question

Q: Explain Top-k and Top-p Decoding

This question evaluates understanding of autoregressive decoding strategies in probabilistic language models, assessing competency in reasoning about sampling-based token selection, probability distribution behavior, and trade-offs between diversity and likelihood.

In autoregressive text generation, a language model produces a vector of logits for the next token at each step. These logits are converted to a probability distribution over the vocabulary (typically via a temperature-scaled softmax), and a decoding strategy chooses which token to emit. Greedy decoding always selects the single most likely token. Stochastic strategies instead sample from a restricted subset of likely tokens, trading off coherence against diversity.

Two of the most common truncation-based sampling strategies are top-k decoding (keep the $k$ highest-probability tokens, renormalize, then sample) and top-p / nucleus decoding (keep the smallest set of tokens whose cumulative probability reaches a threshold $p$ , renormalize, then sample).

You are interviewing for an ML role. Answer the following about these two strategies, with concrete reasoning about why each behavior occurs.

Constraints & Assumptions

Focus on the next-token sampling step of an already-trained autoregressive LM; you do not need to discuss training, beam search, or model architecture.
Assume softmax probabilities $p_1 \ge p_2 \ge \dots \ge p_V$ over a vocabulary of size $V$ , optionally after temperature scaling.
Where helpful, reason about two regimes: a peaked (low-entropy / confident) distribution and a flat (high-entropy / uncertain) distribution.

Clarifying Questions to Ask

Is temperature applied before or after the truncation step, and is it held fixed across the comparison? (It changes the effective sharpness of the distribution and interacts with both $k$ and $p$ .)
Are we evaluating for a single task/style (e.g., factual QA vs. open-ended creative writing), or comparing across tasks? The "right" decoding choice is task-dependent.
Should the answer assume well-calibrated probabilities, or account for the fact that the model's tail probabilities may be unreliable?
Are top-k and top-p meant to be used in isolation, or can they be combined (and with temperature)?

Part 1

What are the main flaws of top-k decoding?

What This Part Should Cover

That a fixed $k$ is insensitive to the entropy/shape of the per-step distribution.
The peaked-distribution failure: $k$ too large admits low-probability tail tokens that can still be sampled after renormalization.
The flat-distribution failure: $k$ too small excludes many near-equally-plausible tokens, reducing diversity.
The arbitrary, hard cutoff at the boundary (the $k$ -th vs. $(k{+}1)$ -th token may be nearly tied) and the need to tune $k$ per model/task/temperature.

Part 2

What is top-p (nucleus) decoding? Describe the algorithm precisely.

What This Part Should Cover

The precise per-step procedure: softmax → sort descending → take the smallest prefix whose cumulative mass $\ge p$ → renormalize over that nucleus → sample.
That the nucleus size is dynamic and varies step to step.
The role of $p$ (e.g., $p = 0.9$ ) and how the renormalization step works.
A correct worked intuition (small example) showing which tokens fall inside vs. outside the nucleus.

Part 3

How does top-p decoding address some of the weaknesses of top-k decoding?

What This Part Should Cover

That top-p adapts the candidate-set size to the model's per-step uncertainty (small nucleus when peaked, large when flat).
How this fixes the peaked-distribution junk-token problem and the flat-distribution truncation problem from Part 1.
An honest statement of what top-p does not fix (still a hyperparameter; still sensitive to miscalibrated tail probabilities; a hard cutoff at the mass boundary remains).

What a Strong Answer Covers

Across all parts, a strong candidate connects the mechanics to behavior rather than reciting definitions. Specifically:

Frames both methods as truncation + renormalize + sample , differing only in how the truncation set is chosen (fixed count vs. cumulative mass).
Reasons explicitly about the peaked vs. flat regimes and uses them consistently across all three parts.
Distinguishes "probabilities are arbitrary cut" (both methods) from "the cut location adapts to the distribution" (only top-p).
Notes the interaction with temperature and calibration , and avoids overclaiming that top-p is strictly superior.

Follow-up Questions

How do temperature and top-p interact? If you raise temperature, what should you do to $p$ to keep generation quality roughly stable?
Many systems combine top-k and top-p (apply both filters). Why might you want both, and in what order would you apply them?
Top-p still has a hard cutoff at the cumulative-mass boundary. What alternative (e.g., min-p or typical/locally-typical sampling, or temperature-only) might address the residual "arbitrary boundary" problem, and what is its tradeoff?
How would you empirically choose $p$ for a factual question-answering system versus an open-ended story generator, and what failure modes would you watch for at each extreme ( $p \to 0$ and $p \to 1$ )?

You are interviewing for an ML role. Answer the following about these two strategies, with concrete reasoning about why each behavior occurs.

Constraints & Assumptions

Focus on the next-token sampling step of an already-trained autoregressive LM; you do not need to discuss training, beam search, or model architecture.
Assume softmax probabilities $p_1 \ge p_2 \ge \dots \ge p_V$ over a vocabulary of size $V$ , optionally after temperature scaling.
Where helpful, reason about two regimes: a peaked (low-entropy / confident) distribution and a flat (high-entropy / uncertain) distribution.

Clarifying Questions to Ask

Is temperature applied before or after the truncation step, and is it held fixed across the comparison? (It changes the effective sharpness of the distribution and interacts with both $k$ and $p$ .)
Are we evaluating for a single task/style (e.g., factual QA vs. open-ended creative writing), or comparing across tasks? The "right" decoding choice is task-dependent.
Should the answer assume well-calibrated probabilities, or account for the fact that the model's tail probabilities may be unreliable?
Are top-k and top-p meant to be used in isolation, or can they be combined (and with temperature)?

Part 1

What are the main flaws of top-k decoding?

What This Part Should Cover

That a fixed $k$ is insensitive to the entropy/shape of the per-step distribution.
The peaked-distribution failure: $k$ too large admits low-probability tail tokens that can still be sampled after renormalization.
The flat-distribution failure: $k$ too small excludes many near-equally-plausible tokens, reducing diversity.
The arbitrary, hard cutoff at the boundary (the $k$ -th vs. $(k{+}1)$ -th token may be nearly tied) and the need to tune $k$ per model/task/temperature.

Part 2

What is top-p (nucleus) decoding? Describe the algorithm precisely.

What This Part Should Cover

The precise per-step procedure: softmax → sort descending → take the smallest prefix whose cumulative mass $\ge p$ → renormalize over that nucleus → sample.
That the nucleus size is dynamic and varies step to step.
The role of $p$ (e.g., $p = 0.9$ ) and how the renormalization step works.
A correct worked intuition (small example) showing which tokens fall inside vs. outside the nucleus.

Part 3

How does top-p decoding address some of the weaknesses of top-k decoding?

What This Part Should Cover

That top-p adapts the candidate-set size to the model's per-step uncertainty (small nucleus when peaked, large when flat).
How this fixes the peaked-distribution junk-token problem and the flat-distribution truncation problem from Part 1.
An honest statement of what top-p does not fix (still a hyperparameter; still sensitive to miscalibrated tail probabilities; a hard cutoff at the mass boundary remains).

What a Strong Answer Covers

Across all parts, a strong candidate connects the mechanics to behavior rather than reciting definitions. Specifically:

Frames both methods as truncation + renormalize + sample , differing only in how the truncation set is chosen (fixed count vs. cumulative mass).
Reasons explicitly about the peaked vs. flat regimes and uses them consistently across all three parts.
Distinguishes "probabilities are arbitrary cut" (both methods) from "the cut location adapts to the distribution" (only top-p).
Notes the interaction with temperature and calibration , and avoids overclaiming that top-p is strictly superior.

Follow-up Questions

How do temperature and top-p interact? If you raise temperature, what should you do to $p$ to keep generation quality roughly stable?
Many systems combine top-k and top-p (apply both filters). Why might you want both, and in what order would you apply them?
Top-p still has a hard cutoff at the cumulative-mass boundary. What alternative (e.g., min-p or typical/locally-typical sampling, or temperature-only) might address the residual "arbitrary boundary" problem, and what is its tradeoff?
How would you empirically choose $p$ for a factual question-answering system versus an open-ended story generator, and what failure modes would you watch for at each extreme ( $p \to 0$ and $p \to 1$ )?

Explain Top-k and Top-p Decoding

Quick Overview

Constraints & Assumptions

Clarifying Questions to Ask

Part 1

What This Part Should Cover

Part 2

What This Part Should Cover

Part 3

What This Part Should Cover

What a Strong Answer Covers

Follow-up Questions

Solution

Submit Your Answer to Earn 20XP

Explain Top-k and Top-p Decoding

Quick Overview

Constraints & Assumptions

Clarifying Questions to Ask

Part 1

What This Part Should Cover

Part 2

What This Part Should Cover

Part 3

What This Part Should Cover

What a Strong Answer Covers

Follow-up Questions

Solution

Submit Your Answer to Earn 20XP