LLM Decoding, Sampling, And Logit Controls

What's being tested

Candidates must show practical mastery of probabilistic decoding from autoregressive models: how sampling algorithms, temperature scaling, and logit controls (biases, penalties, constraints) change output quality, safety, and latency in production. Interviewers probe whether you can pick and implement the right decoding strategy for deployment constraints, reason about numerical and tokenization edge cases, and measure the tradeoffs (diversity vs. reliability vs. performance).

Core knowledge

Softmax and temperature math: probabilities are $p_i = \frac{\exp(l_i / T)}{\sum_j \exp(l_j / T)}$; as $T \to 0$ sampling approaches deterministic argmax, $T > 1$ flattens the distribution, and $T < 1$ sharpens it; always apply the numerical-stability trick of subtracting max(logits) before exponentiating.
Top-k sampling: keep the K highest logits, mask the others to $-\infty$ (so their probabilities are zero), then renormalize; a full sort costs $O(V \log V)$, so use selection algorithms ($O(V)$ expected) or GPU top-k kernels; the fixed cutoff trims low-probability tail noise but does not adapt to how peaked the distribution is.
Nucleus (top-p) sampling: choose the smallest set S with $\sum_{i \in S} p_i \geq p$ (e.g., $p = 0.9$), then renormalize over S; the set size adapts to the distribution, which is useful when head mass varies across timesteps.
Greedy / beam search vs. sampling: beam search optimizes sequence score (approx. MAP) and increases latency and memory; greedy is fastest but brittle; sampling produces diverse outputs but can hallucinate.
Logit biasing: implement as $l'_i = l_i + b_i$ (bias vector) or per-token adjustments (presence/frequency penalty); always apply bias before softmax/top-k/top-p; large negative biases effectively block tokens.
Frequency / presence penalties: subtract $\alpha \cdot count(token)$ from a token's logit for frequency, or a flat $\beta$ once the token has appeared for presence; caution: penalties interact with subword tokenization, since the same word can map to different tokens (e.g., with or without a leading space), so a repeated word may escape the penalty.
Constrained/forced decoding: enforce allowed continuations via a prefix trie (or, more generally, a finite-state automaton) over token IDs; used for deterministic formatting, blocklists, or lexicon constraints; tracking the current trie node across steps keeps the check incremental instead of scanning all $V$ tokens per step.
Safety blocklists: token-level matching is complicated by Unicode variants and subword splits; implement multi-stage checks (fast token-trie for common blocks, fallback classifier for ambiguous cases); log decisions for review.
Reproducibility & RNGs: use parallel-safe, counter-based RNGs like Philox with a per-request seed; record the seed and sampling parameters so sequences are repeatable across runs and A/B tests.
Numerical & precision issues: exp overflows quickly in low precision (FP16 near exp(11), FP32 near exp(88)); subtracting the max logit makes every exponent $\leq 0$, so exp cannot overflow; compute the softmax in FP32 where possible, and optionally clip logits to a reasonable range (e.g., $[-100, 100]$) as a guard against inf/NaN inputs.
Latency & batching tradeoffs: computing top-k/top-p over a huge vocabulary on CPU can be a bottleneck; use fused GPU kernels, pruning, or a precomputed shortlist when latency SLOs require <50ms/token.
Metrics for decoding choices: measure perplexity on held-out data, diversity (distinct-n), repetition rates, and human/automatic safety rejection rates; calibrate $T$ and $\alpha$ with A/B tests and offline likelihood diagnostics.
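
The primitives above (logit bias, temperature, top-k, top-p, seeded sampling) can be sketched in one small pipeline. This is a minimal pure-Python illustration, not a production kernel: the function names and the `bias` dict format are assumptions made for the example, the order of operations (bias, then top-k, then temperature softmax, then top-p) is one common choice, and Python's `random.Random` is a Mersenne Twister standing in for a counter-based RNG such as Philox.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax using the subtract-max stability trick."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(logits, k):
    """Keep the k largest logits and mask the rest to -inf."""
    if k >= len(logits):
        return list(logits)
    # A full sort is used for brevity; a selection algorithm would be cheaper.
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    keep = set(order[:k])
    return [l if i in keep else -math.inf for i, l in enumerate(logits)]

def top_p_filter(probs, p):
    """Keep the smallest high-probability set with mass >= p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = set(), 0.0
    for i in order:
        kept.add(i)
        mass += probs[i]
        if mass >= p:
            break
    return [probs[i] / mass if i in kept else 0.0 for i in range(len(probs))]

def sample_next(logits, *, temperature=1.0, k=None, p=None, bias=None, rng=None):
    """Sample one token ID. `bias` is an assumed {token_id: additive_bias} dict."""
    if rng is None:
        rng = random.Random()
    if bias:
        # Bias is applied to logits before any filtering or softmax.
        logits = [l + bias.get(i, 0.0) for i, l in enumerate(logits)]
    if temperature <= 0:
        # Treat T = 0 as greedy argmax instead of dividing by zero.
        return max(range(len(logits)), key=lambda i: logits[i])
    if k is not None:
        logits = top_k_filter(logits, k)
    probs = softmax(logits, temperature)
    if p is not None:
        probs = top_p_filter(probs, p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Passing `rng=random.Random(seed)` with a per-request seed makes a call sequence repeatable within this sketch; a bias of `-math.inf` blocks a token outright, provided at least one token remains unblocked.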

Worked example — designing decoding for a low-latency chat service with safety constraints

Frame the problem: ask for the expected latency SLO, allowed token rate, determinism requirements, and the severity and type of safety risks. Assume 50ms/token p99, a requirement to block profanity, and JSON-formatted responses when requested. Organize the answer into three pillars: (1) algorithm choice for latency/quality (use greedy or low-$T$ top-k with K=40 for short replies), (2) safety/logit controls (token-level blocklist via trie plus a heavy negative logit bias for banned tokens; add forced decoding for JSON schema tokens), (3) infra/perf (use a fused GPU top-k kernel, batch decoding where possible, deterministic RNG seeding). Explicit tradeoff to flag: stricter blocking and lower temperature reduce hallucination but also lower diversity, which risks overconstraining the user experience. Close by proposing offline evaluation: sweep $T$ and K on production-like prompts and measure rejection rate, distinct-n, and p99 latency; "if I had more time, I'd add adaptive decoding (lower $T$ when model confidence is high) and audit tokenizer-induced bypasses."
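
The forced-decoding pillar can be illustrated with a prefix trie over token IDs. The sketch below is a toy under stated assumptions: the seven-token vocabulary in the comments, the `EOS` ID, and the `logits_fn` callback are all hypothetical, and a real JSON constraint would normally need a grammar or automaton rather than an enumerated list of allowed sequences.

```python
EOS = 0  # hypothetical end-of-sequence token ID

def build_trie(sequences, eos_id=EOS):
    """Nested-dict trie; every allowed sequence is terminated with eos_id."""
    root = {}
    for seq in sequences:
        node = root
        for tok in list(seq) + [eos_id]:
            node = node.setdefault(tok, {})
    return root

def allowed_next(trie, prefix):
    """Token IDs that extend `prefix` toward at least one allowed sequence."""
    node = trie
    for tok in prefix:
        node = node.get(tok)
        if node is None:
            return set()  # prefix already left the allowed language
    return set(node.keys())

def constrained_greedy(logits_fn, trie, eos_id=EOS, max_steps=32):
    """Greedy decoding that only ever considers trie-approved tokens.

    `logits_fn(prefix)` is an assumed callback returning a list of logits
    indexed by token ID. Only the allowed IDs are inspected, so the per-step
    cost of the constraint does not scale with vocabulary size.
    """
    prefix = []
    for _ in range(max_steps):
        allowed = allowed_next(trie, prefix)
        if not allowed:
            break
        logits = logits_fn(prefix)
        tok = max(allowed, key=lambda t: logits[t])
        if tok == eos_id:
            break
        prefix.append(tok)
    return prefix

# Toy vocabulary (illustrative only):
# 0=EOS, 1='{', 2='"ok"', 3=':', 4='true', 5='false', 6='}'
ALLOWED = [[1, 2, 3, 4, 6], [1, 2, 3, 5, 6]]
trie = build_trie(ALLOWED)

def toy_logits(prefix):
    # A model that always prefers '}' when unconstrained.
    return [0.1, 0.2, 0.3, 0.4, 1.0, 2.0, 5.0]

result = constrained_greedy(toy_logits, trie)
```

Here the unconstrained argmax would emit token 6 at every step, while the constrained loop is forced through the schema and only chooses freely between `true` and `false`. This version re-walks the trie from the root each step for clarity; keeping a pointer to the current node would make the check incremental.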

A second angle — controlling repetition for long-form generation

The same primitives apply but the constraints differ: for long outputs, repetition becomes the dominant failure mode. Use a frequency penalty together with n-gram blocking, implemented as a dynamic trie of recent token sequences that bans repeats. Prefer top-p with moderate temperature ($T \approx 0.7$): truncating the low-probability tail preserves coherence while the frequency penalty prevents loops. Beam search often worsens repetition (and needs length-normalization hacks); sampling with penalties typically gives better human-perceived diversity. Evaluate with automatic repetition metrics (e.g., percentage of repeated 3-grams) and tune penalties per output-length bucket.
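
A rough sketch of the two repetition controls follows. The function names and default values of `alpha` and `beta` are illustrative assumptions, and a linear scan over the history stands in for the dynamic trie, which would make the lookup cheaper on long outputs.

```python
import math
from collections import Counter

def apply_repetition_penalties(logits, history, alpha=0.5, beta=0.2):
    """Subtract alpha * count (frequency) plus beta (presence) per seen token."""
    out = list(logits)
    for tok, count in Counter(history).items():
        out[tok] -= alpha * count + beta
    return out

def banned_by_ngram(history, n=3):
    """Token IDs that would complete an n-gram already present in history."""
    if n < 2:
        raise ValueError("n must be at least 2")
    if len(history) < n:
        return set()
    tail = tuple(history[-(n - 1):])  # the last n-1 generated tokens
    banned = set()
    # Scan every earlier complete n-gram whose first n-1 tokens match the tail.
    for i in range(len(history) - n + 1):
        if tuple(history[i:i + n - 1]) == tail:
            banned.add(history[i + n - 1])
    return banned

def mask_banned(logits, banned):
    """Set banned token logits to -inf so they can never be sampled."""
    return [-math.inf if i in banned else l for i, l in enumerate(logits)]

history = [5, 6, 7, 5, 6]
step_logits = [0.0] * 8
step_logits = apply_repetition_penalties(step_logits, history)
step_logits = mask_banned(step_logits, banned_by_ngram(history, n=3))
```

With this history the tail `(5, 6)` was previously followed by token 7, so token 7 is blocked outright while tokens 5 and 6 are only down-weighted by the penalties.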

Common pitfalls

Pitfall: treating temperature as a scale on probabilities rather than logits.
Temperature must be applied to the logits before softmax. Dividing or multiplying the probabilities by $T$ yields values that no longer sum to 1, and renormalizing them simply recovers the original distribution, so the intended sharpening or flattening never happens.
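
A short numeric check makes the difference concrete. The logits and the value $T = 0.5$ are arbitrary illustrative choices.

```python
import math

def softmax(logits):
    """Plain softmax with the subtract-max stability trick."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
T = 0.5

probs = softmax(logits)

# Wrong: scaling probabilities. The result sums to 1/T, not 1.
wrong = [p / T for p in probs]
# Renormalizing cancels the constant factor and returns the original probs.
wrong_renorm = [w / sum(wrong) for w in wrong]

# Right: scale logits, then softmax. With T < 1 the top token gains mass.
right = softmax([l / T for l in logits])
```

In this sketch `wrong_renorm` matches `probs` up to rounding, while `right` puts more mass on the first token than `probs` does.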

Pitfall: assuming token-level blocklists operate on UTF-8 strings.
Subword tokenization can split banned words; a blocklist must be applied on token-ID sequences (with tokenization-aware tries) or combined with post-generation substring checks.
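
To see why both layers are needed, consider the toy sketch below. The vocabulary, the banned word, and all function names are hypothetical; the recursive enumeration is only practical for tiny vocabularies, and a real system would compile the patterns into a trie.

```python
# Hypothetical toy vocabulary: token ID -> text piece.
VOCAB = {0: "dar", 1: "n", 2: "darn", 3: " it", 4: "oh ", 5: "da", 6: "rnit"}

def decode(ids, vocab=VOCAB):
    return "".join(vocab[i] for i in ids)

def segmentations(word, vocab=VOCAB):
    """All token-ID sequences whose concatenated text equals `word` exactly."""
    if not word:
        return [[]]
    results = []
    for tok_id, text in vocab.items():
        if text and word.startswith(text):
            for rest in segmentations(word[len(text):], vocab):
                results.append([tok_id] + rest)
    return results

def contains_subsequence(ids, pattern):
    n = len(pattern)
    return any(ids[i:i + n] == pattern for i in range(len(ids) - n + 1))

def token_level_hit(ids, banned_words, vocab=VOCAB):
    """Tokenization-aware check: match every spelling of each banned word."""
    return any(
        contains_subsequence(ids, pattern)
        for word in banned_words
        for pattern in segmentations(word, vocab)
    )

def text_level_hit(ids, banned_words, vocab=VOCAB):
    """Post-generation fallback: substring check on the decoded text."""
    text = decode(ids, vocab).lower()
    return any(word in text for word in banned_words)

BANNED = ["darn"]
split_ids = [4, 0, 1, 3]   # "oh " + "dar" + "n" + " it"
straddle_ids = [4, 5, 6]   # "oh " + "da" + "rnit"
```

A naive blocklist containing only token ID 2 misses `split_ids`. The tokenization-aware check catches it, but still misses `straddle_ids`, where the banned string straddles a token that extends past the word; the decoded-text check catches that case.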

Pitfall: optimizing only for offline perplexity metrics.
Perplexity measures likelihood under the model, not human-likeness or safety; always include human or task-specific metrics (distinct-n, repetition, safety rejection rate) in tuning.
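
Two of the automatic metrics named above are cheap to compute over token IDs. The definitions below are one reasonable reading, stated as assumptions: distinct-n as unique n-grams over total n-grams, and repetition rate as the fraction of n-gram occurrences whose n-gram appears more than once in the same output.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token sequence, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def distinct_n(tokens, n):
    """Unique n-grams divided by total n-grams (0.0 if there are none)."""
    grams = ngrams(tokens, n)
    return len(set(grams)) / len(grams) if grams else 0.0

def repeated_ngram_rate(tokens, n=3):
    """Fraction of n-gram occurrences belonging to an n-gram seen 2+ times."""
    grams = ngrams(tokens, n)
    if not grams:
        return 0.0
    counts = Counter(grams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(grams)

sample = [1, 2, 3, 1, 2, 3, 4]
d3 = distinct_n(sample, 3)           # 4 unique of 5 trigrams
r3 = repeated_ngram_rate(sample, 3)  # (1, 2, 3) occurs twice: 2 of 5
```

Tracking these per output-length bucket alongside safety rejection rate gives a tuning signal that perplexity alone does not provide.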

Connections

This topic directly links to evaluation & monitoring (how you measure decoding policy drift, repetition, safety regressions) and to serving infrastructure (GPU kernels, batching, deterministic RNGs). Interviewers may pivot to tokenization/tokenizer design, or to alignment controls / RLHF when discussing safety tradeoffs.