LLM Decoding, Sampling, And Logit Controls
Asked of: ML Engineer

What's being tested
Candidates must show practical mastery of probabilistic decoding from autoregressive models: how sampling algorithms, temperature scaling, and logit controls (biases, penalties, constraints) change output quality, safety, and latency in production. Interviewers probe whether you can pick and implement the right decoding strategy for the deployment constraints, reason about numerical and tokenization edge cases, and measure the tradeoffs (diversity vs. reliability vs. performance).
Core knowledge
- Softmax and temperature math: probabilities are $p_i = \exp(z_i/T) / \sum_j \exp(z_j/T)$ for logits $z$ and temperature $T$; $T \to 0$ approaches deterministic `argmax`, $T > 1$ flattens the distribution, and $T < 1$ sharpens it. Always apply the numerical stability trick: subtract `max(logits)` before exponentiating.
- Top-k (`top-k`) sampling: keep the K highest logits, set the others to $-\infty$ (so their probabilities become zero), then renormalize. A naïve full sort costs $O(V \log V)$ for vocabulary size $V$; use selection algorithms or GPU `top-k` kernels. The fixed cutoff gives deterministic sparsity that controls tail noise.
- Nucleus (`top-p`) sampling: choose the smallest set $S$ with $\sum_{i \in S} p_i \ge p$ (e.g., $p = 0.9$); the set size adapts to the distribution mass, which is useful when head mass varies across timesteps.
- Greedy / beam search vs. sampling: beam search optimizes sequence score (approx. MAP) and increases latency and memory; greedy is fastest but brittle; sampling produces diverse outputs but can hallucinate.
- Logit biasing: implement as $z' = z + b$ (bias vector $b$) or as per-token adjustments (presence/frequency penalty); always apply the bias before `softmax`/`top-k`/`top-p`; large negative biases effectively block tokens.
- Frequency / presence penalties: for frequency, subtract $\alpha_f \cdot c_t$ from the logit of token $t$, where $c_t$ is the number of times $t$ has already been generated; for presence, subtract a flat $\alpha_p$ whenever $c_t > 0$. Careful: penalties interact with subword tokenization, because a repeated semantic word can map to different tokens.
- Constrained/forced decoding: enforce allowed continuations via a prefix trie (finite-state automaton) on token IDs; used for deterministic formatting, blocklists, or lexicon constraints; efficient streaming checks avoid an $O(V)$ scan per step.
- Safety blocklists: these operate at the token level, where Unicode and subword issues arise; implement multi-stage checks (a fast token trie for common blocks, a fallback classifier for ambiguous cases); log decisions for review.
- Reproducibility & RNGs: use parallel-safe RNGs like `Philox` with a per-request seed; document the seed and sampling state so sequences are repeatable across runs and A/B tests.
- Numerical & precision issues: FP16/FP32 logits can overflow in exp; subtract the max logit, and clip logits to a reasonable range before exponentiation.
- Latency & batching tradeoffs: computing `top-k`/`top-p` over a huge vocabulary on CPU can be a bottleneck; use fused GPU kernels, pruning, or a precomputed shortlist when latency SLOs require <50ms/token.
- Metrics for decoding choices: measure `ppl` (perplexity) on held-out data, diversity (distinct-n), repetition rates, and human/automatic safety rejection rates; calibrate with A/B tests and offline likelihood diagnostics.
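The core sampling steps listed above can be sketched in a few lines of pure Python. This is an illustrative sketch, not a production kernel: the function names (`softmax`, `top_k_filter`, `top_p_filter`, `sample_next`) and the dict-based `bias` argument are assumptions made for this example, and a real service would run the same logic as batched GPU tensor operations.

```python
import heapq
import math
import random


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; use argmax for the T -> 0 limit")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(s - m) for s in scaled]  # exp(-inf) == 0.0 for masked tokens
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(logits, k):
    """Keep the k largest logits and mask the rest with -inf."""
    # heapq.nlargest is a selection, O(V log k), rather than a full sort.
    keep = set(heapq.nlargest(k, range(len(logits)), key=logits.__getitem__))
    return [z if i in keep else -math.inf for i, z in enumerate(logits)]


def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative mass is >= p, renormalized."""
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    out = [0.0] * len(probs)
    for i in kept:
        out[i] = probs[i] / cum
    return out


def sample_next(logits, *, temperature=1.0, top_k=None, top_p=None,
                bias=None, rng=random):
    """Bias -> top-k -> temperature softmax -> top-p -> sample one token ID."""
    z = list(logits)
    for tok, b in (bias or {}).items():  # bias is applied to logits, before softmax
        z[tok] += b
    if top_k is not None:
        z = top_k_filter(z, top_k)
    probs = softmax(z, temperature)
    if top_p is not None:
        probs = top_p_filter(probs, top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Passing `rng=random.Random(seed)` with a per-request seed makes a call sequence repeatable; the standard-library generator stands in here for a parallel-safe generator such as Philox. A bias of `-math.inf` on a token ID removes it from sampling entirely.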
Worked example — designing decoding for a low-latency chat service with safety constraints
Frame the problem: ask about the expected latency SLO, allowed token rate, determinism requirements, and the severity and type of safety risks. Assume 50ms/token p99, and that the service must block profanity and ensure JSON-response formatting when requested. Organize the answer into three pillars: (1) algorithm choice for latency/quality (use greedy or low-temperature top-k with K=40 for short replies), (2) safety/logit controls (token-level blocklist via trie + heavy negative logit bias for banned tokens; add forced decoding for JSON schema tokens), (3) infra/perf (use a fused GPU top-k kernel, batch decoding where possible, deterministic RNG seeding). Explicit tradeoff to flag: stricter blocking and lower temperature reduce hallucination but also lower diversity, which risks overconstraining the user experience. Close by proposing offline evaluation: sweep temperature and K on production-like prompts, and measure rejection rate, distinct-n, and p99 latency; "if I had more time, I'd add adaptive decoding (lower temperature when model confidence is high) and audit tokenizer-induced bypasses."
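The forced-decoding piece of pillar (2) can be sketched as a prefix trie over token IDs with a cursor that advances one node per generated token. Everything here is hypothetical scaffolding for illustration: `build_trie`, `constrained_greedy`, the `END` sentinel, and the `logits_fn` callback stand in for a real tokenizer and model, and the token IDs are made up.

```python
END = -1  # sentinel marking the end of an allowed sequence (real token IDs are >= 0)


def build_trie(sequences):
    """Nested-dict prefix trie: each node maps a token ID to its child node."""
    root = {}
    for seq in sequences:
        node = root
        for tok in seq:
            node = node.setdefault(tok, {})
        node[END] = {}
    return root


def constrained_greedy(logits_fn, trie, max_steps):
    """Greedy decoding restricted at every step to the trie's continuations."""
    out, node = [], trie
    for _ in range(max_steps):
        allowed = [t for t in node if t != END]
        if not allowed:  # leaf reached: a complete allowed sequence was emitted
            break
        logits = logits_fn(out)
        # Only the allowed entries are compared; the full vocabulary is never scanned.
        best = max(allowed, key=lambda t: logits[t])
        out.append(best)
        node = node[best]  # advance the cursor instead of re-walking from the root
    return out
```

For example, with allowed sequences `[[5, 1, 7], [5, 2, 8]]` the decoder must start with token 5 even if the model's highest logit belongs to some other token. One simplification to note: if one allowed sequence is a prefix of another, this sketch always continues to the longer one.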
A second angle — controlling repetition for long-form generation
The same primitives apply but the constraints differ: for long outputs, repetition becomes the dominant failure mode. Use a frequency penalty and n-gram blocking, implemented as a dynamic trie of recent token sequences, to ban repeats. Prefer top-p with moderate temperature so that truncating the tail preserves coherence while frequency penalties prevent loops. Beam search often worsens repetition (length-normalization hacks are needed); sampling with penalties typically gives better human-perceived diversity. Also evaluate with automatic repetition metrics (percentage of repeated 3-grams) and tune penalties per output-length bucket.
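A minimal sketch of these repetition controls follows, assuming token IDs are list indices into the logits. The function names and parameters are illustrative, and a plain linear scan over the history stands in for the dynamic trie mentioned above, which would make the lookup incremental.

```python
from collections import Counter


def apply_repetition_penalties(logits, history, freq_penalty=0.0, presence_penalty=0.0):
    """Subtract freq_penalty * count plus a flat presence_penalty for seen tokens."""
    out = list(logits)
    for tok, count in Counter(history).items():
        out[tok] -= freq_penalty * count + presence_penalty
    return out


def ngram_banned_tokens(history, n=3):
    """Token IDs that would complete an n-gram already present in history."""
    if n < 2 or len(history) < n:
        return set()
    prefix = tuple(history[-(n - 1):])  # the last n-1 generated tokens
    banned = set()
    for i in range(len(history) - n + 1):
        if tuple(history[i:i + n - 1]) == prefix:
            banned.add(history[i + n - 1])
    return banned


def repeated_ngram_rate(tokens, n=3):
    """Fraction of n-grams in one output that duplicate an earlier n-gram."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)
```

The banned set would typically be applied by setting those logits to negative infinity before sampling, and `repeated_ngram_rate` gives the repeated 3-gram percentage used to tune penalties per length bucket.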
Common pitfalls
Pitfall: treating temperature as a scale on probabilities rather than logits.
Temperature must be applied to the logits before `softmax`. Scaling the probabilities directly produces an incorrect, non-normalized vector (and renormalizing it simply recovers the original distribution), so the expected sharpening or flattening never happens.
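A small numeric check makes the difference concrete. The logits and temperature below are arbitrary illustrative values; the last lines show the one probability-space operation that does match logit scaling, namely raising each probability to the power 1/T and renormalizing.

```python
import math


def softmax(z):
    m = max(z)  # stability: subtract the max before exponentiating
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
T = 0.5

# Correct: divide the logits by T, then apply softmax.
correct = softmax([z / T for z in logits])

# Wrong: divide the probabilities by T. The result sums to 1/T, not 1.
wrong = [p / T for p in softmax(logits)]

# Equivalent in probability space: p ** (1/T), then renormalize.
powered = [p ** (1.0 / T) for p in softmax(logits)]
equivalent = [x / sum(powered) for x in powered]
```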
Pitfall: assuming token-level blocklists operate on UTF-8 strings.
Subword tokenization can split banned words; a blocklist must be applied on token-ID sequences (with tokenization-aware tries) or combined with post-generation substring checks.
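To illustrate the bypass, the sketch below enumerates every way a toy vocabulary can spell a banned string and then blocks whichever token would complete any of those sequences. The vocabulary, the placeholder banned string "bad", and the function names are all made up for this example; a real tokenizer has far more pieces, so exhaustive enumeration is only practical for short strings.

```python
def all_tokenizations(text, vocab):
    """Every segmentation of text into vocab pieces, as lists of token IDs.

    Assumes vocab maps non-empty strings to token IDs.
    """
    if not text:
        return [[]]
    results = []
    for piece, tok_id in vocab.items():
        if text.startswith(piece):
            for rest in all_tokenizations(text[len(piece):], vocab):
                results.append([tok_id] + rest)
    return results


def blocked_next_tokens(history, banned_seqs):
    """Token IDs that would complete a banned sequence given the history so far."""
    blocked = set()
    for seq in banned_seqs:
        prefix, last = seq[:-1], seq[-1]
        if not prefix or history[-len(prefix):] == prefix:
            blocked.add(last)
    return blocked


def contains_banned(text, banned_words):
    """Post-generation substring check on the decoded string."""
    lowered = text.casefold()
    return any(word.casefold() in lowered for word in banned_words)
```

With the toy vocabulary `{"bad": 0, "b": 1, "ad": 2, "ba": 3, "d": 4, "word": 5}`, the string "bad" has three token-ID spellings, so blocking only token 0 would miss two of them. The token-level check still cannot see a banned string that straddles a longer piece such as a hypothetical "xba" followed by "d", which is why the decoded-text check remains as a second stage.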
Pitfall: optimizing only for offline perplexity metrics.
Lower perplexity means higher likelihood under the model, not human-likeness or safety; always include human or task-specific metrics (distinct-n, repetition, safety rejection rate) in tuning.
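The gap between likelihood and quality can be shown with two small metrics. This is a sketch under stated assumptions: `perplexity` takes per-token natural-log probabilities as scored by some model, `distinct_n` is the common unique-over-total n-gram ratio across a set of outputs, and the example numbers in the note below are invented for illustration.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood per token (natural log)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def distinct_n(samples, n=2):
    """Unique n-grams divided by total n-grams across a set of token sequences."""
    total, unique = 0, set()
    for tokens in samples:
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0
```

A looping output such as `[1, 2, 1, 2, 1, 2, 1, 2]`, with each token scored at probability 0.9, has perplexity of about 1.11 but a distinct-2 of only 2/7, while a varied output scored at 0.4 per token has perplexity 2.5 and a distinct-2 of 1.0. Tuning on perplexity alone would prefer the loop.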
Connections
This topic directly links to evaluation & monitoring (how you measure decoding policy drift, repetition, safety regressions) and to serving infrastructure (GPU kernels, batching, deterministic RNGs). Interviewers may pivot to tokenization/tokenizer design, or to alignment controls / RLHF when discussing safety tradeoffs.
Further reading
- Holtzman et al., "The Curious Case of Neural Text Degeneration" (2020): introduces nucleus sampling and demonstrates sampling-vs-beam tradeoffs.
- OpenAI Cookbook, "Best Practices for Text Generation": practical tips and code patterns for sampling, temperature, and logit biasing.
Related concepts
- LLM Foundations, Embeddings, Prompts, And Fine-Tuning
- LLM Eval Data Slicing and Debugging
- LLM Architecture, Tuning, And Evaluation (Machine Learning)
- RLHF And Preference Optimization Basics
- LLM Chat Applications, RAG, And ML Evaluation (ML System Design)
- LLM Evaluation: Faithfulness, Hallucination, And Human Review For Meta AI