
Implement attention and nucleus sampling; compare to top-k

Last updated: Mar 29, 2026

Quick Overview

This question tests both the implementation and the theory behind two Transformer components: multi-head scaled dot-product attention and nucleus (top-p) sampling. It covers tensor shaping, masking, numerical stability, attention mechanics, and probabilistic sampling within the Machine Learning / Deep Learning domain.



Company: TikTok

Role: Machine Learning Engineer

Category: Machine Learning

Difficulty: hard

Interview Round: Technical Screen

Implement multi-head attention from scratch: given query, key, and value tensors of shape (batch_size, seq_len, d_model), the number of heads h, and an optional attention mask, compute the output tensor with shape (batch_size, seq_len, d_model). Include linear projections to head dimensions, scaled dot-product attention with softmax, masking, dropout, concatenation of heads, and the final output projection. Follow-up: Why is the dot-product attention scaled by the square root of the per-head key dimension d_k = d_model / h?

Implement nucleus (top-p) sampling for a language model: given logits over a vocabulary and a threshold p in (0, 1], select the smallest set of tokens whose cumulative probability is at least p, renormalize their probabilities, and sample the next token. Follow-up: What are the advantages and disadvantages of top-p sampling compared with top-k sampling?


Aug 11, 2025, 12:00 AM

Implement Multi‑Head Attention and Nucleus (Top‑p) Sampling

Context

You are building core components used in Transformer-based language models. Implement multi-head attention (MHA) from scratch and nucleus (top-p) sampling. Assume a deep learning framework (e.g., PyTorch) is available. Clearly handle shapes, masking, and numerical stability.

Tasks

  1. Multi‑Head Attention Implementation
  • Inputs:
    • Query Q, Key K, Value V: tensors of shape (batch_size, seq_len, d_model)
    • Number of heads h (assume d_model % h == 0)
    • Optional attention mask, broadcastable to the attention-score shape (batch_size, h, seq_len, seq_len)
  • Requirements:
    • Linear projections to head dimensions d_k = d_v = d_model / h
    • Scaled dot‑product attention with softmax
    • Support masking (e.g., padding or causal masks)
    • Apply dropout to attention weights
    • Concatenate heads and apply a final output projection
    • Output shape: (batch_size, seq_len, d_model)
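
One way the requirements above could be met is sketched below. It is written in NumPy rather than PyTorch so that it runs without a deep learning framework; the same steps map onto framework tensors. The class name `MultiHeadAttention`, the bias-free projections, the random initialization, the `training` flag, and the convention that `mask` is boolean with True meaning "may attend" are all assumptions of this sketch, not part of the question.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max along the axis before exponentiating (numerical stability).
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

class MultiHeadAttention:
    """Hypothetical minimal MHA layer; weights are random, biases omitted."""

    def __init__(self, d_model, h, dropout=0.0, seed=0):
        assert d_model % h == 0, "d_model must be divisible by h"
        assert 0.0 <= dropout < 1.0
        self.h, self.d_k, self.p_drop = h, d_model // h, dropout
        self.rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(d_model)
        # Query, key, value, and output projection matrices.
        self.W_q, self.W_k, self.W_v, self.W_o = (
            self.rng.normal(0.0, scale, size=(d_model, d_model)) for _ in range(4)
        )

    def _split_heads(self, x):
        # (batch, seq, d_model) -> (batch, h, seq, d_k)
        b, s, _ = x.shape
        return x.reshape(b, s, self.h, self.d_k).transpose(0, 2, 1, 3)

    def __call__(self, q, k, v, mask=None, training=False):
        b, s_q, d_model = q.shape
        Q = self._split_heads(q @ self.W_q)
        K = self._split_heads(k @ self.W_k)
        V = self._split_heads(v @ self.W_v)

        # Scaled dot-product scores: (batch, h, s_q, s_k)
        scores = Q @ K.transpose(0, 1, 3, 2) / np.sqrt(self.d_k)
        if mask is not None:
            # Assumed convention: True = keep. A large negative value (not -inf)
            # avoids NaNs if a row happens to be fully masked.
            scores = np.where(mask, scores, -1e9)

        weights = softmax(scores, axis=-1)
        if training and self.p_drop > 0.0:
            # Inverted dropout on the attention weights.
            keep = self.rng.random(weights.shape) >= self.p_drop
            weights = weights * keep / (1.0 - self.p_drop)

        out = weights @ V                                   # (batch, h, s_q, d_k)
        out = out.transpose(0, 2, 1, 3).reshape(b, s_q, d_model)  # concat heads
        return out @ self.W_o, weights

# Usage: self-attention with a causal mask broadcast over batch and heads.
mha = MultiHeadAttention(d_model=8, h=2)
x = np.random.default_rng(1).normal(size=(3, 5, 8))
causal = np.tril(np.ones((5, 5), dtype=bool))
out, w = mha(x, x, x, mask=causal)
print(out.shape, w.shape)
```

Returning the attention weights alongside the output is a convenience for inspection; a padding mask of shape (batch_size, 1, 1, seq_len) would broadcast in the same way as the causal mask here.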

Follow‑up A: Why is the dot‑product attention scaled by sqrt(d_k)?
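
A common line of reasoning for this follow-up can be written out as a short derivation. It assumes, purely for illustration, that the components of a query q and key k are independent with zero mean and unit variance:

$$
\mathbb{E}[q \cdot k] = \sum_{i=1}^{d_k} \mathbb{E}[q_i]\,\mathbb{E}[k_i] = 0,
\qquad
\operatorname{Var}(q \cdot k) = \sum_{i=1}^{d_k} \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2] = d_k,
\qquad
\operatorname{Var}\!\left(\frac{q \cdot k}{\sqrt{d_k}}\right) = 1 .
$$

Under that assumption the unscaled scores have standard deviation sqrt(d_k), so for large d_k the softmax tends to saturate toward a one-hot distribution and its gradients become very small. Dividing by sqrt(d_k) keeps the score variance near 1 regardless of head dimension.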

  2. Nucleus (Top‑p) Sampling Implementation
  • Inputs:
    • Logits over a vocabulary (shape (vocab_size,) or (batch_size, vocab_size))
    • Threshold p in (0, 1]
  • Requirements:
    • Convert logits to probabilities
    • Select the smallest set of tokens whose cumulative probability ≥ p
    • Renormalize the selected probabilities
    • Sample the next token from this set
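
The four steps above can be sketched as follows. This is a minimal NumPy version under stated assumptions: the function name `top_p_sample` and the optional `rng` argument are hypothetical, and a token is kept when the probability mass ranked strictly before it is below p, which retains the token that crosses the threshold and always retains the most likely token.

```python
import numpy as np

def top_p_sample(logits, p=0.9, rng=None):
    """Sample token ids from logits of shape (vocab,) or (batch, vocab)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    rng = np.random.default_rng() if rng is None else rng

    logits = np.asarray(logits, dtype=np.float64)
    squeeze = logits.ndim == 1
    logits = np.atleast_2d(logits)

    # 1. Logits -> probabilities with a max-shifted (stable) softmax.
    z = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(z)
    probs /= probs.sum(axis=-1, keepdims=True)

    # 2. Sort descending and find the smallest prefix with cumulative mass >= p.
    order = np.argsort(-probs, axis=-1)
    sorted_p = np.take_along_axis(probs, order, axis=-1)
    cum = np.cumsum(sorted_p, axis=-1)
    keep = (cum - sorted_p) < p   # mass before this token is still below p

    # 3. Zero out the tail and renormalize the nucleus.
    nucleus = np.where(keep, sorted_p, 0.0)
    nucleus /= nucleus.sum(axis=-1, keepdims=True)

    # 4. Sample a rank per row, then map ranks back to vocabulary ids.
    ranks = [rng.choice(row.size, p=row) for row in nucleus]
    tokens = order[np.arange(len(ranks)), ranks]
    return tokens[0] if squeeze else tokens

# Usage: with probabilities (0.5, 0.3, 0.15, 0.05) and p = 0.7,
# the nucleus is the first two tokens.
rng = np.random.default_rng(0)
logits = np.log(np.array([0.5, 0.3, 0.15, 0.05]))
draws = {int(top_p_sample(logits, p=0.7, rng=rng)) for _ in range(200)}
print(sorted(draws))
```

The per-row Python loop in step 4 keeps the sketch simple; a batched implementation would more likely sample all rows in one vectorized call.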

Follow‑up B: What are the advantages and disadvantages of top‑p sampling compared with top‑k sampling?
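
One frequently cited difference is that top-p adapts the number of candidate tokens to the shape of the distribution, while top-k keeps a fixed count regardless of how confident the model is. The small sketch below illustrates this on two made-up distributions; the helper names `nucleus_size` and `top_k_mass` and the example probabilities are hypothetical.

```python
import numpy as np

def nucleus_size(probs, p):
    # Number of tokens in the smallest descending-sorted prefix with mass >= p.
    sorted_p = np.sort(probs)[::-1]
    cum = np.cumsum(sorted_p)
    return int(np.sum((cum - sorted_p) < p))

def top_k_mass(probs, k):
    # Total probability mass covered by the k most likely tokens.
    return float(np.sort(probs)[::-1][:k].sum())

peaked = np.array([0.90, 0.05, 0.03, 0.01, 0.01])  # model is confident
flat = np.full(5, 0.2)                             # model is uncertain

for name, dist in [("peaked", peaked), ("flat", flat)]:
    print(name,
          "| top-p (p=0.7) keeps", nucleus_size(dist, 0.7), "tokens",
          "| top-k (k=3) covers mass", round(top_k_mass(dist, 3), 2))
```

On the peaked distribution top-p keeps a single token while top-k with k = 3 still admits two unlikely ones; on the flat distribution top-p widens to four tokens while top-k covers only 0.6 of the mass. On the cost side, top-p needs a full sort and cumulative sum over the vocabulary, and on very flat distributions the nucleus can grow large enough to include many low-quality tokens, which is why the two filters are sometimes combined.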
