
Implement attention and nucleus sampling; compare to top-k

Last updated: Mar 29, 2026

Quick Overview

This question evaluates implementation and theoretical understanding of Transformer components—multi-head scaled dot-product attention and nucleus (top-p) sampling—testing competencies in tensor shaping, masking, numerical stability, attention mechanics, and probabilistic sampling within the Machine Learning / Deep Learning domain.

  • hard
  • TikTok
  • Machine Learning
  • Machine Learning Engineer


Interview Round: Technical Screen

Implement multi-head attention from scratch: given query, key, and value tensors of shape (batch_size, seq_len, d_model), the number of heads h, and an optional attention mask, compute the output tensor of shape (batch_size, seq_len, d_model). Include linear projections to head dimensions, scaled dot-product attention with softmax, masking, dropout, concatenation of heads, and the final output projection. Follow-up: Why is the dot-product attention scaled by the square root of the per-head key dimension d_k?

Implement nucleus (top-p) sampling for a language model: given logits over a vocabulary and a threshold p in (0, 1], select the smallest set of tokens whose cumulative probability is at least p, renormalize their probabilities, and sample the next token. Follow-up: What are the advantages and disadvantages of top-p sampling compared with top-k sampling?


Related Interview Questions

  • Design multimodal deployment under compute limits - TikTok (easy)
  • Explain overfitting, dropout, normalization, RL post-training - TikTok (medium)
  • Write self-attention and cross-entropy pseudocode - TikTok (medium)
  • Implement AUC-ROC, softmax, and logistic regression - TikTok (medium)
  • Answer ML fundamentals and diagnostics questions - TikTok (hard)
Asked: Aug 11, 2025

Implement Multi‑Head Attention and Nucleus (Top‑p) Sampling

Context

You are building core components used in Transformer-based language models. Implement multi-head attention (MHA) from scratch and nucleus (top-p) sampling. Assume a deep learning framework (e.g., PyTorch) is available. Clearly handle shapes, masking, and numerical stability.

Tasks

  1. Multi‑Head Attention Implementation
  • Inputs:
    • Query Q, Key K, Value V: tensors of shape (batch_size, seq_len, d_model)
    • Number of heads h (assume d_model % h == 0)
    • Optional attention mask (broadcastable to attention scores shape)
  • Requirements:
    • Linear projections to head dimensions d_k = d_v = d_model / h
    • Scaled dot‑product attention with softmax
    • Support masking (e.g., padding or causal masks)
    • Apply dropout to attention weights
    • Concatenate heads and apply a final output projection
    • Output shape: (batch_size, seq_len, d_model)
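The requirements above can be sketched framework-agnostically. The following is a minimal NumPy implementation (PyTorch's `nn.Linear` layers are replaced by plain weight matrices, which are assumptions of this sketch; dropout is noted but omitted since it only applies during training):

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the row max before exponentiating for numerical stability
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(q, k, v, h, w_q, w_k, w_v, w_o, mask=None):
    """q, k, v: (batch, seq_len, d_model); w_*: (d_model, d_model) projection
    matrices; mask: boolean, broadcastable to (batch, h, seq_len, seq_len),
    True = may attend."""
    batch, seq_len, d_model = q.shape
    d_k = d_model // h  # per-head dimension (d_model assumed divisible by h)

    def project(x, w):
        # linear projection, then split into h heads: (batch, h, seq_len, d_k)
        return (x @ w).reshape(batch, seq_len, h, d_k).transpose(0, 2, 1, 3)

    Q, K, V = project(q, w_q), project(k, w_k), project(v, w_v)
    # scaled dot-product scores: (batch, h, seq_len, seq_len)
    scores = Q @ K.transpose(0, 1, 3, 2) / np.sqrt(d_k)
    if mask is not None:
        # blocked positions get a large negative score -> ~zero softmax weight
        scores = np.where(mask, scores, -1e9)
    weights = softmax(scores, axis=-1)
    # (during training, dropout would be applied to `weights` here)
    out = weights @ V                                   # (batch, h, seq_len, d_k)
    out = out.transpose(0, 2, 1, 3).reshape(batch, seq_len, d_model)  # concat heads
    return out @ w_o                                    # final output projection
```

A useful sanity check is the causal mask: with `mask = np.tril(np.ones((T, T), bool))`, perturbing the last key/value position must leave all earlier output positions unchanged.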

Follow‑up A: Why is the dot‑product attention scaled by sqrt(d_k)?
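One way to motivate the answer empirically: if the components of q and k are i.i.d. with zero mean and unit variance, the dot product q·k has variance d_k, so unscaled logits grow with the head dimension and push the softmax into a saturated, near-one-hot regime with vanishing gradients. A quick NumPy check (the sample sizes are arbitrary assumptions of this sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
for d_k in (4, 64, 1024):
    # q, k with i.i.d. standard-normal components
    q = rng.normal(size=(200_000, d_k))
    k = rng.normal(size=(200_000, d_k))
    dots = (q * k).sum(axis=1)
    # empirical variance of q.k grows like d_k, so dividing the
    # scores by sqrt(d_k) keeps them at unit scale regardless of d_k
    print(d_k, round(dots.var(), 1))
```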

  2. Nucleus (Top‑p) Sampling Implementation
  • Inputs:
    • Logits over a vocabulary (shape (vocab_size,) or (batch_size, vocab_size))
    • Threshold p in (0, 1]
  • Requirements:
    • Convert logits to probabilities
    • Select the smallest set of tokens whose cumulative probability ≥ p
    • Renormalize the selected probabilities
    • Sample the next token from this set
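The steps above can be sketched as follows in NumPy (a minimal single-example version; batching, and tie-handling at the threshold, are left out):

```python
import numpy as np

def nucleus_sample(logits, p, rng):
    """logits: (vocab_size,); p in (0, 1]. Returns a sampled token id."""
    # 1) logits -> probabilities (max-subtraction for numerical stability)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # 2) smallest set of tokens with cumulative probability >= p
    order = np.argsort(probs)[::-1]          # token ids, descending probability
    csum = np.cumsum(probs[order])
    cutoff = np.searchsorted(csum, p) + 1    # length of that smallest prefix
    keep = order[:cutoff]
    # 3) renormalize over the nucleus, 4) sample from it
    kept = probs[keep] / probs[keep].sum()
    return rng.choice(keep, p=kept)
```

With a sharply peaked distribution and a moderate p, the nucleus collapses to the top token; with p = 1 every token remains sampleable.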

Follow‑up B: What are the advantages and disadvantages of top‑p sampling compared with top‑k sampling?
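The central contrast can be shown concretely: top-p adapts the candidate-set size to the shape of the distribution (small when the model is confident, large when it is uncertain), whereas top-k always keeps exactly k tokens, so a fixed k may admit tail noise on peaked distributions or cut off reasonable options on flat ones. The trade-off is that top-p requires a full sort and cumulative sum, and inherits any miscalibration in the model's probabilities. A small illustration (the example distributions are made up for this sketch):

```python
import numpy as np

def nucleus_size(probs, p):
    # number of tokens in the smallest set with cumulative probability >= p
    csum = np.cumsum(np.sort(probs)[::-1])
    return int(np.searchsorted(csum, p) + 1)

peaked = np.array([0.90, 0.05, 0.03, 0.01, 0.01])  # confident model
flat = np.full(5, 0.2)                             # uncertain model
print(nucleus_size(peaked, 0.9))  # 1  -- nucleus shrinks, tail is trimmed
print(nucleus_size(flat, 0.9))    # 5  -- nucleus widens to keep diversity
# top-k with k=3 would behave identically in both cases: it would keep
# low-probability tokens in the peaked case and drop viable ones in the flat case
```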

