Implement attention and nucleus sampling; compare to top-k
Company: TikTok
Role: Machine Learning Engineer
Category: Machine Learning
Difficulty: hard
Interview Round: Technical Screen
Implement multi-head attention from scratch: given query, key, and value tensors of shape (batch_size, seq_len, d_model), the number of heads h, and an optional attention mask, compute the output tensor of shape (batch_size, seq_len, d_model). Include linear projections to per-head dimensions, scaled dot-product attention with softmax, masking, dropout, concatenation of heads, and a final output projection.

Follow-up: Why is the dot-product attention scaled by the square root of d_k, the per-head key dimension?

Implement nucleus (top-p) sampling for a language model: given logits over a vocabulary and a threshold p in (0, 1], select the smallest set of tokens whose cumulative probability is at least p, renormalize their probabilities, and sample the next token.

Follow-up: What are the advantages and disadvantages of top-p sampling compared with top-k sampling?
Quick Answer: This question tests whether you can implement two core Transformer components from scratch, multi-head scaled dot-product attention and nucleus (top-p) sampling, and probes tensor shaping, masking, numerical stability, attention mechanics, and probabilistic decoding.
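A minimal PyTorch sketch of the attention part is below. The mask convention (a tensor that broadcasts over heads, with 0 marking positions to hide) and the placement of dropout on the attention weights are assumptions, since the prompt leaves both open:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    """Multi-head scaled dot-product attention (minimal sketch)."""

    def __init__(self, d_model: int, h: int, dropout: float = 0.1):
        super().__init__()
        assert d_model % h == 0, "d_model must be divisible by h"
        self.h = h
        self.d_k = d_model // h  # per-head dimension
        # Linear projections for queries, keys, values, and the output.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, query, key, value, mask=None):
        # query/key/value: (batch_size, seq_len, d_model)
        batch_size = query.size(0)

        def split_heads(x, proj):
            # (batch, seq_len, d_model) -> (batch, h, seq_len, d_k)
            return proj(x).view(batch_size, -1, self.h, self.d_k).transpose(1, 2)

        q = split_heads(query, self.w_q)
        k = split_heads(key, self.w_k)
        v = split_heads(value, self.w_v)

        # Scaled dot-product attention scores: (batch, h, seq_len, seq_len).
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            # Assumed convention: mask broadcasts over heads; 0 = hidden position.
            scores = scores.masked_fill(mask == 0, float("-inf"))
        attn = self.dropout(F.softmax(scores, dim=-1))

        # Weighted sum of values, then concatenate heads back to d_model.
        out = torch.matmul(attn, v)  # (batch, h, seq_len, d_k)
        out = out.transpose(1, 2).contiguous().view(batch_size, -1, self.h * self.d_k)
        return self.w_o(out)  # (batch, seq_len, d_model)
```

On the scaling follow-up: for d_k-dimensional query and key vectors with zero-mean, unit-variance components, the dot product has variance d_k, so dividing by sqrt(d_k) keeps the softmax inputs in a range where the attention distribution does not saturate and gradients remain usable.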
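A sketch of the sampling part, assuming a 1-D logits tensor for a single next-token distribution (the prompt does not fix the input shape):

```python
import torch
import torch.nn.functional as F

def nucleus_sample(logits: torch.Tensor, p: float) -> int:
    """Top-p (nucleus) sampling over 1-D logits of shape (vocab_size,), p in (0, 1]."""
    probs = F.softmax(logits, dim=-1)
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)

    # Exclusive cumulative mass ahead of each token: a token stays in the
    # nucleus iff the mass before it is still below p, which keeps exactly
    # the smallest prefix whose total probability is at least p.
    exclusive_cumsum = torch.cumsum(sorted_probs, dim=-1) - sorted_probs
    sorted_probs[exclusive_cumsum >= p] = 0.0

    # Renormalize over the nucleus and sample from it.
    sorted_probs /= sorted_probs.sum()
    idx = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_ids[idx].item()
```

A common answer to the comparison follow-up: top-k keeps a fixed number of candidates regardless of how the probability mass is shaped, so it can be too permissive on peaked distributions and too restrictive on flat ones. Top-p adapts the candidate set to the distribution, truncating aggressively when the model is confident and keeping more options when it is uncertain, at the cost of a variable (sometimes large) candidate set and a full sort over the vocabulary at each step.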