You are implementing a Transformer attention layer.
Given:
- Sequence length L and model dimension d_model.
- Number of heads h, where d_head = d_model / h (assume divisible).
- Q, K, V, each of shape (L, d_model).
- Wq, Wk, Wv, each of shape (d_model, d_model), and an output projection Wo of shape (d_model, d_model).
- mask of shape (L, L), where mask[i][j] = 1 means position i may attend to position j, and 0 means it must not.
Compute the multi-head scaled dot-product attention output O of shape (L, d_model):
1. Project the inputs: Q' = QWq, K' = KWk, V' = VWv.
2. Split into h heads (reshape to (h, L, d_head)).
3. For each head, compute scores = (Q_head @ K_head^T) / sqrt(d_head), producing (L, L).
4. Set masked-out positions to -inf (or a very negative number) before the softmax.
5. Compute weights = softmax(scores) over the last dimension.
6. Compute head_out = weights @ V_head, producing (L, d_head) per head.
7. Concatenate the heads back to (L, d_model) and apply Wo.
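A minimal NumPy sketch of these steps, assuming the inputs arrive as NumPy arrays and the mask is the 0/1 matrix described above; the function name and the -1e9 masking constant are illustrative choices, not part of the problem statement:

```python
import numpy as np

def multi_head_attention(Q, K, V, Wq, Wk, Wv, Wo, mask, h):
    """Multi-head scaled dot-product attention (illustrative sketch).

    Q, K, V: (L, d_model); Wq, Wk, Wv, Wo: (d_model, d_model);
    mask: (L, L) with 1 = may attend, 0 = must not; h = number of heads.
    """
    L, d_model = Q.shape
    assert d_model % h == 0, "d_model must be divisible by h"
    d_head = d_model // h

    # Step 1: linear projections, each (L, d_model)
    Qp, Kp, Vp = Q @ Wq, K @ Wk, V @ Wv

    # Step 2: split into heads, (L, d_model) -> (h, L, d_head)
    def split_heads(X):
        return X.reshape(L, h, d_head).transpose(1, 0, 2)

    Qh, Kh, Vh = split_heads(Qp), split_heads(Kp), split_heads(Vp)

    # Step 3: scaled dot-product scores per head, (h, L, L)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)

    # Step 4: mask out disallowed positions with a very negative number
    scores = np.where(mask[None, :, :] == 1, scores, -1e9)

    # Step 5: softmax over the last (key) dimension, numerically stabilized
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    # Step 6: weighted sum of values, (h, L, d_head)
    head_out = weights @ Vh

    # Step 7: concatenate heads back to (L, d_model) and apply Wo
    concat = head_out.transpose(1, 0, 2).reshape(L, d_model)
    return concat @ Wo
```

Using a large negative constant rather than literal -inf is a defensive choice: if a row of the mask were all zeros, -inf would turn the whole softmax row into NaNs, while -1e9 degrades to a uniform row instead.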
Clarify in your explanation:
- How the shapes at each step depend on L, d_model, and h.
- The edge case of a single-position sequence (L=1).
You are implementing a next-token sampler for an LLM.
Given:
- logits of length V (vocabulary size) for the next token.
- temperature > 0, optional top_k (integer), optional top_p (0 < p ≤ 1), and a seed for reproducibility.
Implement a function that returns a sampled token id.
Requirements:
Discuss how you would handle:
- The behavior as temperature → 0.
- The case where both top_k and top_p are provided.
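A minimal NumPy sketch of one possible implementation; the greedy fallback for near-zero temperatures, the choice to apply top_k before top_p, and the function name are assumptions made here for illustration, not requirements of the problem:

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, seed=None):
    """Sample a token id from next-token logits of length V.

    Assumed conventions (not dictated by the prompt): temperatures below a
    tiny epsilon fall back to argmax, and top_k is applied before top_p.
    """
    rng = np.random.default_rng(seed)            # seeded for reproducibility
    logits = np.asarray(logits, dtype=np.float64)

    # Near-zero temperature: sampling degenerates to greedy argmax.
    if temperature < 1e-6:
        return int(np.argmax(logits))

    scaled = logits / temperature

    # Optional top-k: keep only the k highest logits.
    if top_k is not None and 0 < top_k < scaled.size:
        kth = np.sort(scaled)[-top_k]
        scaled = np.where(scaled >= kth, scaled, -np.inf)

    # Softmax (numerically stable).
    scaled = scaled - scaled.max()
    probs = np.exp(scaled)
    probs /= probs.sum()

    # Optional top-p (nucleus): keep the smallest set of tokens whose
    # cumulative probability reaches p, then renormalize.
    if top_p is not None and top_p < 1.0:
        order = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[order])
        cutoff = np.searchsorted(cumulative, top_p) + 1  # keep at least one token
        keep = order[:cutoff]
        filtered = np.zeros_like(probs)
        filtered[keep] = probs[keep]
        probs = filtered / filtered.sum()

    return int(rng.choice(probs.size, p=probs))
```

When both filters are given, this sketch truncates to the top k logits first and then computes the nucleus over that already-truncated distribution; the reverse order is also seen in practice, so whichever convention you pick should be stated explicitly.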