You are implementing a Transformer attention layer.
Given:
- Sequence length L and model dimension d_model.
- Number of heads h, where d_head = d_model / h (assume divisible).
- Q, K, V, each of shape (L, d_model).
- Wq, Wk, Wv, each of shape (d_model, d_model), and an output projection Wo of shape (d_model, d_model).
- mask of shape (L, L), where mask[i][j] = 1 means position i may attend to position j, and 0 means it must not.
Compute the multi-head scaled dot-product attention output O of shape (L, d_model):
1. Project the inputs: Q' = QWq, K' = KWk, V' = VWv.
2. Split into h heads (reshape to (h, L, d_head)).
3. For each head, compute scores = (Q_head @ K_head^T) / sqrt(d_head), producing (L, L).
4. Set masked-out positions to -inf (or a very negative number) before the softmax.
5. Compute weights = softmax(scores) over the last dimension.
6. Compute head_out = weights @ V_head, producing (L, d_head) per head.
7. Concatenate the heads back to (L, d_model) and apply Wo.
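A minimal NumPy sketch of these steps, assuming the inputs arrive as NumPy arrays and the mask is the 0/1 matrix described above; the function name and the -1e9 masking constant are illustrative choices, not part of the problem statement:

```python
import numpy as np

def multi_head_attention(Q, K, V, Wq, Wk, Wv, Wo, mask, h):
    """Multi-head scaled dot-product attention (illustrative sketch).

    Q, K, V: (L, d_model); Wq, Wk, Wv, Wo: (d_model, d_model);
    mask: (L, L) with 1 = may attend, 0 = must not; h = number of heads.
    """
    L, d_model = Q.shape
    assert d_model % h == 0, "d_model must be divisible by h"
    d_head = d_model // h

    # Step 1: linear projections, each (L, d_model)
    Qp, Kp, Vp = Q @ Wq, K @ Wk, V @ Wv

    # Step 2: split into heads, (L, d_model) -> (h, L, d_head)
    def split_heads(X):
        return X.reshape(L, h, d_head).transpose(1, 0, 2)

    Qh, Kh, Vh = split_heads(Qp), split_heads(Kp), split_heads(Vp)

    # Step 3: scaled dot-product scores per head, (h, L, L)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)

    # Step 4: mask out disallowed positions with a very negative number
    scores = np.where(mask[None, :, :] == 1, scores, -1e9)

    # Step 5: softmax over the last (key) dimension, numerically stabilized
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    # Step 6: weighted sum of values, (h, L, d_head)
    head_out = weights @ Vh

    # Step 7: concatenate heads back to (L, d_model) and apply Wo
    concat = head_out.transpose(1, 0, 2).reshape(L, d_model)
    return concat @ Wo
```

Using a large negative constant rather than literal -inf is a defensive choice: if a row of the mask were all zeros, -inf would turn the whole softmax row into NaNs, while -1e9 degrades to a uniform row instead.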
Clarify in your explanation:
- How the shapes at each step depend on L, d_model, and h.
- The edge case of a single-position sequence (L=1).
You are implementing a next-token sampler for an LLM.
Given:
- logits of length V (vocabulary size) for the next token.
- temperature > 0, optional top_k (integer), optional top_p (0 < p ≤ 1), and a seed for reproducibility.
Implement a function that returns a sampled token id.
Requirements:
Discuss how you would handle:
- The behavior as temperature → 0.
- The case where both top_k and top_p are provided.
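A minimal NumPy sketch of one possible implementation; the greedy fallback for near-zero temperatures, the choice to apply top_k before top_p, and the function name are assumptions made here for illustration, not requirements of the problem:

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, seed=None):
    """Sample a token id from next-token logits of length V.

    Assumed conventions (not dictated by the prompt): temperatures below a
    tiny epsilon fall back to argmax, and top_k is applied before top_p.
    """
    rng = np.random.default_rng(seed)            # seeded for reproducibility
    logits = np.asarray(logits, dtype=np.float64)

    # Near-zero temperature: sampling degenerates to greedy argmax.
    if temperature < 1e-6:
        return int(np.argmax(logits))

    scaled = logits / temperature

    # Optional top-k: keep only the k highest logits.
    if top_k is not None and 0 < top_k < scaled.size:
        kth = np.sort(scaled)[-top_k]
        scaled = np.where(scaled >= kth, scaled, -np.inf)

    # Softmax (numerically stable).
    scaled = scaled - scaled.max()
    probs = np.exp(scaled)
    probs /= probs.sum()

    # Optional top-p (nucleus): keep the smallest set of tokens whose
    # cumulative probability reaches p, then renormalize.
    if top_p is not None and top_p < 1.0:
        order = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[order])
        cutoff = np.searchsorted(cumulative, top_p) + 1  # keep at least one token
        keep = order[:cutoff]
        filtered = np.zeros_like(probs)
        filtered[keep] = probs[keep]
        probs = filtered / filtered.sum()

    return int(rng.choice(probs.size, p=probs))
```

When both filters are given, this sketch truncates to the top k logits first and then computes the nucleus over that already-truncated distribution; the reverse order is also seen in practice, so whichever convention you pick should be stated explicitly.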