You are asked to explain core Transformer / deep learning components.
Part A — Self-attention pseudocode
Write clear pseudocode (not full code) for scaled dot-product self-attention for a single attention head. Your pseudocode should include:
- Inputs/outputs and tensor shapes (batch size B, sequence length T, model dim d_model, head dim d_k)
- Computing Q, K, V via linear projections
- Computing attention logits and applying scaling
- Softmax and weighted sum
- (Optional but recommended) Handling an attention mask (padding mask or causal mask)
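
For orientation, here is a minimal NumPy sketch of the steps above. The function name self_attention, the projection names W_q, W_k, W_v, and the mask convention (True where attention is allowed) are illustrative assumptions, not a required interface; your answer can stay at the pseudocode level.

```python
import numpy as np

def self_attention(x, W_q, W_k, W_v, mask=None):
    """Single-head scaled dot-product self-attention (illustrative sketch).

    x:             [B, T, d_model] input embeddings
    W_q, W_k, W_v: [d_model, d_k]  projection matrices
    mask:          optional boolean array broadcastable to [B, T, T],
                   True where attention is allowed, False where blocked
    returns:       [B, T, d_k] attended values
    """
    d_k = W_q.shape[-1]

    # Linear projections of the inputs into queries, keys, and values
    Q = x @ W_q  # [B, T, d_k]
    K = x @ W_k  # [B, T, d_k]
    V = x @ W_v  # [B, T, d_k]

    # Attention logits, scaled by 1 / sqrt(d_k)
    logits = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # [B, T, T]

    # Blocked positions (padding or future tokens) get -inf before softmax
    if mask is not None:
        logits = np.where(mask, logits, -np.inf)

    # Softmax over the key dimension (numerically stabilized)
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # [B, T, T]

    # Weighted sum of values
    return weights @ V  # [B, T, d_k]
```
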
Part B — Cross-entropy pseudocode
Write pseudocode for multi-class cross-entropy loss for a batch of examples, given:
- Model logits z of shape [B, C]
- Ground-truth labels, either as class indices [B] or one-hot [B, C]
- Return a scalar loss (mean over batch)
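
For reference, a minimal NumPy sketch assuming labels are given as class indices (a one-hot matrix can be reduced to indices with one_hot.argmax(axis=-1)); the name cross_entropy is illustrative:

```python
import numpy as np

def cross_entropy(z, labels):
    """Mean multi-class cross-entropy loss (illustrative sketch).

    z:      [B, C] logits
    labels: [B]    integer class indices
    returns a scalar: the negative log-likelihood averaged over the batch
    """
    # Numerically stable log-softmax: subtract the row-wise max before exp
    z = z - z.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))  # [B, C]

    # Negative log-probability assigned to the true class of each example
    B = z.shape[0]
    nll = -log_probs[np.arange(B), labels]  # [B]

    # Mean over the batch gives the scalar loss
    return nll.mean()
```
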
Part C — Concept questions
- In a Transformer block, what is the role of the position-wise feed-forward network (FFN) relative to attention? Why is it needed? (A short sketch of what "position-wise" means follows these questions.)
- Why do we scale the dot-product attention scores by 1 / sqrt(d_k) before applying softmax? What problem does it address? (A small numeric demo follows below.)
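
To make the first question concrete, here is a minimal sketch of what "position-wise" means; the names W1, b1, W2, b2, d_ff and the ReLU nonlinearity are illustrative assumptions:

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward network (illustrative sketch).

    x:  [B, T, d_model]
    W1: [d_model, d_ff], b1: [d_ff]
    W2: [d_ff, d_model], b2: [d_model]
    The same two-layer MLP is applied to every position independently;
    unlike attention, no information moves between positions here.
    """
    h = np.maximum(0.0, x @ W1 + b1)  # ReLU hidden layer, [B, T, d_ff]
    return h @ W2 + b2                # back to [B, T, d_model]
```
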
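For the second question, a quick NumPy experiment that illustrates the behavior the scaling is meant to control; the sample size and dimensions are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
for d_k in (4, 64, 1024):
    # Random unit-variance queries and keys
    q = rng.standard_normal((10_000, d_k))
    k = rng.standard_normal((10_000, d_k))

    dots = (q * k).sum(axis=-1)    # raw dot products
    scaled = dots / np.sqrt(d_k)   # scaled dot products

    print(f"d_k={d_k:5d}  raw std={dots.std():7.2f}  scaled std={scaled.std():5.2f}")

# The std of the raw dot products grows like sqrt(d_k), pushing softmax toward
# saturation and vanishing gradients; after scaling it stays near 1 for any d_k.
```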