This question evaluates understanding of Transformer internals—specifically scaled dot-product self-attention (including Q/K/V projections and attention masks), multi-class cross-entropy loss computation, and conceptual roles like the position-wise feed-forward network and attention score scaling—in the Machine Learning / deep learning domain.
You are asked to explain core Transformer / deep learning components.
Write clear pseudocode (not full code) for scaled dot-product self-attention for a single attention head. Your pseudocode should include:
- Input tensor shapes (batch size `B`, sequence length `T`, model dim `d_model`, head dim `d_k`)
- Computing `Q, K, V` via linear projections
- Attention scores scaled by `1 / sqrt(d_k)`, application of an optional attention mask, softmax, and the weighted sum over `V`
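For reference, a minimal NumPy sketch of one possible answer (function name, argument layout, and the `-1e9` masking constant are illustrative choices, not the only correct ones):

```python
import numpy as np

def scaled_dot_product_attention(x, W_q, W_k, W_v, mask=None):
    """Single-head scaled dot-product self-attention.

    x:             input of shape [B, T, d_model]
    W_q, W_k, W_v: projection matrices of shape [d_model, d_k]
    mask:          optional boolean array of shape [T, T]; True = masked out
    """
    d_k = W_q.shape[1]
    Q = x @ W_q                                          # [B, T, d_k]
    K = x @ W_k                                          # [B, T, d_k]
    V = x @ W_v                                          # [B, T, d_k]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)     # [B, T, T]
    if mask is not None:
        scores = np.where(mask, -1e9, scores)            # ~0 weight after softmax
    # numerically stable softmax over the last axis
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V                                   # [B, T, d_k]
```

A causal mask would be `np.triu(np.ones((T, T), dtype=bool), k=1)`, which blocks each position from attending to later positions.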
Write pseudocode for multi-class cross-entropy loss for a batch of examples, given:
- Logits `z` of shape `[B, C]`
- Targets as class indices of shape `[B]` or one-hot vectors of shape `[B, C]`
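For reference, a minimal NumPy sketch of one possible answer, using the max-subtraction trick for numerical stability (the function name is illustrative):

```python
import numpy as np

def cross_entropy(z, targets):
    """Mean cross-entropy over a batch.

    z:       logits of shape [B, C]
    targets: class indices of shape [B], or one-hot rows of shape [B, C]
    """
    # log-softmax with max subtraction for numerical stability
    z = z - z.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    if targets.ndim == 1:                       # class indices
        picked = log_probs[np.arange(z.shape[0]), targets]
    else:                                       # one-hot rows
        picked = (targets * log_probs).sum(axis=1)
    return -picked.mean()
```

Sanity check: with uniform logits over `C` classes, the loss is `log(C)` regardless of the target.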
Why do we scale the attention scores by `1 / sqrt(d_k)` before applying softmax? What problem does it address?
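The effect the question is after can be demonstrated numerically: for query and key entries with unit variance, the raw dot product has variance roughly `d_k`, pushing softmax into a saturated, vanishing-gradient regime for large heads, while dividing by `sqrt(d_k)` keeps the variance near 1. A small simulation (sample sizes and dimensions are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
raw_var, scaled_var = {}, {}
for d_k in (4, 64, 1024):
    q = rng.standard_normal((10000, d_k))
    k = rng.standard_normal((10000, d_k))
    raw = (q * k).sum(axis=1)          # dot products of unit-variance vectors
    scaled = raw / np.sqrt(d_k)
    raw_var[d_k] = raw.var()           # grows roughly linearly with d_k
    scaled_var[d_k] = scaled.var()     # stays near 1 for every d_k
    print(d_k, raw_var[d_k], scaled_var[d_k])
```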