Transformer Self-Attention: Q, K, V, Multi-Head, and Positional Encoding
Context: You are given a sequence of token embeddings X (length n, model dimension d_model). Focus on the scaled dot-product self-attention inside a Transformer block.
Answer the following:
- Define the query (Q), key (K), and value (V) matrices:
  - How are Q, K, V produced from the input embeddings?
  - What information does each carry?
  - What specifically does the V matrix represent, and how is it used after the attention weights are computed?
  - At a high level, how do similarity scores become attention weights, and how do those weights become outputs?
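As a concrete reference point for the questions above, here is a minimal NumPy sketch of scaled dot-product self-attention. The weight matrices are random stand-ins for the learned projections (the names `W_q`, `W_k`, `W_v` are illustrative, not from any particular library):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the row max before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d_model, d_k = 4, 8, 8        # sequence length, model dim, key/query dim

X = rng.standard_normal((n, d_model))    # token embeddings (one row per token)

# Learned linear projections (random here, for illustration only).
W_q = rng.standard_normal((d_model, d_k))
W_k = rng.standard_normal((d_model, d_k))
W_v = rng.standard_normal((d_model, d_k))

Q, K, V = X @ W_q, X @ W_k, X @ W_v      # queries, keys, values

scores = Q @ K.T / np.sqrt(d_k)          # scaled dot-product similarities
weights = softmax(scores, axis=-1)       # attention weights; each row sums to 1
output = weights @ V                     # each output row is a weighted sum of value vectors

print(output.shape)                      # (4, 8)
```

Note how V enters only after the softmax: the scores and weights are computed entirely from Q and K, and the weights then mix the rows of V to produce each output position.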
- Compare Transformers to RNNs/LSTMs:
  - How do Transformers address the sequential-dependency and long-range-context limitations of recurrent models?
- Briefly outline multi-head attention and positional encoding:
  - What are they, and why are they needed?
  - When do they matter at inference time (e.g., generation/caching, positional schemes)?
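The last two topics can also be sketched in a few lines of NumPy: multiple heads each run scaled dot-product attention with their own projections and are concatenated, while sinusoidal positional encodings are added to the embeddings so attention can distinguish token order. This is an illustrative sketch with random weights, not a faithful implementation of any particular model:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sinusoidal_positions(n, d_model):
    # Fixed sinusoidal positional encoding: sin on even dims, cos on odd dims,
    # with geometrically spaced frequencies across dimension pairs.
    pos = np.arange(n)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((n, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def multi_head_attention(X, num_heads, rng):
    n, d_model = X.shape
    d_k = d_model // num_heads           # each head works in a smaller subspace
    head_outputs = []
    for _ in range(num_heads):
        # Per-head projections (random stand-ins for learned weights).
        W_q, W_k, W_v = (rng.standard_normal((d_model, d_k)) for _ in range(3))
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
        head_outputs.append(weights @ V)
    concat = np.concatenate(head_outputs, axis=-1)   # back to (n, d_model)
    W_o = rng.standard_normal((d_model, d_model))    # final output projection
    return concat @ W_o

rng = np.random.default_rng(0)
n, d_model = 6, 16
# Positional encodings are added to the embeddings before attention.
X = rng.standard_normal((n, d_model)) + sinusoidal_positions(n, d_model)
Y = multi_head_attention(X, num_heads=4, rng=rng)
print(Y.shape)                                       # (6, 16)
```

Without the positional term, swapping two input rows would merely swap the corresponding output rows, since attention by itself is permutation-equivariant; the added encodings are what make position visible to the model.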