This question evaluates implementation and analytical skills for scaled dot-product multi-head self-attention and an encoder-style Transformer block: manual forward- and backward-pass computations, gradient derivation for the projection matrices, causal masking for autoregressive next-token prediction, numerical stability of softmax, and time/memory complexity analysis. It falls within the Machine Learning domain and emphasizes practical implementation grounded in the underlying linear algebra and backpropagation mechanics. It is commonly asked to verify mastery of attention mechanisms and gradient reasoning without autograd, along with the ability to compute cross-entropy and perplexity, perform finite-difference gradient checks, and reason about algorithmic trade-offs.
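As a point of reference for the stability and gradient-checking parts, here is a minimal NumPy sketch of a numerically stable softmax, its manual backward as a Jacobian-vector product, and a central-difference gradient check. The function names, the toy loss, and the NumPy setting are assumptions made for illustration, not part of the original prompt.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max before exponentiating so exp() cannot overflow;
    # the result is mathematically identical to the naive softmax.
    z = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)

def softmax_backward(dout, s, axis=-1):
    # Jacobian-vector product of softmax: dx = s * (dout - sum(dout * s)).
    return s * (dout - np.sum(dout * s, axis=axis, keepdims=True))

def finite_difference_check(f, x, dx_analytic, eps=1e-5):
    # Compare an analytic gradient of a scalar function f against central differences.
    num = np.zeros_like(x)
    it = np.nditer(x, flags=["multi_index"])
    while not it.finished:
        i = it.multi_index
        orig = x[i]
        x[i] = orig + eps; fp = f(x)
        x[i] = orig - eps; fm = f(x)
        x[i] = orig
        num[i] = (fp - fm) / (2 * eps)
        it.iternext()
    return np.max(np.abs(num - dx_analytic)) / (np.max(np.abs(num)) + 1e-12)

# Example (hypothetical): check softmax_backward through the scalar loss sum(softmax(x) * w).
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 5))
w = rng.normal(size=(3, 5))
loss = lambda v: np.sum(softmax(v) * w)
dx = softmax_backward(w, softmax(x))
print("relative error:", finite_difference_check(loss, x.copy(), dx))
```

A relative error on the order of 1e-7 or smaller in double precision is a common rule of thumb for passing such a check.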
Context: Build multi-head self-attention and a Transformer encoder-style block from scratch for autoregressive next-token prediction. Implement both forward and backward passes manually (no autograd). A numerically stable softmax and its backward are provided.
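To illustrate the forward pass the context describes, a minimal NumPy sketch of multi-head scaled dot-product self-attention with a causal mask follows. The shapes, weight names, and single-sequence setting are assumptions for the sketch rather than the question's required interface.

```python
import numpy as np

def stable_softmax(x, axis=-1):
    z = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)

def causal_mhsa_forward(X, Wq, Wk, Wv, Wo, num_heads):
    """Multi-head self-attention with a causal mask (illustrative sketch).

    X: (T, d_model) token representations for one sequence (assumed shape).
    Wq, Wk, Wv, Wo: (d_model, d_model) projection matrices (assumed shapes).
    """
    T, d_model = X.shape
    d_head = d_model // num_heads

    # Project and split into heads: (num_heads, T, d_head).
    def split(M):
        return (X @ M).reshape(T, num_heads, d_head).transpose(1, 0, 2)
    Q, K, V = split(Wq), split(Wk), split(Wv)

    # Scaled dot-product scores: (num_heads, T, T).
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)

    # Causal mask: position t may only attend to positions <= t.
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(mask, -1e9, scores)

    A = stable_softmax(scores, axis=-1)          # attention weights
    heads = A @ V                                # (num_heads, T, d_head)
    concat = heads.transpose(1, 0, 2).reshape(T, d_model)
    return concat @ Wo                           # (T, d_model)
```

The T×T score matrices make this O(T²·d_model) in time and O(num_heads·T²) in memory per sequence, which is the usual starting point for the complexity discussion the question asks for.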
Assume: