This question evaluates a candidate's practical implementation skills and conceptual understanding of multi-head self-attention, including query/key/value projections, head-wise tensor reshaping, masking behavior, and considerations for numerical stability and computational complexity.
You are given an input tensor X with shape (batch_size, seq_len, d_model). Implement a multi-head self-attention layer (forward pass) using PyTorch or NumPy that projects X into queries, keys, and values, splits them into h heads, applies scaled dot-product attention with an optional mask, and concatenates the heads through a final output projection. Assume d_model is divisible by h.
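One possible answer is sketched below in PyTorch. It is a minimal reference, not the only acceptable solution; the module name MultiHeadSelfAttention and the projection names w_q, w_k, w_v, w_o are chosen for illustration, and mask is assumed to be an optional boolean tensor that broadcasts against the attention-score shape (batch_size, h, seq_len, seq_len).

```python
import math
from typing import Optional

import torch
import torch.nn as nn


class MultiHeadSelfAttention(nn.Module):
    """Minimal multi-head self-attention (forward pass only)."""

    def __init__(self, d_model: int, h: int):
        super().__init__()
        assert d_model % h == 0, "d_model must be divisible by h"
        self.d_model = d_model
        self.h = h
        self.d_head = d_model // h
        # Learned projections for queries, keys, values, and the output.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor, mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        # x: (batch_size, seq_len, d_model)
        batch_size, seq_len, _ = x.shape

        def split_heads(t: torch.Tensor) -> torch.Tensor:
            # (batch, seq, d_model) -> (batch, h, seq, d_head)
            return t.view(batch_size, seq_len, self.h, self.d_head).transpose(1, 2)

        q = split_heads(self.w_q(x))
        k = split_heads(self.w_k(x))
        v = split_heads(self.w_v(x))

        # Scaled dot-product scores: (batch, h, seq_len, seq_len).
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_head)

        # Mask out disallowed positions (padding or future tokens)
        # by setting their scores to -inf before the softmax.
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))

        # Softmax over the key dimension.
        attn = torch.softmax(scores, dim=-1)

        # Weighted sum of values, then merge the heads back to d_model.
        context = torch.matmul(attn, v)  # (batch, h, seq_len, d_head)
        context = context.transpose(1, 2).contiguous().view(batch_size, seq_len, self.d_model)
        return self.w_o(context)


if __name__ == "__main__":
    # Tiny smoke test with a causal (lower-triangular) mask.
    x = torch.randn(2, 5, 64)
    layer = MultiHeadSelfAttention(d_model=64, h=8)
    causal = torch.tril(torch.ones(5, 5, dtype=torch.bool))  # broadcasts over batch and heads
    print(layer(x, mask=causal).shape)  # torch.Size([2, 5, 64])
```

On the stability and complexity points the question raises: PyTorch's softmax subtracts the row-wise maximum internally, so the -inf masking plus built-in softmax is numerically stable, and the explicit (seq_len, seq_len) score matrix makes the quadratic cost visible: roughly O(seq_len^2 * d_model) time and O(h * seq_len^2) memory per batch element.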