Implement Multi-Head Self-Attention
Company: Uber
Role: Machine Learning Engineer
Category: Machine Learning
Difficulty: medium
Interview Round: Technical Screen
Implement a **multi-head self-attention** module in **PyTorch** without using `torch.nn.MultiheadAttention`.
Requirements:
- Input tensor shape: `(batch_size, seq_len, d_model)`
- Number of heads: `num_heads`, where `d_model % num_heads == 0`
- Use learned linear projections to produce **Q**, **K**, and **V** from the same input
- Split the projected tensors into multiple heads
- Compute scaled dot-product attention for each head:
`Attention(Q, K, V) = softmax(Q K^T / sqrt(head_dim)) V`
- Support an optional attention mask so masked positions receive zero attention weight (their scores are set to `-inf` before the softmax)
- Concatenate all heads and apply a final output projection
- Return a tensor of shape `(batch_size, seq_len, d_model)`
You may assume standard PyTorch building blocks such as `nn.Linear`, `torch.matmul`, `softmax`, `view`, and `transpose` are available.
Explain any important tensor shape transformations and common implementation pitfalls.
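A minimal reference sketch follows, assuming a boolean mask convention in which nonzero/`True` entries are kept and zero/`False` entries are masked out; the class and variable names are illustrative choices, not prescribed by the question:

```python
import math
from typing import Optional

import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads
        # Learned projections for Q, K, V plus the final output projection
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor, mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        batch_size, seq_len, d_model = x.shape

        def split_heads(t: torch.Tensor) -> torch.Tensor:
            # (B, S, d_model) -> (B, num_heads, S, head_dim)
            return t.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)

        q = split_heads(self.q_proj(x))
        k = split_heads(self.k_proj(x))
        v = split_heads(self.v_proj(x))

        # Scaled dot-product attention scores: (B, H, S, S)
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        if mask is not None:
            # mask must be broadcastable to (B, H, S, S); zero/False entries are masked out
            scores = scores.masked_fill(mask == 0, float("-inf"))
        attn = F.softmax(scores, dim=-1)
        out = torch.matmul(attn, v)  # (B, H, S, head_dim)

        # Merge heads: (B, H, S, head_dim) -> (B, S, d_model)
        out = out.transpose(1, 2).contiguous().view(batch_size, seq_len, d_model)
        return self.out_proj(out)
```

Common pitfalls worth mentioning: omitting the `1/sqrt(head_dim)` scaling, applying the mask after the softmax instead of setting masked scores to `-inf` before it, and calling `.view(...)` on the transposed output without `.contiguous()`, which raises a runtime error because `transpose` returns a non-contiguous tensor.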
Quick Answer: This question evaluates understanding of multi-head self-attention and the ability to implement a transformer attention module from scratch, using learned Q/K/V projections, head-wise tensor reshaping, attention masking, and core PyTorch tensor operations.
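As a quick shape check for the sketch above; the tensor sizes and the causal mask here are illustrative:

```python
mha = MultiHeadSelfAttention(d_model=64, num_heads=8)
x = torch.randn(2, 10, 64)  # (batch_size=2, seq_len=10, d_model=64)

# Causal (autoregressive) mask: position i may attend only to positions <= i.
# Shape (seq_len, seq_len) broadcasts against the (B, H, S, S) score tensor.
causal = torch.tril(torch.ones(10, 10, dtype=torch.bool))
out = mha(x, mask=causal)
print(out.shape)  # torch.Size([2, 10, 64])
```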