Implement multi-head self-attention correctly
Company: Apple
Role: Software Engineer
Category: Machine Learning
Difficulty: hard
Interview Round: Technical Screen
Quick Answer: This question evaluates a candidate's practical implementation skills and conceptual understanding of multi-head self-attention: the query/key/value projections, splitting and merging heads via tensor reshaping, masking behavior, numerically stable softmax computation, and awareness of the quadratic cost in sequence length.
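The components the answer lists can be sketched in plain NumPy. This is an illustrative single-example (unbatched) sketch, not any framework's API: the weight layout (one `(d_model, d_model)` matrix per projection), the `stable_softmax` helper, and the `causal` flag are assumptions chosen for clarity.

```python
import numpy as np

def stable_softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, Wq, Wk, Wv, Wo, num_heads, causal=False):
    # x: (seq_len, d_model); Wq, Wk, Wv, Wo: (d_model, d_model). Illustrative layout.
    seq_len, d_model = x.shape
    d_head = d_model // num_heads  # assumes d_model is divisible by num_heads

    # Q/K/V projections.
    q, k, v = x @ Wq, x @ Wk, x @ Wv

    # Head-wise reshape: (seq_len, d_model) -> (num_heads, seq_len, d_head).
    def split_heads(t):
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    q, k, v = map(split_heads, (q, k, v))

    # Scaled dot-product scores: (num_heads, seq_len, seq_len).
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)

    if causal:
        # Mask future positions with -inf so softmax assigns them zero weight.
        future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)

    weights = stable_softmax(scores, axis=-1)
    out = weights @ v                                        # (heads, seq, d_head)
    out = out.transpose(1, 0, 2).reshape(seq_len, d_model)   # merge heads
    return out @ Wo                                          # output projection
```

A common follow-up probes exactly the details this sketch exercises: why scores are divided by `sqrt(d_head)`, why the mask must be applied before the softmax, and how the reshape/transpose pair splits and merges heads without copying feature dimensions across heads.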