This question evaluates both implementation skill and theoretical understanding of two Transformer components: multi-head scaled dot-product attention and nucleus (top-p) sampling. It tests competencies in tensor shaping, masking, numerical stability, attention mechanics, and probabilistic sampling within the Machine Learning / Deep Learning domain.

You are building core components used in Transformer-based language models. Implement multi-head attention (MHA) from scratch and nucleus (top-p) sampling. Assume a deep learning framework (e.g., PyTorch) is available. Clearly handle shapes, masking, and numerical stability.
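A minimal reference sketch in PyTorch, under assumptions not fixed by the question: inputs of shape (batch, seq, d_model), a boolean mask (True = attend) broadcastable to (batch, num_heads, q_len, k_len), and illustrative names of my own choosing (MultiHeadAttention, nucleus_sample). A real solution might differ in mask convention, dropout placement, or KV caching.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiHeadAttention(nn.Module):
    """Multi-head scaled dot-product attention over (batch, seq, d_model) tensors."""

    def __init__(self, d_model: int, num_heads: int, dropout: float = 0.0):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # Separate projections for queries, keys and values, plus an output projection.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, query, key, value, mask=None):
        # query: (batch, q_len, d_model); key/value: (batch, k_len, d_model).
        # mask: bool, broadcastable to (batch, num_heads, q_len, k_len), True = attend.
        batch, q_len, _ = query.shape
        k_len = key.shape[1]

        def split_heads(x, seq_len):
            # (batch, seq, d_model) -> (batch, num_heads, seq, d_head)
            return x.view(batch, seq_len, self.num_heads, self.d_head).transpose(1, 2)

        q = split_heads(self.w_q(query), q_len)
        k = split_heads(self.w_k(key), k_len)
        v = split_heads(self.w_v(value), k_len)

        # Scaled dot-product scores: (batch, num_heads, q_len, k_len).
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        if mask is not None:
            # A large finite negative value (rather than -inf) keeps softmax from
            # producing NaNs if an entire row happens to be masked out.
            scores = scores.masked_fill(~mask, torch.finfo(scores.dtype).min)
        attn = self.dropout(F.softmax(scores, dim=-1))

        out = attn @ v  # (batch, num_heads, q_len, d_head)
        out = out.transpose(1, 2).contiguous().view(batch, q_len, -1)
        return self.w_o(out)


@torch.no_grad()
def nucleus_sample(logits, p: float = 0.9, temperature: float = 1.0):
    # logits: (batch, vocab_size). Returns sampled token ids of shape (batch, 1).
    probs = F.softmax(logits / temperature, dim=-1)

    # Sort descending and keep the smallest prefix whose cumulative mass reaches p.
    sorted_probs, sorted_idx = torch.sort(probs, dim=-1, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Drop tokens whose preceding mass already exceeds p; the token that crosses
    # the threshold is kept, so at least the top-1 token always survives.
    drop = (cumulative - sorted_probs) > p
    sorted_probs = sorted_probs.masked_fill(drop, 0.0)

    # Renormalise the surviving nucleus, sample within it, then map back to vocab ids.
    sorted_probs = sorted_probs / sorted_probs.sum(dim=-1, keepdim=True)
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_idx.gather(-1, choice)
```

Under these assumptions, mha = MultiHeadAttention(512, 8) called as mha(x, x, x, mask=causal_mask) gives causal self-attention, and nucleus_sample(logits, p=0.9) draws one token id per batch row. The large-negative fill instead of -inf is a deliberate numerical-stability choice for rows that end up fully masked.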
Follow-up A: Why is dot-product attention scaled by sqrt(d_k)? (A brief variance sketch follows below.)
Follow-up B: What are the advantages and disadvantages of top-p sampling compared with top-k sampling?
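For Follow-up A, the standard variance argument, sketched under the assumption that the query and key components q_i, k_i are mutually independent with zero mean and unit variance:

```latex
\[
\operatorname{Var}(q \cdot k)
  = \operatorname{Var}\!\Bigl(\sum_{i=1}^{d_k} q_i k_i\Bigr)
  = \sum_{i=1}^{d_k} \operatorname{Var}(q_i)\,\operatorname{Var}(k_i)
  = d_k,
\qquad
\operatorname{Var}\!\left(\frac{q \cdot k}{\sqrt{d_k}}\right) = 1.
\]
```

Dividing by sqrt(d_k) therefore keeps the score variance independent of the head dimension, preventing the softmax from saturating into near-one-hot distributions with vanishing gradients as d_k grows.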