This question evaluates a candidate's competency with attention mechanisms in autoregressive Transformers: implementing causal and padding masks, combining them in multi-head attention, and diagnosing masking-related training anomalies.
Context: You are implementing decoder self-attention for an autoregressive Transformer where input sequences in a batch are right-padded to a common length. You must prevent each position from attending to future tokens and exclude padded positions from being attended to.
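For reference, below is a minimal sketch of one way to build and apply a combined causal-plus-padding mask before the softmax. PyTorch, the function names, and the tensor shapes are assumptions for illustration, not specified by the question.

```python
# Sketch only (PyTorch assumed): combining a causal mask with a right-padding
# mask for decoder self-attention. Names and shapes are illustrative.
import torch
import torch.nn.functional as F


def combined_attention_mask(pad_mask: torch.Tensor) -> torch.Tensor:
    """pad_mask: (batch, seq_len) bool, True at real tokens, False at padding.
    Returns a (batch, 1, seq_len, seq_len) bool mask, True where attention is allowed."""
    batch, seq_len = pad_mask.shape
    # Causal mask: query position i may attend only to key positions j <= i.
    causal = torch.tril(
        torch.ones(seq_len, seq_len, dtype=torch.bool, device=pad_mask.device)
    )
    # Padding mask applied on the key axis: no query may attend to a padded key.
    key_mask = pad_mask[:, None, None, :]            # (batch, 1, 1, seq_len)
    return causal[None, None, :, :] & key_mask        # (batch, 1, seq_len, seq_len)


def masked_attention(q, k, v, pad_mask):
    """q, k, v: (batch, heads, seq_len, d_head); pad_mask: (batch, seq_len) bool."""
    d_head = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_head ** 0.5   # (batch, heads, seq, seq)
    allowed = combined_attention_mask(pad_mask)         # broadcasts over the head dim
    scores = scores.masked_fill(~allowed, float("-inf"))  # block future and padded keys
    weights = F.softmax(scores, dim=-1)
    return weights @ v


if __name__ == "__main__":
    torch.manual_seed(0)
    # Batch of 2 sequences right-padded to length 4; second has 2 real tokens.
    pad_mask = torch.tensor([[True, True, True, True],
                             [True, True, False, False]])
    q = k = v = torch.randn(2, 1, 4, 8)                 # 1 head, d_head = 8
    out = masked_attention(q, k, v, pad_mask)
    print(out.shape)                                     # torch.Size([2, 1, 4, 8])
```

Note that in this sketch padded query rows still produce finite attention weights, because the causal mask leaves at least the first key unmasked; their outputs are assumed to be ignored by the loss, which is one of the masking-related anomalies a candidate might be asked to reason about.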
Task:
Assumptions: