This question evaluates implementation-level mastery of transformer architectures, including multi-head masked self-attention, position-wise feed-forward networks, residual connections, and precise tensor-shape reasoning for a decoder-only language model, within the Coding & Algorithms domain of deep learning and neural network engineering.
Implement a simplified decoder-only Transformer language model (similar in spirit to GPT) for next-token prediction. The implementation should be modular, using four main classes:
- `MultiHeadAttention`
- `FeedForward`
- `DecoderLayer`
- `GPT` (the full model)
You may assume a deep learning framework (e.g., PyTorch or TensorFlow), but you should clearly specify tensor shapes and operations.
Assume the following notation:

- `B`: batch size
- `T`: sequence length
- `d_model`: model (embedding) dimension
- `num_heads`: number of attention heads (assume `d_model` is divisible by `num_heads`)
- `V`: vocabulary size
**(MultiHeadAttention)** Implement a class `MultiHeadAttention` that performs masked self-attention:
- Input `x`: Tensor of shape `(B, T, d_model)` (token embeddings or layer outputs).
- Compute queries `Q`, keys `K`, and values `V` for all heads:

1. Use linear projections of `x` into `Q`, `K`, and `V`, each of shape `(B, T, d_model)`.
2. Split into `num_heads` heads, each of dimension `d_head = d_model / num_heads`.
3. Compute scaled dot-product attention per head: `softmax(Q @ K^T / sqrt(d_head) + M) @ V`, where `M` is a **causal mask** that prevents a position from attending to future positions (positions `> t`).
4. Concatenate the outputs of all heads back to shape `(B, T, d_model)`.
5. Apply a final linear projection to return to `(B, T, d_model)`.

- Output: Tensor of shape `(B, T, d_model)`.
Ensure you correctly implement the causal (autoregressive) mask so that position t can only attend to positions 0..t.
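A minimal sketch of one possible `MultiHeadAttention` implementation, assuming PyTorch (either framework mentioned above would do); the parameter names (`q_proj`, `out_proj`, etc.) are illustrative rather than required:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    """Masked multi-head self-attention (illustrative sketch)."""
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # Linear projections producing Q, K, V for all heads at once.
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, d_model = x.shape
        # Project and reshape to (B, num_heads, T, d_head).
        q = self.q_proj(x).view(B, T, self.num_heads, self.d_head).transpose(1, 2)
        k = self.k_proj(x).view(B, T, self.num_heads, self.d_head).transpose(1, 2)
        v = self.v_proj(x).view(B, T, self.num_heads, self.d_head).transpose(1, 2)
        # Scaled dot-product scores: (B, num_heads, T, T).
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        # Causal mask: position t may only attend to positions 0..t.
        causal = torch.tril(torch.ones(T, T, device=x.device)).bool()
        scores = scores.masked_fill(~causal, float("-inf"))
        attn = F.softmax(scores, dim=-1)
        out = attn @ v                                               # (B, num_heads, T, d_head)
        out = out.transpose(1, 2).contiguous().view(B, T, d_model)   # concatenate heads
        return self.out_proj(out)                                    # (B, T, d_model)
```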
**(FeedForward)** Implement a class `FeedForward` for the position-wise MLP:
- Input `x`: Tensor of shape `(B, T, d_model)`.
- Apply a linear layer expanding `d_model` to `d_ff` (e.g., `4 * d_model`), a nonlinearity, and a second linear layer projecting `d_ff` back to `d_model`.
- Output: Tensor of shape `(B, T, d_model)`.
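A corresponding `FeedForward` sketch under the same PyTorch assumption; GELU is chosen here as the nonlinearity, though ReLU is equally standard:

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise MLP: d_model -> d_ff -> d_model (illustrative sketch)."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),   # expand: d_model -> d_ff (e.g., 4 * d_model)
            nn.GELU(),                  # nonlinearity (ReLU is also common)
            nn.Linear(d_ff, d_model),   # project back: d_ff -> d_model
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Applied independently at every position: (B, T, d_model) -> (B, T, d_model).
        return self.net(x)
```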
**(DecoderLayer)** Implement a single Transformer decoder block that combines attention, feed-forward, residual connections, and layer normalization:
- Input `x`: Tensor of shape `(B, T, d_model)`.

1. Apply layer normalization `ln1` before self-attention.
2. Apply the `MultiHeadAttention` block (masked self-attention).
3. Add the residual connection: `x = x + attention_output`.
4. Apply layer normalization `ln2` before the feed-forward block.
5. Apply the `FeedForward` block.
6. Add the residual connection: `x = x + ff_output`.

- Output: Tensor of shape `(B, T, d_model)`.
Your DecoderLayer should be reusable so that multiple layers can be stacked.
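One way the `DecoderLayer` could look, reusing the `MultiHeadAttention` and `FeedForward` classes sketched above and following the pre-norm ordering described (layer normalization before each sub-block, residual added after):

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """Pre-norm decoder block: LN -> attention -> residual, LN -> MLP -> residual (sketch)."""
    def __init__(self, d_model: int, num_heads: int, d_ff: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = MultiHeadAttention(d_model, num_heads)  # from the sketch above
        self.ln2 = nn.LayerNorm(d_model)
        self.ff = FeedForward(d_model, d_ff)                # from the sketch above

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(self.ln1(x))   # masked self-attention + residual
        x = x + self.ff(self.ln2(x))     # position-wise MLP + residual
        return x                         # (B, T, d_model)
```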
**(GPT)** Implement a `GPT` class representing the full decoder-only Transformer:
- Input `input_ids`: Integer tensor of shape `(B, T)` representing token indices.
- Components:
  - A token embedding layer mapping token indices to vectors of dimension `d_model`.
  - Positional embeddings of shape `(T, d_model)` added to the token embeddings.
  - A stack of `N` identical `DecoderLayer` blocks.
  - A final linear projection from `d_model` to vocabulary size `V`.
- Forward pass:
  1. Embed the tokens and add the positional embeddings to obtain a tensor of shape `(B, T, d_model)`.
  2. Pass it through the `N` decoder layers sequentially.
  3. Apply the final linear projection to obtain logits of shape `(B, T, V)`.
- Output: Logits tensor of shape `(B, T, V)`.
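A possible `GPT` sketch tying the pieces together, again assuming PyTorch and the classes above. `max_T` (a maximum sequence length for learned positional embeddings) and the final layer norm `ln_f` are assumptions beyond the spec, included because they are common in GPT-style pre-norm models:

```python
import torch
import torch.nn as nn

class GPT(nn.Module):
    """Decoder-only Transformer language model (illustrative sketch)."""
    def __init__(self, vocab_size: int, d_model: int, num_heads: int,
                 d_ff: int, num_layers: int, max_T: int):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)   # token indices -> d_model vectors
        self.pos_emb = nn.Embedding(max_T, d_model)        # learned positional embeddings
        self.layers = nn.ModuleList(
            [DecoderLayer(d_model, num_heads, d_ff) for _ in range(num_layers)]
        )
        self.ln_f = nn.LayerNorm(d_model)                  # final norm (assumed, common with pre-norm blocks)
        self.lm_head = nn.Linear(d_model, vocab_size)      # d_model -> V

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        B, T = input_ids.shape
        pos = torch.arange(T, device=input_ids.device)     # (T,)
        x = self.tok_emb(input_ids) + self.pos_emb(pos)    # (B, T, d_model), positions broadcast over B
        for layer in self.layers:
            x = layer(x)                                   # (B, T, d_model)
        x = self.ln_f(x)
        return self.lm_head(x)                             # logits: (B, T, V)
```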
Provide a `forward` method for each class with the correct tensor transformations.
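A hypothetical smoke test, assuming the sketches above, that checks the shape flow end to end:

```python
import torch

B, T, V = 2, 16, 1000
model = GPT(vocab_size=V, d_model=64, num_heads=4, d_ff=256, num_layers=2, max_T=128)
input_ids = torch.randint(0, V, (B, T))   # (B, T) integer token indices
logits = model(input_ids)                 # (B, T, V) next-token logits at every position
assert logits.shape == (B, T, V)
```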