Implement a Transformer Block with SwiGLU
Company: Google
Role: Machine Learning Engineer
Category: Coding & Algorithms
Difficulty: medium
Interview Round: Technical Screen
Quick Answer: This question tests whether you can implement the internals of a Transformer block and reason about them: multi-head self-attention, a SwiGLU gated feed-forward layer, residual connections, and layer normalization.
Constraints
- 1 <= batch_size <= 10
- 1 <= sequence_length <= 30
- 1 <= hidden_dim <= 64
- hidden_dim % num_heads == 0
- w_q, w_k, w_v, w_o are all (hidden_dim x hidden_dim)
- w_gate and w_value have the same shape (hidden_dim x ffn_dim)
- w_ffn_out has shape (ffn_dim x hidden_dim)
- Every token vector has non-zero variance across the hidden dimension
- If a mask is provided, every query position has at least one unmasked key
Examples
Input: ([[[1, 0], [0, 1]]], 1, [[1, 0], [0, 1]], [[1, 0], [0, 1]], [[1, 0], [0, 1]], [[1, 0], [0, 1]], [[0, 0], [0, 0]], [[0, 0], [0, 0]], [[0, 0], [0, 0]], [1, 1], [0, 0], [1, 1], [0, 0], None)
Expected Output: [[[1.888386, -0.888386], [-0.888386, 1.888386]]]
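Explanation: Layer normalization turns the two tokens into [1, -1] and [-1, 1]. With identity Q/K/V/O and one head, the scaled scores are 2/sqrt(2) for a token with itself and -2/sqrt(2) for the other token, so each token attends mostly to itself (weight about 0.944). The attention output for token 0 is therefore about [0.888386, -0.888386], which the residual adds to the input [1, 0]. The feed-forward weights are all zero, so the feed-forward sublayer contributes nothing.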
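The arithmetic in this explanation can be checked by hand. The sketch below is a minimal plain-Python check, assuming population variance (divide by the hidden size), no epsilon in the layer norm, and scores scaled by the square root of the head dimension. These conventions are not stated in the problem text but they reproduce the expected output. The helper names `normalize` and `dot` are hypothetical.

```python
import math

def normalize(token):
    """Layer norm with gamma = 1 and beta = 0 (population variance, no epsilon)."""
    mean = sum(token) / len(token)
    var = sum((v - mean) ** 2 for v in token) / len(token)
    return [(v - mean) / math.sqrt(var) for v in token]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

tokens = [[1.0, 0.0], [0.0, 1.0]]
n0, n1 = normalize(tokens[0]), normalize(tokens[1])   # [1, -1] and [-1, 1]

# Identity Q/K/V/O: queries, keys, and values all equal the normalized tokens.
scale = math.sqrt(len(n0))        # one head, so head_dim = hidden_dim = 2
s_self = dot(n0, n0) / scale      # 2 / sqrt(2)
s_other = dot(n0, n1) / scale     # -2 / sqrt(2)
w_self = math.exp(s_self) / (math.exp(s_self) + math.exp(s_other))  # about 0.944193

# Weighted sum of the values, then the residual connection for token 0.
attn0 = [w_self * a + (1.0 - w_self) * b for a, b in zip(n0, n1)]
out0 = [x + a for x, a in zip(tokens[0], attn0)]
print([round(v, 6) for v in out0])   # [1.888386, -0.888386]
```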
Input: ([[[1, 0], [0, 1]]], 1, [[1, 0], [0, 1]], [[1, 0], [0, 1]], [[1, 0], [0, 1]], [[1, 0], [0, 1]], [[0, 0], [0, 0]], [[0, 0], [0, 0]], [[0, 0], [0, 0]], [1, 1], [0, 0], [1, 1], [0, 0], [[1, 0], [1, 1]])
Expected Output: [[[2.0, -1.0], [-0.888386, 1.888386]]]
Explanation: The mask blocks token 0 from attending to token 1, so token 0 attends only to itself and adds its normalized vector [1, -1] to its input [1, 0], giving [2, -1]. Token 1 can still attend to both positions, so its output is the same as in the unmasked example.
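A minimal sketch of the masking step for this example follows. It assumes that a mask entry of 1 means "may attend" and 0 means "blocked", with rows indexing queries and columns indexing keys, which is what this example implies. The function name `masked_softmax` and the fill value -1e9 are illustrative choices, not part of the problem statement.

```python
import math

def masked_softmax(scores, mask_row, blocked=-1e9):
    """Softmax over one query's scores, with blocked keys pushed to a huge negative."""
    masked = [s if keep else blocked for s, keep in zip(scores, mask_row)]
    peak = max(masked)                              # subtract the max for stability
    exps = [math.exp(s - peak) for s in masked]     # blocked entries underflow to 0.0
    total = sum(exps)
    return [e / total for e in exps]

scale = math.sqrt(2)                                # head_dim = 2
mask = [[1, 0], [1, 1]]
row0 = masked_softmax([2 / scale, -2 / scale], mask[0])   # [1.0, 0.0]
row1 = masked_softmax([-2 / scale, 2 / scale], mask[1])   # about [0.055807, 0.944193]

# Token 0 now copies its own value vector [1, -1]; the residual adds it to [1, 0].
values = [[1.0, -1.0], [-1.0, 1.0]]
attn0 = [sum(w * v[c] for w, v in zip(row0, values)) for c in range(2)]
out0 = [x + a for x, a in zip([1.0, 0.0], attn0)]
print(out0)   # [2.0, -1.0]
```

Masking before the softmax, rather than zeroing weights afterwards, keeps each row of attention weights summing to 1.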
Input: ([[[1, -1]]], 1, [[0, 0], [0, 0]], [[0, 0], [0, 0]], [[0, 0], [0, 0]], [[0, 0], [0, 0]], [[1, 0], [0, 1]], [[1, 0], [0, 1]], [[1, 0], [0, 1]], [1, 1], [0, 0], [1, 1], [0, 0], None)
Expected Output: [[[1.731059, -0.731059]]]
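Explanation: Attention is zero because all attention projection matrices are zero, so the first residual leaves the input unchanged. The normalized token is [1, -1], and the gate, value, and output projections are all identity, so SwiGLU computes swish([1, -1]) * [1, -1] = [0.731059, -0.268941] * [1, -1] = [0.731059, 0.268941], which is added back through the residual.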
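The SwiGLU arithmetic above can be reproduced in a few lines. This is a minimal sketch that takes swish(x) = x * sigmoid(x) and relies on the identity gate, value, and output projections of this example, so the matrix multiplications are omitted. The variable names are illustrative.

```python
import math

def swish(x):
    """swish(x) = x * sigmoid(x), written as x / (1 + e^-x)."""
    return x / (1.0 + math.exp(-x))

normed = [1.0, -1.0]          # layer norm of the input token [1, -1]
gate = value = normed         # identity w_gate and w_value

# Elementwise gating: swish(gate) * value.
hidden = [swish(g) * v for g, v in zip(gate, value)]   # about [0.731059, 0.268941]

# Identity w_ffn_out, then the residual connection.
out = [x + h for x, h in zip([1.0, -1.0], hidden)]
print([round(v, 6) for v in out])   # [1.731059, -0.731059]
```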
Input: ([[[1, 2, 3, 4]]], 2, [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]], [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]], [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]], [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]], [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]], [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]], [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]], [1, 1, 1, 1], [0, 0, 0, 0], [1, 1, 1, 1], [0, 0, 0, 0], None)
Expected Output: [[[-0.341641, 1.552786, 3.447214, 5.341641]]]
Explanation: This checks the multi-head split with hidden_dim=4 and num_heads=2. With a single token, each head attends only to that token, so the attention output is exactly the normalized input, which is added back through the residual.
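The head split in this example can be sketched with plain slicing. This assumes each head takes a contiguous slice of the hidden dimension, which is the usual convention but is not spelled out in the problem. With a single token the softmax over one score is always 1, so each head simply returns its own value slice.

```python
import math

token = [1.0, 2.0, 3.0, 4.0]
mean = sum(token) / len(token)
var = sum((v - mean) ** 2 for v in token) / len(token)     # population variance 1.25
normed = [(v - mean) / math.sqrt(var) for v in token]

num_heads = 2
head_dim = len(normed) // num_heads
# Identity projections: each head's value vector is a contiguous slice of `normed`.
heads = [normed[h * head_dim:(h + 1) * head_dim] for h in range(num_heads)]

# One token means one key per head, so the attention weight is exactly 1.0.
head_outputs = [[1.0 * v for v in head] for head in heads]

# Concatenate the heads back to hidden_dim, then apply the residual.
concat = [v for head in head_outputs for v in head]
out = [x + a for x, a in zip(token, concat)]
print([round(v, 6) for v in out])   # [-0.341641, 1.552786, 3.447214, 5.341641]
```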
Hints
- After computing Q, K, and V with shape (B, T, D), reshape each to (B, T, H, D/H) and transpose to (B, H, T, D/H) so attention is computed independently per head.
- Apply the mask to the attention score matrix before softmax by replacing blocked positions with a very large negative number.
- For SwiGLU, compute the gate and value projections from the same normalized input, then use swish(gate) * value.
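Putting the hints together, one possible dependency-free sketch of the whole block is shown below. Several things here are assumptions inferred from the examples rather than stated in the problem:

- the block is pre-norm (normalize, apply the sublayer, then add the residual)
- the four vectors after w_ffn_out are the scale and shift of the first and second layer norm; the names `ln1_gamma`, `ln1_beta`, `ln2_gamma`, `ln2_beta` are hypothetical
- the mask is a (T x T) matrix with 1 for allowed and 0 for blocked
- the layer norm uses population variance with a negligible epsilon

In plain Python, slicing columns `lo:hi` per head plays the role of the reshape and transpose described in the first hint.

```python
import math

def matmul(a, b):
    """Multiply a (n x k) by b (k x m), both given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def layer_norm(rows, gamma, beta, eps=1e-12):
    """Normalize each row over the hidden dimension (eps value is an assumption)."""
    out = []
    for row in rows:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        inv_std = 1.0 / math.sqrt(var + eps)
        out.append([(v - mean) * inv_std * g + b for v, g, b in zip(row, gamma, beta)])
    return out

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(x, num_heads, w_q, w_k, w_v, w_o, mask=None):
    """Multi-head self-attention for one sequence x of shape (T x D)."""
    t, d = len(x), len(x[0])
    head_dim = d // num_heads
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    concat = [[0.0] * d for _ in range(t)]
    for h in range(num_heads):
        lo, hi = h * head_dim, (h + 1) * head_dim   # this head's column slice
        for i in range(t):
            scores = []
            for j in range(t):
                s = sum(a * b for a, b in zip(q[i][lo:hi], k[j][lo:hi]))
                s /= math.sqrt(head_dim)
                if mask is not None and not mask[i][j]:
                    s = -1e9                         # blocked position
                scores.append(s)
            weights = softmax(scores)
            for c in range(lo, hi):
                concat[i][c] = sum(w * v[j][c] for j, w in enumerate(weights))
    return matmul(concat, w_o)

def swiglu_ffn(x, w_gate, w_value, w_ffn_out):
    """swish(x @ w_gate) * (x @ w_value), projected back by w_ffn_out."""
    gate, value = matmul(x, w_gate), matmul(x, w_value)
    # math.exp(-g) is fine for the small inputs here; very negative g would overflow.
    hidden = [[g / (1.0 + math.exp(-g)) * u for g, u in zip(g_row, u_row)]
              for g_row, u_row in zip(gate, value)]
    return matmul(hidden, w_ffn_out)

def add(a, b):
    return [[p + q for p, q in zip(r1, r2)] for r1, r2 in zip(a, b)]

def transformer_block(x, num_heads, w_q, w_k, w_v, w_o, w_gate, w_value, w_ffn_out,
                      ln1_gamma, ln1_beta, ln2_gamma, ln2_beta, mask=None):
    out = []
    for seq in x:   # process each batch element independently
        normed = layer_norm(seq, ln1_gamma, ln1_beta)
        h = add(seq, attention(normed, num_heads, w_q, w_k, w_v, w_o, mask))
        normed = layer_norm(h, ln2_gamma, ln2_beta)
        out.append(add(h, swiglu_ffn(normed, w_gate, w_value, w_ffn_out)))
    return out
```

The explicit loops keep the sketch close to the nested-list inputs used in the examples. A NumPy or PyTorch version would replace them with batched matrix multiplications over a (B, H, T, D/H) tensor.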