What difficulty level is this coding question?

This is a medium-difficulty Coding & Algorithms question, commonly asked during Technical Screen rounds at Google.

What role is this question designed for?

This question is commonly asked for Machine Learning Engineer candidates at Google during technical interviews.

Implement a Transformer Block with SwiGLU

Quick Overview

This question evaluates implementation and reasoning skills for Transformer internals, testing competency in multi-head self-attention, SwiGLU gated feed-forward layers, residual connections, and layer normalization within the Machine Learning / deep learning domain.

Company: Google

Role: Machine Learning Engineer

Category: Coding & Algorithms

Difficulty: medium

Interview Round: Technical Screen

Implement a Transformer-style neural network block in Python using either NumPy or PyTorch. Your implementation should include:

1. **Multi-head self-attention**
   - Input tensor shape: `(batch_size, sequence_length, hidden_dim)`.
   - Split the hidden dimension into `num_heads` attention heads.
   - Compute scaled dot-product attention.
   - Support an optional attention mask.
   - Concatenate heads and apply an output projection.
2. **SwiGLU feed-forward network**
   - Implement a gated feed-forward layer using the SwiGLU formulation: `SwiGLU(x) = Swish(xW_gate) * (xW_value)`, followed by an output projection.
   - The gate and value projections should be computed in parallel.
   - Use `Swish(z) = z * sigmoid(z)`.
3. **Transformer block structure**
   - Include residual connections.
   - Include layer normalization.
   - Return an output tensor with the same shape as the input.

Be prepared to explain tensor shapes, masking behavior, and the time and memory complexity of multi-head attention.
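For the complexity discussion, a rough operation count can be a useful starting point. The sketch below is a back-of-the-envelope estimate rather than a measurement: the function name `attention_cost` and its dictionary keys are hypothetical, and it counts only multiply-accumulates for the four projections and the two attention matrix products, ignoring softmax and normalization.

```python
def attention_cost(batch_size, seq_len, hidden_dim, num_heads):
    """Rough multiply-accumulate and memory counts for multi-head self-attention.

    Hypothetical helper for illustration; assumes hidden_dim % num_heads == 0.
    """
    d_head = hidden_dim // num_heads
    # Q, K, V and output projections: four (T x D) @ (D x D) products per batch item.
    projection_macs = 4 * batch_size * seq_len * hidden_dim * hidden_dim
    # Q @ K^T and weights @ V: each is T * T * d_head per head.
    attention_macs = 2 * batch_size * num_heads * seq_len * seq_len * d_head
    # The score/weight matrix holds one entry per (batch, head, query, key).
    attention_matrix_entries = batch_size * num_heads * seq_len * seq_len
    return {
        "projection_macs": projection_macs,
        "attention_macs": attention_macs,
        "attention_matrix_entries": attention_matrix_entries,
    }


small = attention_cost(batch_size=1, seq_len=4, hidden_dim=8, num_heads=2)
longer = attention_cost(batch_size=1, seq_len=8, hidden_dim=8, num_heads=2)
print(small)
print(longer["attention_matrix_entries"] / small["attention_matrix_entries"])
```

Doubling `seq_len` quadruples both the attention products and the stored score matrix, which is the quadratic term usually highlighted: time grows as O(B·T²·D + B·T·D²) and the attention matrix takes O(B·H·T²) memory.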

Implement a Transformer-style block in Python using NumPy. Do not use high-level attention or feed-forward APIs. The block must follow this pre-normalization structure:

1. y = x + MultiHeadSelfAttention(LN1(x))
2. out = y + SwiGLUFeedForward(LN2(y))

Details:

- Input tensor x has shape (batch_size, sequence_length, hidden_dim).
- Multi-head self-attention must:
  - project x into Q, K, V using weight matrices,
  - split hidden_dim evenly across num_heads,
  - compute scaled dot-product attention,
  - apply an optional mask before softmax,
  - concatenate all heads,
  - apply the output projection.
- The SwiGLU feed-forward network must compute:
  - gate = x @ W_gate
  - value = x @ W_value
  - swish(gate) = gate * sigmoid(gate)
  - hidden = swish(gate) * value
  - output = hidden @ W_ffn_out
- Layer normalization is applied over the last dimension only.
- For deterministic grading, use layer normalization without epsilon:
  LN(t) = ((t - mean(t)) / sqrt(var(t))) * gamma + beta
  All test inputs are chosen so var(t) > 0 for every token.
- There are no bias terms and no dropout.
- The returned tensor must have the same shape as the input and each value must be rounded to 6 decimal places.
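One possible way to put the pieces above together is sketched below. It is not an official solution, and several details are assumptions rather than statements from the problem: the function names, the argument order (inferred from the example inputs), the mask convention (0 blocks a key, nonzero allows it, as the second example suggests), the use of `-1e9` as the "very large negative number", and the population variance in layer normalization.

```python
import numpy as np


def layer_norm(t, gamma, beta):
    # Normalize over the last axis only; no epsilon, as the statement requires.
    mean = t.mean(axis=-1, keepdims=True)
    var = t.var(axis=-1, keepdims=True)  # population variance (assumed)
    return (t - mean) / np.sqrt(var) * gamma + beta


def softmax(scores):
    # Subtracting the row maximum avoids overflow without changing the result.
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)


def multi_head_self_attention(x, num_heads, w_q, w_k, w_v, w_o, mask=None):
    batch, seq_len, hidden = x.shape
    d_head = hidden // num_heads

    def split_heads(t):
        # (B, T, D) -> (B, H, T, D/H)
        return t.reshape(batch, seq_len, num_heads, d_head).transpose(0, 2, 1, 3)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(d_head)  # (B, H, T, T)
    if mask is not None:
        mask = np.asarray(mask)
        if mask.ndim == 3:  # assumed (B, T, T): add a head axis for broadcasting
            mask = mask[:, None, :, :]
        # Assumed convention: 0 blocks a key position, nonzero allows it.
        scores = np.where(mask != 0, scores, -1e9)
    heads = softmax(scores) @ v  # (B, H, T, D/H)
    concat = heads.transpose(0, 2, 1, 3).reshape(batch, seq_len, hidden)
    return concat @ w_o


def swiglu_ffn(x, w_gate, w_value, w_ffn_out):
    gate = x @ w_gate
    value = x @ w_value
    swish = gate / (1.0 + np.exp(-gate))  # gate * sigmoid(gate)
    return (swish * value) @ w_ffn_out


def _arr(a):
    return np.asarray(a, dtype=float)


def transformer_block(x, num_heads, w_q, w_k, w_v, w_o, w_gate, w_value,
                      w_ffn_out, gamma1, beta1, gamma2, beta2, mask=None):
    # Argument order is inferred from the example inputs, not stated in the problem.
    x = _arr(x)
    attn = multi_head_self_attention(
        layer_norm(x, _arr(gamma1), _arr(beta1)), num_heads,
        _arr(w_q), _arr(w_k), _arr(w_v), _arr(w_o), mask)
    y = x + attn
    ffn = swiglu_ffn(layer_norm(y, _arr(gamma2), _arr(beta2)),
                     _arr(w_gate), _arr(w_value), _arr(w_ffn_out))
    return np.round(y + ffn, 6).tolist()
```

Keeping layer normalization, attention, and the feed-forward network as separate functions makes each one easy to check against the worked examples on its own.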

Constraints

1 <= batch_size <= 10
1 <= sequence_length <= 30
1 <= hidden_dim <= 64
hidden_dim % num_heads == 0
w_q, w_k, w_v, w_o are all (hidden_dim x hidden_dim)
w_gate and w_value have the same shape (hidden_dim x ffn_dim)
w_ffn_out has shape (ffn_dim x hidden_dim)
Every token vector has non-zero variance across the hidden dimension
If a mask is provided, every query position has at least one unmasked key

Examples

Input: ([[[1, 0], [0, 1]]], 1, [[1, 0], [0, 1]], [[1, 0], [0, 1]], [[1, 0], [0, 1]], [[1, 0], [0, 1]], [[0, 0], [0, 0]], [[0, 0], [0, 0]], [[0, 0], [0, 0]], [1, 1], [0, 0], [1, 1], [0, 0], None)

Expected Output: [[[1.888386, -0.888386], [-0.888386, 1.888386]]]

Explanation: Layer normalization turns the two tokens into [1, -1] and [-1, 1]. With identity Q/K/V/O and one head, each token attends mostly to itself (softmax weight of about 0.944), so token 0 receives about 0.888386 * [1, -1] from attention. The feed-forward weights are all zero, so only the attention residual changes the input.
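The arithmetic behind this example can be traced by hand for token 0. The snippet below is only an illustrative check using the standard library; the variable names are made up for this walkthrough.

```python
import math

d_head = 2  # one head, so the head size equals hidden_dim

# Normalized tokens: token 0 is [1, -1], token 1 is [-1, 1]; Q = K = V with identity weights.
self_score = (1 * 1 + (-1) * (-1)) / math.sqrt(d_head)   # q0 . k0 / sqrt(d_head)
other_score = (1 * (-1) + (-1) * 1) / math.sqrt(d_head)  # q0 . k1 / sqrt(d_head)

# Softmax over the two scores.
denom = math.exp(self_score) + math.exp(other_score)
w_self = math.exp(self_score) / denom
w_other = math.exp(other_score) / denom

# Weighted sum of the value vectors, then the residual connection with x0 = [1, 0].
attn0 = [w_self * 1 + w_other * (-1), w_self * (-1) + w_other * 1]
out0 = [1 + attn0[0], 0 + attn0[1]]
print(round(w_self, 6), [round(v, 6) for v in out0])
```

Token 1 follows by symmetry, which is why the two output rows mirror each other.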

Input: ([[[1, 0], [0, 1]]], 1, [[1, 0], [0, 1]], [[1, 0], [0, 1]], [[1, 0], [0, 1]], [[1, 0], [0, 1]], [[0, 0], [0, 0]], [[0, 0], [0, 0]], [[0, 0], [0, 0]], [1, 1], [0, 0], [1, 1], [0, 0], [[1, 0], [1, 1]])

Expected Output: [[[2.0, -1.0], [-0.888386, 1.888386]]]

Explanation: The mask blocks token 0 from attending to token 1, so token 0 attends only to itself and adds [1, -1] to the residual. Token 1 can still attend to both positions.
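The effect of the mask in this example can be reproduced with a few lines of NumPy. This is a sketch that assumes 0 means "blocked" and 1 means "allowed" (which is what the expected output implies) and uses `-1e9` as the large negative fill value.

```python
import numpy as np

x = np.array([[1.0, 0.0], [0.0, 1.0]])
normalized = np.array([[1.0, -1.0], [-1.0, 1.0]])  # LN1 output; identity weights give Q = K = V
mask = np.array([[1, 0], [1, 1]])                  # row = query position, column = key position

scores = normalized @ normalized.T / np.sqrt(2.0)
scores = np.where(mask != 0, scores, -1e9)         # blocked keys get a huge negative score
scores = scores - scores.max(axis=-1, keepdims=True)
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

out = np.round(x + weights @ normalized, 6)        # residual; the feed-forward part is zero here
print(weights)
print(out)
```

Row 0 of `weights` collapses to all weight on token 0, while row 1 is unchanged from the unmasked case.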

Input: ([[[1, -1]]], 1, [[0, 0], [0, 0]], [[0, 0], [0, 0]], [[0, 0], [0, 0]], [[0, 0], [0, 0]], [[1, 0], [0, 1]], [[1, 0], [0, 1]], [[1, 0], [0, 1]], [1, 1], [0, 0], [1, 1], [0, 0], None)

Expected Output: [[[1.731059, -0.731059]]]

Explanation: Attention is zero because all attention projection matrices are zero. The normalized token is [1, -1], so SwiGLU computes swish([1, -1]) * [1, -1] = [0.731059, 0.268941], which is added back through the residual.
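The SwiGLU step in this example is small enough to verify directly. The following is an illustrative check with the standard library; `swish` is a helper defined here, not a library function.

```python
import math


def swish(z):
    # swish(z) = z * sigmoid(z)
    return z / (1.0 + math.exp(-z))


x = [1.0, -1.0]
normalized = [1.0, -1.0]  # LN2 output; attention contributed nothing, so y == x

# W_gate, W_value and W_ffn_out are identity matrices in this example.
gate = normalized
value = normalized
hidden = [swish(g) * v for g, v in zip(gate, value)]
out = [xi + h for xi, h in zip(x, hidden)]
print([round(h, 6) for h in hidden], [round(o, 6) for o in out])
```

Note that swish(-1) is negative, and multiplying it by the value -1 is what makes the second hidden entry positive.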

Input: ([[[1, 2, 3, 4]]], 2, [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]], [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]], [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]], [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]], [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]], [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]], [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]], [1, 1, 1, 1], [0, 0, 0, 0], [1, 1, 1, 1], [0, 0, 0, 0], None)

Expected Output: [[[-0.341641, 1.552786, 3.447214, 5.341641]]]

Explanation: This checks the multi-head split with hidden_dim=4 and num_heads=2. With a single token, each head attends only to that token, so the attention output is exactly the normalized input, which is added back through the residual.
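This example also pins down which variance the layer normalization uses. The check below is a sketch under the assumption that the grader uses the population variance (NumPy's default, `ddof=0`), which is the choice that reproduces the expected output.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])

# Population variance: mean 2.5, var 1.25.
normalized = (x - x.mean()) / np.sqrt(x.var())
out = np.round(x + normalized, 6)  # attention returns the normalized token unchanged

# For comparison only: the sample variance (ddof=1) gives a different result.
out_sample_var = np.round(x + (x - x.mean()) / np.sqrt(x.var(ddof=1)), 6)
print(out)
print(out_sample_var)
```

Only the `ddof=0` version matches the expected output for this example.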

Hints

After computing Q, K, and V with shape (B, T, D), reshape each to (B, T, H, D/H) and transpose to (B, H, T, D/H) so attention is computed independently per head.
Apply the mask to the attention score matrix before softmax by replacing blocked positions with a very large negative number.
For SwiGLU, compute the gate and value projections from the same normalized input, then use swish(gate) * value.
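The reshape-and-transpose hint can be tried on a small tensor to see where each head's slice ends up. The shapes below (B=2, T=3, D=4, H=2) are arbitrary values chosen for illustration.

```python
import numpy as np

B, T, D, H = 2, 3, 4, 2
q = np.arange(B * T * D, dtype=float).reshape(B, T, D)

# (B, T, D) -> (B, T, H, D/H) -> (B, H, T, D/H)
q_heads = q.reshape(B, T, H, D // H).transpose(0, 2, 1, 3)

# Head h holds columns h*(D/H) through (h+1)*(D/H) - 1 of the original tensor.
head1_of_batch0 = q_heads[0, 1]

# Undoing the split: transpose back, then merge the last two axes (the "concatenate heads" step).
restored = q_heads.transpose(0, 2, 1, 3).reshape(B, T, D)
print(q_heads.shape, head1_of_batch0.shape, restored.shape)
```

Reshaping directly to (B, H, T, D/H) without the transpose would mix sequence positions into the head axis, which is a common source of silent errors.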
