PracHub
Quick Overview

This question tests your ability to implement and reason about Transformer internals: multi-head self-attention, SwiGLU gated feed-forward layers, residual connections, and layer normalization.

  • medium
  • Google
  • Coding & Algorithms
  • Machine Learning Engineer

Implement a Transformer Block with SwiGLU

Company: Google

Role: Machine Learning Engineer

Category: Coding & Algorithms

Difficulty: medium

Interview Round: Technical Screen

Implement a Transformer-style neural network block in Python using either NumPy or PyTorch. Your implementation should include:

  1. **Multi-head self-attention**
       • Input tensor shape: `(batch_size, sequence_length, hidden_dim)`.
       • Split the hidden dimension into `num_heads` attention heads.
       • Compute scaled dot-product attention.
       • Support an optional attention mask.
       • Concatenate heads and apply an output projection.
  2. **SwiGLU feed-forward network**
       • Implement a gated feed-forward layer using the SwiGLU formulation: `SwiGLU(x) = Swish(xW_gate) * (xW_value)`, followed by an output projection.
       • The gate and value projections should be computed in parallel.
       • Use `Swish(z) = z * sigmoid(z)`.
  3. **Transformer block structure**
       • Include residual connections.
       • Include layer normalization.
       • Return an output tensor with the same shape as the input.

Be prepared to explain tensor shapes, masking behavior, and the time and memory complexity of multi-head attention.
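For the complexity discussion, the following is a rough accounting for a naive implementation that materializes the full attention score matrix. The symbols are shorthand introduced here, not part of the problem statement: B = batch_size, T = sequence_length, D = hidden_dim, H = num_heads, and d_h = D / H.

```latex
\begin{aligned}
\text{Q, K, V, and output projections:}\quad & O(B\,T\,D^{2}) \\
\text{scores } QK^{\top} \text{ over all heads:}\quad & O(B\,H\,T^{2}\,d_h) = O(B\,T^{2}\,D) \\
\text{weighted sum of } V:\quad & O(B\,T^{2}\,D) \\
\text{total time:}\quad & O(B\,T\,D^{2} + B\,T^{2}\,D) \\
\text{attention-weight memory:}\quad & O(B\,H\,T^{2}) \\
\text{activation memory (Q, K, V, output):}\quad & O(B\,T\,D)
\end{aligned}
```

The quadratic term in T comes from the (T x T) score matrix that each head builds. The number of heads does not change the total time, because H * d_h = D, but it does multiply the size of the stored attention weights.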

Implement a Transformer-style block in Python using NumPy. Do not use high-level attention or feed-forward APIs. The block must follow this pre-normalization structure:

  1. y = x + MultiHeadSelfAttention(LN1(x))
  2. out = y + SwiGLUFeedForward(LN2(y))

Details:

  • Input tensor x has shape (batch_size, sequence_length, hidden_dim).
  • Multi-head self-attention must:
      • project x into Q, K, V using weight matrices,
      • split hidden_dim evenly across num_heads,
      • compute scaled dot-product attention,
      • apply an optional mask before softmax,
      • concatenate all heads,
      • apply the output projection.
  • The SwiGLU feed-forward network must compute:
      • gate = x @ W_gate
      • value = x @ W_value
      • swish(gate) = gate * sigmoid(gate)
      • hidden = swish(gate) * value
      • output = hidden @ W_ffn_out
  • Layer normalization is applied over the last dimension only.
  • For deterministic grading, use layer normalization without epsilon: LN(t) = ((t - mean(t)) / sqrt(var(t))) * gamma + beta. All test inputs are chosen so var(t) > 0 for every token.
  • There are no bias terms and no dropout.
  • The returned tensor must have the same shape as the input and each value must be rounded to 6 decimal places.
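One possible NumPy sketch of the block described above is shown here. It is not an official solution. The function names, the positional argument order (inferred from the example inputs), the (T x T) mask layout with nonzero meaning "may attend", and the use of -1e9 as the blocking value are all assumptions made for illustration.

```python
import numpy as np


def layer_norm(t, gamma, beta):
    # Normalize over the last axis only; no epsilon, as the task specifies.
    mean = t.mean(axis=-1, keepdims=True)
    var = t.var(axis=-1, keepdims=True)  # population variance (ddof=0)
    return (t - mean) / np.sqrt(var) * gamma + beta


def softmax(scores):
    # Subtracting the row max keeps exp() stable without changing the result.
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)


def multi_head_self_attention(x, num_heads, w_q, w_k, w_v, w_o, mask=None):
    batch, seq_len, hidden = x.shape
    head_dim = hidden // num_heads

    def split_heads(t):
        # (B, T, D) -> (B, H, T, D/H)
        return t.reshape(batch, seq_len, num_heads, head_dim).transpose(0, 2, 1, 3)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(head_dim)  # (B, H, T, T)
    if mask is not None:
        # Assumption: mask is (T, T); nonzero = may attend, zero = blocked.
        scores = np.where(np.asarray(mask) != 0, scores, -1e9)
    context = softmax(scores) @ v  # (B, H, T, D/H)
    merged = context.transpose(0, 2, 1, 3).reshape(batch, seq_len, hidden)
    return merged @ w_o


def swiglu_ffn(x, w_gate, w_value, w_ffn_out):
    gate = x @ w_gate
    value = x @ w_value
    swish = gate / (1.0 + np.exp(-gate))  # gate * sigmoid(gate)
    return (swish * value) @ w_ffn_out


def transformer_block(x, num_heads, w_q, w_k, w_v, w_o, w_gate, w_value,
                      w_ffn_out, gamma1, beta1, gamma2, beta2, mask=None):
    def arr(a):
        return np.asarray(a, dtype=np.float64)

    x = arr(x)
    # Pre-norm attention sub-layer with residual connection.
    y = x + multi_head_self_attention(
        layer_norm(x, arr(gamma1), arr(beta1)),
        num_heads, arr(w_q), arr(w_k), arr(w_v), arr(w_o), mask)
    # Pre-norm SwiGLU sub-layer with residual connection.
    out = y + swiglu_ffn(
        layer_norm(y, arr(gamma2), arr(beta2)),
        arr(w_gate), arr(w_value), arr(w_ffn_out))
    return np.round(out, 6).tolist()
```

Calling `transformer_block` with the tuples listed under Examples, unpacked positionally, should reproduce the expected outputs. A batched mask of shape (B, T, T) would need an extra head axis before the `np.where` call; the sketch does not handle that case.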

Constraints

  • 1 <= batch_size <= 10
  • 1 <= sequence_length <= 30
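  • 1 <= hidden_dim <= 64 (effectively hidden_dim >= 2, since a one-element token vector always has zero variance and the non-zero-variance constraint below rules it out)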
  • hidden_dim % num_heads == 0
  • w_q, w_k, w_v, w_o are all (hidden_dim x hidden_dim)
  • w_gate and w_value have the same shape (hidden_dim x ffn_dim)
  • w_ffn_out has shape (ffn_dim x hidden_dim)
  • Every token vector has non-zero variance across the hidden dimension
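  • If a mask is provided, every query position has at least one unmasked key (in the examples the mask is a sequence_length x sequence_length matrix in which 1 lets the row's query attend to the column's key and 0 blocks it)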

Examples

Input: ([[[1, 0], [0, 1]]], 1, [[1, 0], [0, 1]], [[1, 0], [0, 1]], [[1, 0], [0, 1]], [[1, 0], [0, 1]], [[0, 0], [0, 0]], [[0, 0], [0, 0]], [[0, 0], [0, 0]], [1, 1], [0, 0], [1, 1], [0, 0], None)

Expected Output: [[[1.888386, -0.888386], [-0.888386, 1.888386]]]

Explanation: Layer normalization turns the two tokens into [1, -1] and [-1, 1]. With identity Q/K/V/O and one head, each token attends mostly to itself. The feed-forward part is zero, so only the attention residual changes the input.
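The normalization step in this explanation can be checked by hand with a few lines of plain Python. This is a small sketch for a single token vector; the helper name `layer_norm_token` is made up for illustration, and it uses the population variance (divide by the vector length), which is what the expected outputs imply.

```python
import math


def layer_norm_token(token, gamma, beta):
    # Epsilon-free layer norm over one token vector, as the task defines it.
    mean = sum(token) / len(token)
    var = sum((v - mean) ** 2 for v in token) / len(token)  # population variance
    std = math.sqrt(var)
    return [(v - mean) / std * g + b for v, g, b in zip(token, gamma, beta)]


token0 = layer_norm_token([1, 0], [1, 1], [0, 0])
token1 = layer_norm_token([0, 1], [1, 1], [0, 0])
print(token0, token1)
```

For [1, 0] the mean is 0.5 and the variance is 0.25, so dividing the centered values by 0.5 gives [1, -1]; the second token mirrors it.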

Input: ([[[1, 0], [0, 1]]], 1, [[1, 0], [0, 1]], [[1, 0], [0, 1]], [[1, 0], [0, 1]], [[1, 0], [0, 1]], [[0, 0], [0, 0]], [[0, 0], [0, 0]], [[0, 0], [0, 0]], [1, 1], [0, 0], [1, 1], [0, 0], [[1, 0], [1, 1]])

Expected Output: [[[2.0, -1.0], [-0.888386, 1.888386]]]

Explanation: The mask blocks token 0 from attending to token 1, so token 0 attends only to itself and adds [1, -1] to the residual. Token 1 can still attend to both positions.
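To see how the mask changes token 0's result between the first two examples, the attention weights for that token can be recomputed in plain Python. This is a sketch under stated assumptions: the helper names are hypothetical, blocked positions are replaced with -1e9 (one common choice for "a very large negative number"), and the normalized tokens [1, -1] and [-1, 1] serve as queries, keys, and values because the projections are identity matrices.

```python
import math


def masked_softmax(scores, mask_row, blocked=-1e9):
    # Replace blocked scores before softmax so their weight underflows to 0.
    masked = [s if m else blocked for s, m in zip(scores, mask_row)]
    peak = max(masked)
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


values = [[1.0, -1.0], [-1.0, 1.0]]  # normalized tokens, identity W_v
residual = [1.0, 0.0]                # original token 0
scale = math.sqrt(2)                 # sqrt(head_dim) with head_dim = 2
scores_token0 = [2 / scale, -2 / scale]  # q0.k0 = 2, q0.k1 = -2


def attend(weights):
    # Weighted sum of the value vectors, then the residual connection.
    return [residual[d] + sum(w * v[d] for w, v in zip(weights, values))
            for d in range(2)]


unmasked = masked_softmax(scores_token0, [1, 1])
masked = masked_softmax(scores_token0, [1, 0])
print(unmasked, attend(unmasked))
print(masked, attend(masked))
```

Without the mask, token 0 puts roughly 0.944 of its weight on itself, which leads to the 1.888386 and -0.888386 values in the first example. With the mask row [1, 0], all of the weight lands on token 0, so the attention output is exactly [1, -1] and the result is [2, -1].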

Input: ([[[1, -1]]], 1, [[0, 0], [0, 0]], [[0, 0], [0, 0]], [[0, 0], [0, 0]], [[0, 0], [0, 0]], [[1, 0], [0, 1]], [[1, 0], [0, 1]], [[1, 0], [0, 1]], [1, 1], [0, 0], [1, 1], [0, 0], None)

Expected Output: [[[1.731059, -0.731059]]]

Explanation: Attention is zero because all attention projection matrices are zero. The normalized token is [1, -1], so SwiGLU computes swish([1, -1]) * [1, -1] = [0.731059, 0.268941], which is added back through the residual.
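The SwiGLU arithmetic in this explanation can be reproduced with scalar operations. The sketch below assumes, as in this example, that W_gate, W_value, and W_ffn_out are identity matrices, so the gate and value are both the normalized token; the variable names are illustrative only.

```python
import math


def swish(z):
    # swish(z) = z * sigmoid(z), written as z / (1 + e^(-z)).
    return z / (1.0 + math.exp(-z))


x = [1.0, -1.0]           # input token; attention contributes nothing here
normalized = [1.0, -1.0]  # LN2 of the token (mean 0, variance 1)
gate = list(normalized)   # normalized @ identity W_gate
value = list(normalized)  # normalized @ identity W_value

hidden = [swish(g) * v for g, v in zip(gate, value)]
out = [round(xi + h, 6) for xi, h in zip(x, hidden)]
print(hidden, out)
```

Note that swish(-1) is negative (about -0.268941), and multiplying it by the value -1 is what makes the second hidden entry positive.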

Input: ([[[1, 2, 3, 4]]], 2, [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]], [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]], [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]], [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]], [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]], [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]], [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]], [1, 1, 1, 1], [0, 0, 0, 0], [1, 1, 1, 1], [0, 0, 0, 0], None)

Expected Output: [[[-0.341641, 1.552786, 3.447214, 5.341641]]]

Explanation: This checks the multi-head split with hidden_dim=4 and num_heads=2. With a single token, each head attends only to that token, so the attention output is exactly the normalized input, which is added back through the residual.
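The head split that this example exercises is a reshape followed by a transpose. The following sketch uses a toy tensor, not one of the test inputs, to show which features each head receives and that merging the heads restores the original layout.

```python
import numpy as np

batch, seq_len, hidden, num_heads = 1, 3, 4, 2
head_dim = hidden // num_heads

# Toy activations: token t holds the values 4t, 4t+1, 4t+2, 4t+3.
x = np.arange(batch * seq_len * hidden, dtype=float).reshape(batch, seq_len, hidden)

# (B, T, D) -> (B, T, H, D/H) -> (B, H, T, D/H)
heads = x.reshape(batch, seq_len, num_heads, head_dim).transpose(0, 2, 1, 3)

# Inverse: (B, H, T, D/H) -> (B, T, H, D/H) -> (B, T, D)
merged = heads.transpose(0, 2, 1, 3).reshape(batch, seq_len, hidden)

print(heads.shape)     # shape after the split
print(heads[0, 0, 0])  # features of token 0 seen by head 0
print(heads[0, 1, 0])  # features of token 0 seen by head 1
```

Each head sees a contiguous slice of the hidden dimension: head 0 gets the first D/H features of every token and head 1 gets the next D/H. Skipping the transpose and reshaping straight to (B, H, T, D/H) would mix features across tokens.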

Hints

  1. After computing Q, K, and V with shape (B, T, D), reshape each to (B, T, H, D/H) and transpose to (B, H, T, D/H) so attention is computed independently per head.
  2. Apply the mask to the attention score matrix before softmax by replacing blocked positions with a very large negative number. For SwiGLU, compute the gate and value projections from the same normalized input, then use swish(gate) * value.
Last updated: May 23, 2026

Related Coding Questions

  • Find Containing Range - Google (medium)
  • Rearrange Tasks With Cooldown - Google (medium)
  • Solve Three Array and Matrix Path Problems - Google (medium)
  • Consolidate On-Call Rotation Segments - Google (medium)
  • Solve Flower Placement and Directory Deletion - Google (medium)