Implement attention and Transformer with backward pass
Company: Tesla
Role: Machine Learning Engineer
Category: Machine Learning
Difficulty: hard
Interview Round: Technical Screen
Implement scaled dot-product attention and a Transformer block from scratch (no autograd). Provide both forward and backward passes for the attention module; assume the backward for softmax is already implemented and available. Given input X of shape (B, T, d_model), multi-head attention with h heads (head_dim = d_model / h), learnable parameters W_q, W_k, W_v in R^{d_model x d_model} and W_o in R^{d_model x d_model}, and a causal mask for autoregressive training:
1) Forward: compute Q, K, V, split into heads, compute scores = Q K^T / sqrt(head_dim), apply the causal mask, apply softmax, compute the attention output, merge heads, and apply W_o (a sketch follows after this list). If building a full block, include the residual connections, LayerNorm, and a position-wise feed-forward sublayer.
2) Backward: derive and implement gradients for W_q, W_k, W_v, W_o and for the inputs (dX), including gradient flow through the scaling, the matmuls, the masking, the head reshaping/transpose, and the output projection; use the provided softmax backward to map the gradient of the attention weights (dP) back to the gradient of the pre-softmax scores (see the backward sketch below).
3) Autoregressive training: define the next-token cross-entropy loss with teacher forcing, show where the causal mask is applied, and compute perplexity (see the loss sketch below).
4) Verification: implement finite-difference gradient checks on small tensors and report the maximum relative errors; discuss numerical stability (e.g., a stabilized softmax using the log-sum-exp trick), as in the gradient-check sketch below. Provide clear function signatures and tensor shapes at each step, plus a time/memory complexity analysis.
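
A minimal NumPy sketch of the forward pass in part (1), under the shapes stated above; the names softmax and mha_forward and the cache layout are illustrative choices rather than anything mandated by the question, and the full-block extras (residual, LayerNorm, feed-forward) are omitted for brevity.

import numpy as np

def softmax(x, axis=-1):
    # stabilized softmax: subtract the row max before exponentiating
    x_max = np.max(x, axis=axis, keepdims=True)
    e = np.exp(x - x_max)
    return e / np.sum(e, axis=axis, keepdims=True)

def mha_forward(X, Wq, Wk, Wv, Wo, h):
    """Causal multi-head self-attention forward pass.

    X: (B, T, d_model); Wq, Wk, Wv, Wo: (d_model, d_model); h: number of heads.
    Returns Y: (B, T, d_model) and a cache of intermediates for the backward pass.
    """
    B, T, d = X.shape
    hd = d // h                                              # head_dim
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                         # each (B, T, d)

    def split(M):                                            # (B, T, d) -> (B, h, T, hd)
        return M.reshape(B, T, h, hd).transpose(0, 2, 1, 3)

    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 1, 3, 2) / np.sqrt(hd)     # (B, h, T, T)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)         # True strictly above the diagonal
    scores = np.where(mask, -1e9, scores)                    # causal mask: no attending to the future
    P = softmax(scores, axis=-1)                             # attention weights, (B, h, T, T)
    Oh = P @ Vh                                              # (B, h, T, hd)
    O = Oh.transpose(0, 2, 1, 3).reshape(B, T, d)            # merge heads
    Y = O @ Wo                                               # output projection
    cache = (X, Qh, Kh, Vh, P, O, Wq, Wk, Wv, Wo, mask, h, hd)
    return Y, cache

Splitting into heads costs only a reshape and transpose; the dominant cost is the two (B, h, T, T)-shaped matmuls, giving O(B * h * T^2 * head_dim) time and O(B * h * T^2) memory for the attention matrix.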
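A matching backward-pass sketch for part (2), reusing the cache returned by mha_forward above; softmax_backward stands in for the softmax backward that the question says is already provided.

def softmax_backward(dP, P):
    # "provided" routine: maps grad w.r.t. softmax output P to grad w.r.t. its logits
    return P * (dP - np.sum(dP * P, axis=-1, keepdims=True))

def mha_backward(dY, cache):
    """Gradients of causal multi-head self-attention.

    dY: (B, T, d_model) upstream gradient; cache: intermediates from mha_forward.
    Returns dX and gradients for Wq, Wk, Wv, Wo.
    """
    X, Qh, Kh, Vh, P, O, Wq, Wk, Wv, Wo, mask, h, hd = cache
    B, T, d = X.shape

    # output projection: Y = O @ Wo
    dWo = O.reshape(B * T, d).T @ dY.reshape(B * T, d)
    dO = dY @ Wo.T
    dOh = dO.reshape(B, T, h, hd).transpose(0, 2, 1, 3)      # un-merge heads

    # weighted sum of values: Oh = P @ Vh
    dP = dOh @ Vh.transpose(0, 1, 3, 2)
    dVh = P.transpose(0, 1, 3, 2) @ dOh

    # softmax backward, then zero masked positions and undo the 1/sqrt(head_dim) scaling
    dScores = softmax_backward(dP, P)
    dScores = np.where(mask, 0.0, dScores) / np.sqrt(hd)

    # scores = Qh @ Kh^T
    dQh = dScores @ Kh
    dKh = dScores.transpose(0, 1, 3, 2) @ Qh

    def merge(Mh):                                           # (B, h, T, hd) -> (B, T, d)
        return Mh.transpose(0, 2, 1, 3).reshape(B, T, d)

    dQ, dK, dV = merge(dQh), merge(dKh), merge(dVh)

    # input projections: Q = X @ Wq, K = X @ Wk, V = X @ Wv
    X2 = X.reshape(B * T, d)
    dWq = X2.T @ dQ.reshape(B * T, d)
    dWk = X2.T @ dK.reshape(B * T, d)
    dWv = X2.T @ dV.reshape(B * T, d)
    dX = dQ @ Wq.T + dK @ Wk.T + dV @ Wv.T
    return dX, dWq, dWk, dWv, dWo

Note that dX accumulates contributions from all three projection paths, mirroring how X feeds Q, K, and V in the forward pass.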
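For part (3), one way to set up teacher-forced next-token training: the model consumes tokens 0..T-1, the causal mask inside the attention above keeps position t from seeing positions beyond t, and the targets are the same sequence shifted left by one. The helper below is a hypothetical sketch; it assumes a final projection has already produced logits of shape (B, T, vocab).

def cross_entropy_and_perplexity(logits, targets):
    """Next-token cross-entropy with teacher forcing.

    logits:  (B, T, vocab) -- predictions for the token at the next position.
    targets: (B, T) integer token ids, shifted by one relative to the inputs.
    Returns the mean loss, perplexity, and the gradient w.r.t. the logits.
    """
    B, T, V = logits.shape
    # log-softmax via the log-sum-exp trick for numerical stability
    m = logits.max(axis=-1, keepdims=True)
    log_probs = logits - m - np.log(np.exp(logits - m).sum(axis=-1, keepdims=True))
    b_idx = np.arange(B)[:, None]
    t_idx = np.arange(T)[None, :]
    nll = -log_probs[b_idx, t_idx, targets]   # (B, T) per-token negative log-likelihood
    loss = nll.mean()                         # mean NLL over all positions
    perplexity = np.exp(loss)                 # ppl = exp(mean NLL)
    # gradient w.r.t. logits: (softmax - one_hot) / (B*T)
    dlogits = np.exp(log_probs)
    dlogits[b_idx, t_idx, targets] -= 1.0
    dlogits /= (B * T)
    return loss, perplexity, dlogits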
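For part (4), a central-difference gradient checker; grad_check is a hypothetical helper name, not something the question supplies.

def grad_check(f, x, analytic_grad, eps=1e-5):
    """Compare df/dx (central differences) against an analytic gradient.

    f: scalar-valued function of x; analytic_grad: array with the same shape as x.
    Returns the maximum relative error over all entries of x.
    """
    max_rel_err = 0.0
    it = np.nditer(x, flags=['multi_index'])
    while not it.finished:
        idx = it.multi_index
        old = x[idx]
        x[idx] = old + eps; f_plus = f(x)     # perturb up
        x[idx] = old - eps; f_minus = f(x)    # perturb down
        x[idx] = old                          # restore
        num = (f_plus - f_minus) / (2 * eps)
        ana = analytic_grad[idx]
        rel = abs(num - ana) / max(abs(num) + abs(ana), 1e-12)
        max_rel_err = max(max_rel_err, rel)
        it.iternext()
    return max_rel_err

One way to use it is to define a scalar loss such as Y.sum() over mha_forward, obtain analytic gradients from mha_backward with dY = np.ones_like(Y), and check each parameter in turn; on small float64 tensors the maximum relative error is typically around 1e-6 or below, while a large error usually points at the masking, scaling, or reshape/transpose steps.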
Quick Answer: This question tests whether the candidate can implement scaled dot-product multi-head self-attention and a Transformer block by hand, including the forward and backward passes, gradient derivations for the projection matrices, causal masking for autoregressive next-token prediction, a numerically stable softmax, and time/memory complexity analysis. It emphasizes practical implementation grounded in linear algebra and backpropagation mechanics, and is commonly asked to verify mastery of attention mechanisms and gradient reasoning without autodiff, along with the ability to compute cross-entropy and perplexity, run finite-difference gradient checks, and reason about algorithmic trade-offs.