Implement Scaled Dot-Product Attention and a Transformer Block (No Autograd)
Context: Build multi-head self-attention and a Transformer encoder-style block (self-attention plus a position-wise feed-forward sublayer, made causal with a mask) from scratch for autoregressive next-token prediction. Implement both the forward and backward passes manually (no autograd). A numerically stable softmax and its backward are provided.
Assume:
- Inputs: X ∈ R^{B×T×d_model}
- Multi-head attention with h heads; head dimension d_h = d_model / h
- Learnable parameters: W_q, W_k, W_v, W_o ∈ R^{d_model×d_model}
- Causal mask M ∈ R^{T×T}, where M[i, j] = 0 if j ≤ i and M[i, j] = −∞ otherwise
- Provided: softmax(z, axis=-1) and softmax_backward(dY, Y), which maps dY = ∂L/∂softmax(z) to dZ = ∂L/∂z (an illustrative version of both appears below)
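A minimal sketch of the causal mask and the provided helpers, assuming NumPy. The function names match the assumptions above, but these bodies are illustrative rather than the graded interface:

```python
import numpy as np

def causal_mask(T):
    # M[i, j] = 0 if j <= i, -inf otherwise (strict upper triangle is masked).
    M = np.zeros((T, T))
    M[np.triu_indices(T, k=1)] = -np.inf
    return M

def softmax(z, axis=-1):
    # Numerically stable softmax: shift by the row-wise max before exponentiating.
    z_shift = z - np.max(z, axis=axis, keepdims=True)
    e = np.exp(z_shift)
    return e / np.sum(e, axis=axis, keepdims=True)

def softmax_backward(dY, Y, axis=-1):
    # Given Y = softmax(Z) and dY = dL/dY, return dZ = dL/dZ:
    # dZ = Y * (dY - sum(dY * Y)), the row-wise Jacobian-vector product.
    return Y * (dY - np.sum(dY * Y, axis=axis, keepdims=True))
```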
Tasks
- Forward
  - Compute Q, K, V, split into heads, and form scores = Q K^T / sqrt(d_h)
  - Apply the causal mask, softmax, and attention-weighted sum over V; merge heads and apply W_o (see the sketch after this item)
  - If building a full block, add residual connections, LayerNorm, and a position-wise feed-forward sublayer
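A forward-pass sketch under the shapes above (NumPy, batch-first layout). The cache returned here is one possible layout for reuse in the backward pass, not a required interface:

```python
def mha_forward(X, W_q, W_k, W_v, W_o, h):
    # X: (B, T, d_model); W_q, W_k, W_v, W_o: (d_model, d_model)
    B, T, d_model = X.shape
    d_h = d_model // h

    Q, K, V = X @ W_q, X @ W_k, X @ W_v          # each (B, T, d_model)

    # Split into heads: (B, T, d_model) -> (B, h, T, d_h)
    def split(Z):
        return Z.reshape(B, T, h, d_h).transpose(0, 2, 1, 3)
    Qh, Kh, Vh = split(Q), split(K), split(V)

    # Scaled dot-product scores, (B, h, T, T), plus the additive causal mask
    scores = Qh @ Kh.transpose(0, 1, 3, 2) / np.sqrt(d_h)
    scores = scores + causal_mask(T)             # broadcasts over (B, h)

    A = softmax(scores, axis=-1)                 # attention weights, (B, h, T, T)
    Oh = A @ Vh                                  # (B, h, T, d_h)

    # Merge heads back to (B, T, d_model), then apply the output projection
    O = Oh.transpose(0, 2, 1, 3).reshape(B, T, d_model)
    Y = O @ W_o

    cache = (X, Qh, Kh, Vh, A, O, W_q, W_k, W_v, W_o, h)
    return Y, cache
```

Caching Qh, Kh, Vh, A, and O avoids recomputation in the backward pass at the cost of O(B·h·T²) extra memory for the attention weights.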
- Backward
  - Derive and implement gradients for W_q, W_k, W_v, W_o and the input gradient dX
  - Include gradients through the scaling, the matmuls, the masking, the head reshaping/transposes, and the output projection (see the sketch after this item)
  - Use the provided softmax backward to map dScores (the gradient w.r.t. the attention weights) to dLogits (the gradient w.r.t. the masked, scaled scores)
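A matching backward-pass sketch, obtained by reversing each forward step; variable names follow the forward sketch above and are illustrative:

```python
def mha_backward(dY, cache):
    # dY: (B, T, d_model) = dL/dY from the layers above.
    X, Qh, Kh, Vh, A, O, W_q, W_k, W_v, W_o, h = cache
    B, T, d_model = X.shape
    d_h = d_model // h

    # Output projection: Y = O @ W_o
    dW_o = O.reshape(B * T, d_model).T @ dY.reshape(B * T, d_model)
    dO = dY @ W_o.T

    # Un-merge heads: (B, T, d_model) -> (B, h, T, d_h)
    dOh = dO.reshape(B, T, h, d_h).transpose(0, 2, 1, 3)

    # Oh = A @ Vh
    dA = dOh @ Vh.transpose(0, 1, 3, 2)
    dVh = A.transpose(0, 1, 3, 2) @ dOh

    # A = softmax(masked scores); the additive -inf mask is a constant,
    # so its gradient passes through unchanged (masked entries get zero anyway).
    dScores = softmax_backward(dA, A)

    # scores = Qh @ Kh^T / sqrt(d_h)
    dQh = dScores @ Kh / np.sqrt(d_h)
    dKh = dScores.transpose(0, 1, 3, 2) @ Qh / np.sqrt(d_h)

    # Merge head gradients back: (B, h, T, d_h) -> (B, T, d_model)
    def merge(Zh):
        return Zh.transpose(0, 2, 1, 3).reshape(B, T, d_model)
    dQ, dK, dV = merge(dQh), merge(dKh), merge(dVh)

    # Q = X @ W_q (and likewise for K, V); X feeds all three projections
    Xf = X.reshape(B * T, d_model)
    dW_q = Xf.T @ dQ.reshape(B * T, d_model)
    dW_k = Xf.T @ dK.reshape(B * T, d_model)
    dW_v = Xf.T @ dV.reshape(B * T, d_model)
    dX = dQ @ W_q.T + dK @ W_k.T + dV @ W_v.T

    return dX, dW_q, dW_k, dW_v, dW_o
```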
- Autoregressive Training
  - Define the next-token cross-entropy loss with teacher forcing; show where the causal mask is applied (see the sketch after this item)
  - Compute perplexity
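A sketch of the training loss and perplexity, assuming final logits of shape (B, T, V) and integer targets equal to the input tokens shifted left by one position (teacher forcing). Causality is enforced inside attention by the additive mask, not by the loss:

```python
def cross_entropy_loss(logits, targets):
    # logits: (B, T, V) prediction for the token at each position t+1.
    # targets: (B, T) integer token ids (inputs shifted left by one).
    B, T, V = logits.shape

    # Log-sum-exp stabilized log-softmax.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

    # Mean negative log-likelihood of the target tokens.
    b_idx, t_idx = np.arange(B)[:, None], np.arange(T)[None, :]
    nll = -log_probs[b_idx, t_idx, targets]
    loss = nll.mean()

    # Gradient w.r.t. logits: softmax minus one-hot, averaged over B*T tokens.
    dlogits = np.exp(log_probs)
    dlogits[b_idx, t_idx, targets] -= 1.0
    dlogits /= (B * T)

    # Perplexity = exp of the mean per-token negative log-likelihood.
    perplexity = np.exp(loss)
    return loss, dlogits, perplexity
```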
- Verification and Analysis
  - Implement finite-difference gradient checks on small tensors; report maximum relative errors (see the sketch after this list)
  - Discuss numerical stability (e.g., the max-shifted softmax and the log-sum-exp trick)
  - Provide clear function signatures and tensor shapes at each step
  - Include time and memory complexity analysis
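A finite-difference check sketch (central differences on tiny tensors), reusing the mha_forward/mha_backward sketches above with the scalar loss L = sum(Y). The helper name and the parameter dict are illustrative:

```python
def grad_check(name, P, X, params, eps=1e-5):
    # name: one of 'dX', 'dW_q', 'dW_k', 'dW_v', 'dW_o'.
    # P: the array to perturb in place (pass X itself for 'dX',
    #    or the corresponding entry of params for a weight gradient).
    def loss():
        Y, _ = mha_forward(X, params['W_q'], params['W_k'],
                           params['W_v'], params['W_o'], params['h'])
        return Y.sum()

    Y, cache = mha_forward(X, params['W_q'], params['W_k'],
                           params['W_v'], params['W_o'], params['h'])
    grads = mha_backward(np.ones_like(Y), cache)   # dL/dY = 1 for L = sum(Y)
    analytic = dict(zip(['dX', 'dW_q', 'dW_k', 'dW_v', 'dW_o'], grads))[name]

    numeric = np.zeros_like(P)
    for idx in np.ndindex(P.shape):
        orig = P[idx]
        P[idx] = orig + eps; f_plus = loss()
        P[idx] = orig - eps; f_minus = loss()
        P[idx] = orig
        numeric[idx] = (f_plus - f_minus) / (2 * eps)

    rel_err = np.max(np.abs(analytic - numeric)
                     / (np.abs(analytic) + np.abs(numeric) + 1e-12))
    print(f"{name}: max relative error = {rel_err:.2e}")
    return rel_err
```

Usage, e.g., grad_check('dW_q', params['W_q'], X, params) on a tiny problem such as B = 2, T = 4, d_model = 8, h = 2; with float64 tensors a correct backward pass typically yields maximum relative errors on the order of 1e-6 or smaller.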