Transformer Self-Attention and Backpropagation

What's being tested

Interviewers probe whether you can reason about and implement Transformer-style self-attention end-to-end: the algebra of the forward pass, the chain-rule derivation for gradients through the scaled dot-product and softmax, and practical training tradeoffs (memory, numerical stability, batching, and multi-head projection gradients). Tesla cares because production models must be accurate, efficient, and debuggable — you’ll be expected to implement custom ops, diagnose bad gradients, and choose execution strategies that meet latency and memory constraints.

Core knowledge

Scaled dot-product attention formula: $\text{Att}(Q,K,V)=\text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V.$ Know shapes: if $Q\in\mathbb{R}^{B\times N\times d_k}$ , $K,V\in\mathbb{R}^{B\times N\times d_k}$ , output is $B\times N\times d_k$ .
softmax derivative: for vector z, $\frac{\partial \ell}{\partial z_i}=\sum_j \frac{\partial \ell}{\partial s_j} (s_i(\delta_{ij}-s_j))$ where $s=\text{softmax}(z)$ ; compute via stable log-softmax where possible.
Multi-head projection: queries/keys/values use learned linear maps $W_Q,W_K,W_V\in\mathbb{R}^{d_{model}\times d_k}$ ; gradients flow into both the attention weights and these projections — remember to sum heads' gradient contributions into shared upstream parameters.
Computational cost: full attention costs $O(B N^2 d)$ FLOPs and $O(B N^2)$ memory for the attention matrix; practical limits: $N\lesssim2$ –4k per-GPU for dense attention without specialized kernels.
Numerical stability & masking: apply $(QK^\top)/\sqrt{d_k}$ scaling, subtract max per-row before softmax, and support causal or padding masks to set logits to $-\infty$ (or a large negative) before softmax.
Backprop ordering: compute grads in this order: grad w.r.t. output→grad w.r.t. V and attention weights A→grad w.r.t. logits (via softmax Jacobian)→grad w.r.t. Q and K→grad w.r.t. projection weights; reuse intermediates to reduce recomputation.
Memory-saving strategies: use gradient checkpointing, fused kernels (FlashAttention), or block-sparse/locality-sensitive hashing attention to trade compute for memory; mixed-precision (float16/bfloat16) reduces memory but needs loss_scaling.
Residuals and normalization: typical encoder block = Attention→Add(residual)→LayerNorm→FFN→Add→LayerNorm; gradients pass through residual paths — forgetting to include residual grads yields wrong parameter updates.
Testing gradients: use finite-difference checks on small inputs (double precision) to validate analytical gradients and confirm correct broadcasting and shape assumptions before scaling.
Optimization and regularization: prefer AdamW with weight decay on projection weights, attention dropout on the attention probabilities, and label smoothing for classification heads; monitor gradient norms and learning-rate warmup to avoid early divergence.
Hardware and batching: implement batch-major layouts B,N,C for CUDA; fuse projection + split heads where possible to reduce memory copies; be aware that XLA/TensorRT transformations may change numerical behavior slightly.

Worked example — Implement attention and Transformer with backward pass

First 30 seconds: ask and declare shapes (B,N,d_model; d_k=d_model/h), whether gradients for W_Q,W_K,W_V,W_O are required, and whether causal masking and dropout are enabled. Organize the response into forward-pass steps (linear projections → reshape and transpose to heads → compute logits and masked softmax → weighted sum with V → concat heads → final linear), backward-pass chain (grad-out → grad-V and grad-A → backprop through softmax to logits → split into grad-Q and grad-K → accumulate into projection weights), and implementation/perf choices (naïve vs fused kernels). Flag the important tradeoff: autograd gives correctness quickly but may blow memory — propose checkpointing or FlashAttention if N or batch is large. Close with testing plan and validations: finite-diff gradient checks on small B,N and unit tests for masking and numerical stability; “if I had more time, I’d implement a fused CUDA kernel and compare numerical error vs PyTorch autograd.”

A second angle — Compare RNNs, LSTMs, Transformers, and MPC

Reframe to justify architecture choice: RNN/LSTM require BPTT (backpropagation through time) with $O(N)$ sequential dependency and risk of vanishing/exploding gradients, whereas self-attention provides $O(1)$ sequential depth and direct gradient paths across long ranges at cost $O(N^2)$ memory/time. For latency-constrained inference (e.g., streaming sensor inputs), RNNs or chunked/causal attention with state caching reduce compute; for batch training with long-range context, Transformers scale better with parallel hardware. Model Predictive Control (MPC) is a control-layer strategy, not a sequence model; compare it on closed-loop control stability and explicit constraints versus learned policy outputs from sequence models. A strong answer weighs gradient stability, training parallelism, inference latency, and real-time constraints.

Common pitfalls

Pitfall: Missing the 1/√d_k scaling or the max-subtraction before softmax — this causes extremely small gradients or numerical overflow and will often manifest as NaNs during training.

Pitfall: Assuming shapes without asking batch and head layout — many bugs stem from incorrect reshape/transpose orders or broadcasting errors when accumulating head gradients.

Pitfall: Treating autograd as sufficient for production — while correct, it may be infeasible memory-wise; failing to propose checkpointing, fused kernels, or mixed precision during design is a depth mistake.

Connections

The interviewer may pivot to sparse attention (BigBird, Longformer), efficient kernels like FlashAttention, or hardware-aware optimizations (mixed precision, tensor cores, XLA). They may also ask about positional encodings and how they affect gradient flow or about convergence dynamics (warmup schedules, Adam vs SGD).

What's being tested

Core knowledge

Worked example — Implement attention and Transformer with backward pass

A second angle — Compare RNNs, LSTMs, Transformers, and MPC

Common pitfalls

Connections

Further reading

Practice questions

Related concepts