ML Fundamentals: Backprop, Attention, and RL

What's being tested

This area tests whether you can implement and reason about machine learning primitives at the level needed to build reliable model-facing software: gradients, log-probabilities, tensor shapes, masking, numerical stability, and training-loop invariants. For a Software Engineer, the bar is not inventing new architectures; it is being able to read, modify, debug, and productionize ML code without silently changing model behavior. Anthropic cares because small engineering mistakes in attention masks, log-prob aggregation, or policy-gradient ratios can create large downstream failures: broken evaluations, unstable training, incorrect classifier scores, or wasted accelerator time. Interviewers are probing for a combination of mathematical literacy, implementation discipline, and debugging judgment.

Core knowledge

Backpropagation is repeated application of the chain rule over a computation graph. For a parameter $\theta$, compute $\frac{\partial L}{\partial \theta}$ by caching forward intermediates, propagating upstream gradients backward, and matching every gradient tensor’s shape to its corresponding activation or parameter.
Binary cross-entropy should be implemented from logits, not probabilities, for stability. Prefer BCEWithLogitsLoss-style math:
$L(x,y)=\max(x,0)-xy+\log(1+\exp(-|x|))$
This avoids overflow in $\exp(-x)$ for large negative $x$ and $\log(0)$ when sigmoid(x) saturates for large $|x|$.
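As a minimal sketch of the formula above, the following pure-Python functions compare the logit-based loss with a naive probability-based version. The function names `bce_with_logits` and `bce_naive` are illustrative, not library APIs.

```python
import math

def bce_with_logits(x: float, y: float) -> float:
    """Binary cross-entropy for one logit x and target y in [0, 1]."""
    # max(x, 0) - x*y + log(1 + exp(-|x|)): exp never sees a positive argument,
    # so it cannot overflow, and log1p keeps precision when exp(-|x|) is tiny.
    return max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))

def bce_naive(x: float, y: float) -> float:
    """Same loss computed through the probability; breaks for large |x|."""
    p = 1.0 / (1.0 + math.exp(-x))          # overflows for very negative x
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))  # log(0) if p saturates
```

For moderate logits the two agree. At x = 1000 with y = 0 the stable version returns 1000.0, while the naive version raises a math domain error because the sigmoid rounds to exactly 1.0.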
Gradient checking compares analytic gradients to finite differences:
$\frac{f(\theta+\epsilon)-f(\theta-\epsilon)}{2\epsilon}$
Use float64, small networks, and $\epsilon \approx 10^{-5}$. With central differences on a smooth loss, relative error is typically around 1e-7 or smaller; errors above roughly 1e-4 often mean a broadcasting or reduction bug.
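A gradient check along these lines can be sketched in NumPy. The model here (a linear layer with mean BCE-from-logits loss), the random data, and all function names are assumptions chosen only to keep the example small.

```python
import numpy as np

def loss_and_grad(w, X, y):
    """Mean BCE-with-logits loss of a linear model and its analytic gradient."""
    z = X @ w                                    # logits, shape [n]
    loss = np.mean(np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z))))
    p = 1.0 / (1.0 + np.exp(-z))                 # sigmoid; safe for the small |z| used here
    grad = X.T @ (p - y) / len(y)                # shape [d], same shape as w
    return loss, grad

def numerical_grad(f, w, eps=1e-5):
    """Central finite differences, perturbing one coordinate at a time."""
    g = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step[i] = eps
        g[i] = (f(w + step) - f(w - step)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))                      # float64 by default
y = rng.integers(0, 2, size=8).astype(np.float64)
w = rng.normal(size=3)

_, analytic = loss_and_grad(w, X, y)
numeric = numerical_grad(lambda v: loss_and_grad(v, X, y)[0], w)
max_abs_diff = np.max(np.abs(analytic - numeric))
```

Dropping the `/ len(y)` term or summing over the wrong axis makes `max_abs_diff` jump by many orders of magnitude, which is the kind of reduction bug this check is meant to catch.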
Scaled dot-product attention computes
$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + M\right)V$
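where $M$ is an additive mask: 0 for allowed positions and $-\infty$ (or a large negative value) for blocked ones. Dividing by $\sqrt{d_k}$ keeps the variance of the dot products from growing with the key dimension $d_k$, which would otherwise saturate the softmax.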
Masking semantics are a common failure point. A causal mask prevents token $i$ from attending to future positions $j>i$; a padding mask prevents attending to non-real tokens. In PyTorch, masks must broadcast correctly across batch and head dimensions.
Numerically stable softmax subtracts the row-wise maximum before exponentiation: softmax(x) = exp(x - max(x)) / sum(exp(x - max(x))). In mixed precision, use float32 accumulation for attention scores when possible, especially before softmax.
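The max-subtraction trick can be sketched in NumPy as follows. The upcast of float16 inputs to float32 is an illustrative stand-in for float32 accumulation in a mixed-precision framework, not a description of any particular library's behavior.

```python
import numpy as np

def softmax(x, axis=-1):
    """Softmax that subtracts the row-wise maximum before exponentiating."""
    x = np.asarray(x)
    # Promote to at least float32 so half-precision inputs are accumulated
    # in a wider type (float64 inputs stay float64).
    x = x.astype(np.result_type(x.dtype, np.float32))
    shifted = x - np.max(x, axis=axis, keepdims=True)   # largest entry becomes 0
    e = np.exp(shifted)                                  # all values in (0, 1]
    return e / np.sum(e, axis=axis, keepdims=True)
```

Because softmax is invariant to adding a constant to every entry of a row, the shift changes nothing mathematically, but an input such as [1000.0, 1000.0] now gives [0.5, 0.5] instead of overflowing.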
Log-probabilities should be aggregated in log space. For binary classification using an LLM, compare candidate label sequences using sums of token log-probs, and consider length normalization (for example, dividing by the token count) if the labels tokenize to different lengths. Use log-sum-exp for stable probability normalization.
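A small sketch of this scoring scheme, in pure Python, is shown below. The function names and the example token log-probs are hypothetical, and the length-normalized scores are a heuristic rather than true sequence log-probabilities.

```python
import math

def logsumexp(values):
    """log(sum(exp(v))) computed without overflow or underflow to zero."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def label_log_probs(token_logprobs_by_label, length_normalize=False):
    """Score each candidate label, then normalize across labels in log space."""
    scores = {}
    for label, token_lps in token_logprobs_by_label.items():
        total = sum(token_lps)                   # log of the product of token probs
        scores[label] = total / len(token_lps) if length_normalize else total
    z = logsumexp(list(scores.values()))
    return {label: s - z for label, s in scores.items()}

# Hypothetical per-token log-probs: "yes" is one token, "no" happens to be two.
candidates = {"yes": [-0.2], "no": [-1.0, -0.9]}
normalized = label_log_probs(candidates)
```

Exponentiating the returned values gives probabilities that sum to 1 over the candidate set, which is usually what a downstream threshold or calibration step expects.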
Calibration separates ranking from probability quality. An LLM may assign higher log-prob to the correct class but produce poorly calibrated probabilities. Simple post-hoc methods such as temperature scaling can improve probability estimates without changing the underlying rank order.
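Temperature scaling can be sketched as a one-parameter fit on held-out data. The grid search below is a deliberately simple stand-in for a proper optimizer, and the logits and labels are made-up values for illustration.

```python
import numpy as np

def probs_at_temperature(logits, T):
    """Softmax of logits / T, with max subtraction for stability."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def nll(logits, labels, T):
    """Mean negative log-likelihood of the true labels at temperature T."""
    p = probs_at_temperature(logits, T)
    return -np.mean(np.log(p[np.arange(len(labels)), labels]))

def fit_temperature(logits, labels, grid=np.linspace(0.25, 5.0, 96)):
    """Pick the grid temperature with the lowest held-out NLL."""
    return min(grid, key=lambda T: nll(logits, labels, T))

# Made-up, overconfident logits: the second example is confidently wrong.
logits = np.array([[4.0, 0.0], [3.0, 0.0], [0.0, 5.0], [6.0, 0.0]])
labels = np.array([0, 1, 1, 0])
T_fit = fit_temperature(logits, labels)
```

Dividing every logit by the same positive T leaves the argmax of each row unchanged, so accuracy is untouched while the confidence of the predictions is softened or sharpened.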
On-policy reinforcement learning loops require log-probs from the same policy distribution that generated the sampled responses, or carefully bounded importance sampling. In PPO/GRPO-style code, stale samples, recomputed masks, or mismatched tokenization can corrupt ratios.
Importance-sampling ratios usually take the form
$r_t(\theta)=\exp(\log \pi_\theta(a_t|s_t)-\log \pi_{\text{old}}(a_t|s_t))$
Compute in log space, clamp or clip only where the algorithm specifies, and inspect ratio histograms for explosions or collapse.
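As an illustration, here is the ratio computed in log space together with the standard PPO clipped surrogate. The value clip_eps=0.2 is an assumed, commonly used default, and the function names are illustrative.

```python
import numpy as np

def importance_ratio(new_logprobs, old_logprobs):
    """r = exp(log pi_new - log pi_old), computed from log-probs."""
    return np.exp(new_logprobs - old_logprobs)

def clipped_surrogate(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """Per-token PPO-style clipped objective (to be maximized)."""
    ratio = importance_ratio(new_logprobs, old_logprobs)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # The elementwise minimum removes the incentive to push the ratio
    # further outside the clip range in the direction the advantage favors.
    return np.minimum(unclipped, clipped)
```

With a ratio of 2 and an advantage of +1 the objective is capped at 1.2, while with an advantage of -1 the unclipped value of -2 is kept, so the penalty for a harmful move is not hidden by clipping.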
Advantage normalization improves training stability but must respect grouping and masking semantics. In GRPO-style setups, advantages may be normalized within a group of completions for the same prompt; including padding tokens or mixing unrelated prompts changes the learning signal.
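Group-wise normalization might look like the sketch below. It assumes one scalar reward per completion and that rewards are ordered so that each consecutive block of `group_size` entries belongs to the same prompt; both are assumptions about the data layout, not requirements of any specific GRPO implementation.

```python
import numpy as np

def group_normalized_advantages(rewards, group_size, eps=1e-8):
    """Normalize rewards within each prompt's group of completions."""
    r = np.asarray(rewards, dtype=np.float64).reshape(-1, group_size)
    mean = r.mean(axis=1, keepdims=True)         # per-prompt baseline
    std = r.std(axis=1, keepdims=True)           # per-prompt scale
    # eps keeps groups with identical rewards at advantage 0 instead of NaN.
    return ((r - mean) / (std + eps)).reshape(-1)
```

Normalizing across the whole batch instead would let an easy prompt with uniformly high rewards dominate the signal, which is the grouping mistake described above.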
Testing ML code requires shape tests, numerical tests, and invariance tests. For attention, test all-masked rows, single-token sequences, causal behavior, padding behavior, and parity with torch.nn.functional.scaled_dot_product_attention where applicable.

Worked example

For “Implement and analyze custom attention”, a strong candidate starts by clarifying tensor layout: “I’ll assume Q, K, and V have shape [batch, heads, seq_len, head_dim], and I’ll return [batch, heads, query_len, head_dim].” They should ask whether the mask is causal, padding, or both, and whether the implementation must support mixed precision. The answer can be organized into four steps: compute scores with Q @ K.transpose(-2, -1), scale by 1 / sqrt(head_dim), apply masks before softmax, then multiply by V. They should explicitly say that masks should use a large negative additive value or boolean masking, not multiplication by zero, because a zeroed logit still contributes exp(0) = 1 to the softmax and so receives attention probability. A good implementation discussion includes shape broadcasting, e.g. converting a padding mask from [batch, key_len] to [batch, 1, 1, key_len]. The candidate should flag numerical stability: torch.softmax subtracts the row maximum internally, but a row that is entirely masked with -inf yields NaN, and large finite fill values such as -1e9 are not representable in float16. One tradeoff to mention is readability versus performance: a clean tensorized implementation is acceptable for an interview, while production might use fused kernels such as FlashAttention or scaled_dot_product_attention. A strong close would be: “If I had more time, I’d add tests comparing against a reference implementation across causal masks, padding masks, all-padding edge cases, and float16/bfloat16 behavior.”
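The four steps can be sketched in NumPy as below. This is an illustrative reference, not a production kernel: the `key_padding_mask` convention (True means a real token), the square causal mask (query and key lengths aligned), and the choice to return zeros for fully masked rows are all assumptions.

```python
import numpy as np

def attention(q, k, v, causal=False, key_padding_mask=None):
    """q: [B, H, Lq, D]; k, v: [B, H, Lk, D]; key_padding_mask: [B, Lk] bool."""
    d = q.shape[-1]
    scores = (q @ np.swapaxes(k, -2, -1)) / np.sqrt(d)       # [B, H, Lq, Lk]

    allowed = np.ones(scores.shape, dtype=bool)
    if causal:
        Lq, Lk = scores.shape[-2:]
        allowed &= np.tril(np.ones((Lq, Lk), dtype=bool))    # key j <= query i
    if key_padding_mask is not None:
        pad = np.asarray(key_padding_mask, dtype=bool)
        allowed &= pad[:, None, None, :]                     # broadcast over H and Lq

    scores = np.where(allowed, scores, -np.inf)              # mask BEFORE softmax
    row_max = np.max(scores, axis=-1, keepdims=True)
    row_max = np.where(np.isfinite(row_max), row_max, 0.0)   # guard all-masked rows
    weights = np.exp(scores - row_max)                       # exp(-inf) == 0
    denom = weights.sum(axis=-1, keepdims=True)
    # Assumed design choice: fully masked rows get zero weights, not NaN.
    weights = np.where(denom > 0, weights / np.where(denom > 0, denom, 1.0), 0.0)
    return weights @ v                                       # [B, H, Lq, D]
```

Two quick invariance checks follow from the masking semantics: with `causal=True`, perturbing the last key and value must not change any earlier query's output, and with a padding mask, the values stored at padded key positions must not affect the result.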

A second angle

For “Debug a GRPO training loop and explain ratios”, the same fundamentals show up as log-prob accounting rather than attention math. The core invariant is that token-level log-probs, masks, rewards, and advantages must align over exactly the generated response tokens, not prompt tokens or padding. Instead of asking “does the tensor multiply work,” the interviewer is testing whether you can identify silent training bugs: recomputing old_logprobs with the new model, normalizing advantages across the wrong group, or including masked tokens in the loss. The importance ratio $\exp(\log p_{\text{new}}-\log p_{\text{old}})$ is mathematically simple but operationally fragile because tokenization, batching, and masking must be identical. The best answers combine formulas with concrete debug checks: print shapes, assert mask sums, inspect ratio distributions, and run a tiny deterministic batch.
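One such debug check can be sketched as a small helper that summarizes ratios over response tokens only. The function name, the [batch, seq_len] layout, and the `response_mask` convention (True exactly on generated tokens) are assumptions for illustration.

```python
import numpy as np

def ratio_debug_stats(new_logprobs, old_logprobs, response_mask):
    """Summarize importance ratios over generated response tokens only."""
    assert new_logprobs.shape == old_logprobs.shape == response_mask.shape
    mask = response_mask.astype(bool)
    assert mask.any(axis=1).all(), "some sequence has no response tokens"
    # Boolean indexing drops masked positions entirely, so garbage values
    # (even NaN) at prompt or padding positions cannot leak into the stats.
    ratio = np.exp(new_logprobs[mask] - old_logprobs[mask])
    return {
        "n_tokens": int(mask.sum()),
        "mean": float(ratio.mean()),
        "min": float(ratio.min()),
        "max": float(ratio.max()),
        "all_finite": bool(np.isfinite(ratio).all()),
    }
```

Assuming deterministic forward passes, the ratio should be exactly 1 on every response token before the first parameter update. If it stays at exactly 1 after several updates, the old log-probs are probably being recomputed with the new model.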

Common pitfalls

Pitfall: Treating ML primitives as black-box library calls.

Saying “I’d just use torch.autograd” or “I’d call nn.MultiheadAttention” misses the point. It is fine to use libraries in production, but in the interview you need to show that you understand the derivative, masking, or log-prob calculation well enough to catch bad outputs.

Pitfall: Ignoring numerical stability.

Wrong-but-tempting answers include computing log(sigmoid(x)) directly, applying softmax before masking, or exponentiating raw log-prob differences without checking range. Better answers keep values in logit or log-prob space, use logsumexp, subtract maxima before normalization, and inspect for NaN/inf.

Pitfall: Communicating only equations or only code.

A purely mathematical derivation can miss software failure modes like broadcasting bugs, dtype mismatches, and off-by-one causal masks. A purely code-level answer can miss why the code is correct. The strongest answers alternate between invariant, formula, tensor shape, and test.

Connections

Interviewers may pivot from here into model serving, especially batching, latency, and memory tradeoffs for LLM inference. They may also connect to distributed training, including gradient accumulation, mixed precision, and checkpointing, or to evaluation infrastructure, where classifier calibration and log-prob scoring become product-facing reliability concerns.
