ML Fundamentals: Backprop, Attention, And RL
Asked of: Software Engineer

What's being tested
This area tests whether you can implement and reason about machine learning primitives at the level needed to build reliable model-facing software: gradients, log-probabilities, tensor shapes, masking, numerical stability, and training-loop invariants. For a Software Engineer, the bar is not inventing new architectures; it is being able to read, modify, debug, and productionize ML code without silently changing model behavior. Anthropic cares because small engineering mistakes in attention masks, log-prob aggregation, or policy-gradient ratios can create large downstream failures: broken evaluations, unstable training, incorrect classifier scores, or wasted accelerator time. Interviewers are probing for a combination of mathematical literacy, implementation discipline, and debugging judgment.
Core knowledge
-
Backpropagation is repeated application of the chain rule over a computation graph. For a parameter $\theta$, compute $\partial L / \partial \theta$ by caching forward intermediates, propagating upstream gradients backward, and checking that every gradient tensor’s shape matches its corresponding activation or parameter.
-
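Binary cross-entropy should be implemented from logits, not probabilities, for stability. Prefer BCEWithLogitsLoss-style math for a logit $x$ and label $y \in \{0, 1\}$:
$$\ell(x, y) = \max(x, 0) - x y + \log\left(1 + e^{-|x|}\right)$$
This avoids the overflow in exp and the log(0) that a naive log(sigmoid(x)) runs into when $|x|$ is large.
-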
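Gradient checking compares analytic gradients to finite differences:
$$\frac{\partial L}{\partial \theta_i} \approx \frac{L(\theta + \epsilon e_i) - L(\theta - \epsilon e_i)}{2\epsilon}$$
Use float64, small networks, and a small step size $\epsilon$. Expect relative error around 1e-6 to 1e-4; worse often means a broadcasting or reduction bug.
-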
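Scaled dot-product attention computes
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}} + M\right) V$$
where $M$ encodes masks (0 for allowed positions, $-\infty$ for blocked ones). Scaling by $1/\sqrt{d_k}$ prevents logits from growing with the key dimension $d_k$ and saturating the softmax.
-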
Masking semantics are a common failure point. A causal mask prevents token $i$ from attending to future positions $j > i$; a padding mask prevents attending to non-real tokens. In PyTorch, masks must broadcast correctly across batch and head dimensions.
-
Numerically stable softmax subtracts the row-wise maximum before exponentiation:
softmax(x) = exp(x - max(x)) / sum(exp(x - max(x))). In mixed precision, use float32 accumulation for attention scores when possible, especially before softmax.
-
Log-probabilities should be aggregated in log space. For binary classification using an LLM, compare candidate label sequences using sums of token log-probs, and normalize if label strings have different lengths. Use log-sum-exp for stable probability normalization.
-
Calibration separates ranking from probability quality. An LLM may assign higher log-prob to the correct class but produce poorly calibrated probabilities. Simple post-hoc methods such as temperature scaling can improve probability estimates without changing the underlying rank order.
-
On-policy reinforcement learning loops require log-probs from the same policy distribution that generated the sampled responses, or carefully bounded importance sampling. In PPO/GRPO-style code, stale samples, recomputed masks, or mismatched tokenization can corrupt ratios.
-
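Importance-sampling ratios usually take the form
$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)} = \exp\left(\log \pi_\theta(a_t \mid s_t) - \log \pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)\right)$$
Compute in log space, clamp or clip only where the algorithm specifies, and inspect ratio histograms for explosions or collapse.
-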
Advantage normalization improves training stability but must respect grouping and masking semantics. In GRPO-style setups, advantages may be normalized within a group of completions for the same prompt; including padding tokens or mixing unrelated prompts changes the learning signal.
-
Testing ML code requires shape tests, numerical tests, and invariance tests. For attention, test all-masked rows, single-token sequences, causal behavior, padding behavior, and parity with torch.nn.functional.scaled_dot_product_attention where applicable.
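The binary cross-entropy and gradient-checking points above can be combined into one small check. The following is a minimal sketch in plain Python, not a reference implementation: the function names are hypothetical, the step size `eps=1e-5` is an assumed value, and it checks a single scalar logit rather than a network.

```python
import math

def bce_with_logits(x: float, y: float) -> float:
    """Stable binary cross-entropy for one logit x and one label y in {0, 1}."""
    # max(x, 0) - x*y + log(1 + exp(-|x|)): exp only ever sees a non-positive argument.
    return max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))

def sigmoid(x: float) -> float:
    # Branch on the sign so exp() never receives a large positive argument.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def analytic_grad(x: float, y: float) -> float:
    # d/dx of the loss above is sigmoid(x) - y.
    return sigmoid(x) - y

def numeric_grad(x: float, y: float, eps: float = 1e-5) -> float:
    # Central difference; eps is an assumed step size, not a prescribed one.
    return (bce_with_logits(x + eps, y) - bce_with_logits(x - eps, y)) / (2 * eps)

def relative_error(a: float, b: float) -> float:
    return abs(a - b) / max(abs(a) + abs(b), 1e-12)

if __name__ == "__main__":
    for x in (-3.0, 0.7, 4.0):
        for y in (0.0, 1.0):
            print(x, y, relative_error(analytic_grad(x, y), numeric_grad(x, y)))
```

The same pattern extends to tensors by perturbing one parameter entry at a time; a large relative error on only some entries often points at a broadcasting or reduction mistake in the analytic gradient.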
Worked example
For “Implement and analyze custom attention”, a strong candidate starts by clarifying tensor layout: “I’ll assume Q, K, and V have shape [batch, heads, seq_len, head_dim], and I’ll return [batch, heads, query_len, head_dim].” They should ask whether the mask is causal, padding, or both, and whether the implementation must support mixed precision. The answer can be organized into four steps: compute scores with Q @ K.transpose(-2, -1), divide by sqrt(head_dim), apply masks before softmax, then multiply the resulting weights by V. They should explicitly say that masks should use a large negative additive value or boolean masking, not multiplication by zero, because a zeroed logit still contributes exp(0) = 1 to the softmax and therefore still receives attention probability. A good implementation discussion includes shape broadcasting, e.g. converting a padding mask from [batch, key_len] to [batch, 1, 1, key_len]. The candidate should flag numerical stability: torch.softmax subtracts the row maximum internally, but the mask fill value still needs care, since a constant such as -1e9 is not representable in float16 and -inf produces NaN for rows where every key is masked. One tradeoff to mention is readability versus performance: a clean tensorized implementation is acceptable for an interview, while production might use fused kernels such as FlashAttention or scaled_dot_product_attention. A strong close would be: “If I had more time, I’d add tests comparing against a reference implementation across causal masks, padding masks, all-padding edge cases, and float16/bfloat16 behavior.”
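The four steps described above can be sketched as follows. This is an illustrative version written with NumPy rather than PyTorch so it runs without an accelerator stack; the function name `attention`, the `key_padding_mask` convention (True marks a real token), and the choice to return zeros for fully masked rows are all assumptions made for this sketch. The causal branch assumes query and key lengths are equal.

```python
import numpy as np

def attention(q, k, v, key_padding_mask=None, causal=False):
    """q: [B, H, Tq, D]; k, v: [B, H, Tk, D].
    key_padding_mask: optional bool array [B, Tk], True = real token (assumed convention).
    Returns (output [B, H, Tq, D], weights [B, H, Tq, Tk])."""
    d = q.shape[-1]
    scores = (q @ np.swapaxes(k, -2, -1)) / np.sqrt(d)       # [B, H, Tq, Tk]

    # Build a boolean "allowed" mask and broadcast it to the score shape.
    allowed = np.ones(scores.shape[-2:], dtype=bool)          # [Tq, Tk]
    if causal:
        allowed = np.tril(allowed)                            # position i sees j <= i
    allowed = np.broadcast_to(allowed, scores.shape)
    if key_padding_mask is not None:
        allowed = allowed & key_padding_mask[:, None, None, :]  # [B, 1, 1, Tk]

    # Mask before softmax: blocked logits become -inf, not 0.
    scores = np.where(allowed, scores, -np.inf)

    # Stable softmax; guard rows where every key is masked (row max is -inf).
    row_max = scores.max(axis=-1, keepdims=True)
    row_max = np.where(np.isfinite(row_max), row_max, 0.0)
    weights = np.exp(scores - row_max)                        # exp(-inf) == 0
    denom = weights.sum(axis=-1, keepdims=True)
    weights = np.divide(weights, denom, out=np.zeros_like(weights), where=denom > 0)
    return weights @ v, weights
```

Returning all-zero weights for a fully masked row is one possible policy; other implementations propagate NaN or raise, so it is worth stating which behavior is intended and covering it with a test.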
A second angle
For “Debug a GRPO training loop and explain ratios”, the same fundamentals show up as log-prob accounting rather than attention math. The core invariant is that token-level log-probs, masks, rewards, and advantages must align over exactly the generated response tokens, not prompt tokens or padding. Instead of asking “does the tensor multiply work,” the interviewer is testing whether you can identify silent training bugs: recomputing old_logprobs with the new model, normalizing advantages across the wrong group, or including masked tokens in the loss. The importance ratio is mathematically simple but operationally fragile because tokenization, batching, and masking must be identical. The best answers combine formulas with concrete debug checks: print shapes, assert mask sums, inspect ratio distributions, and run a tiny deterministic batch.
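The alignment invariant above can be made concrete with a small sketch. It uses a PPO-style clipped surrogate with group-normalized advantages as a stand-in; actual GRPO implementations differ in details such as KL penalties and per-sequence versus per-token averaging, so treat the function names, `clip_eps=0.2`, and the averaging choice as assumptions rather than the definition of the algorithm.

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """rewards: [num_prompts, group_size], one scalar reward per completion.
    Normalizes within each prompt's group, never across prompts."""
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

def clipped_policy_loss(new_logprobs, old_logprobs, advantages, response_mask,
                        clip_eps=0.2):
    """new_logprobs, old_logprobs, response_mask: [N, T]; advantages: [N].
    response_mask is 1 on generated response tokens, 0 on prompt and padding."""
    # Debug checks: shapes must agree and every sample needs response tokens.
    assert new_logprobs.shape == old_logprobs.shape == response_mask.shape
    assert np.all(response_mask.sum(axis=1) > 0)

    # Ratio in log space; zero the masked positions before exponentiating.
    log_ratio = (new_logprobs - old_logprobs) * response_mask
    ratio = np.exp(log_ratio)

    adv = advantages[:, None]                                  # broadcast over tokens
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    per_token = -np.minimum(unclipped, clipped)

    # Average over response tokens only, so padding cannot dilute the loss.
    loss = (per_token * response_mask).sum() / response_mask.sum()
    return loss, ratio
```

A quick sanity check is to pass `new_logprobs` equal to `old_logprobs`: every ratio should be exactly 1, and changing values at masked positions should leave the loss unchanged.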
Common pitfalls
Pitfall: Treating ML primitives as black-box library calls.
Saying “I’d just use torch.autograd” or “I’d call nn.MultiheadAttention” misses the point. It is fine to use libraries in production, but in the interview you need to show that you understand the derivative, masking, or log-prob calculation well enough to catch bad outputs.
Pitfall: Ignoring numerical stability.
Wrong-but-tempting answers include computing log(sigmoid(x)) directly, applying softmax before masking, or exponentiating raw log-prob differences without checking range. Better answers keep values in logit or log-prob space, use logsumexp, subtract maxima before normalization, and inspect for NaN/inf.
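As one illustration of staying in log space, the sketch below scores candidate label sequences by their summed token log-probs and normalizes with log-sum-exp. The function names and the input format (a dict from label to a list of per-token log-probs) are hypothetical, and mean-per-token length normalization is only one of several possible choices.

```python
import math

def logsumexp(values):
    """log(sum(exp(v))) computed by factoring out the maximum."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))

def label_log_probs(token_logprobs_by_label, length_normalize=False):
    """token_logprobs_by_label: {label: [log-prob of each label token]}.
    Returns {label: normalized log-probability over the candidate labels}."""
    scores = {}
    for label, token_lps in token_logprobs_by_label.items():
        total = sum(token_lps)                      # sequence log-prob = sum of token log-probs
        if length_normalize:
            total /= len(token_lps)                 # assumed choice: mean log-prob per token
        scores[label] = total
    z = logsumexp(list(scores.values()))            # normalizer, still in log space
    return {label: s - z for label, s in scores.items()}

if __name__ == "__main__":
    result = label_log_probs({"yes": [-0.1], "no": [-2.0, -0.5]})
    print({label: math.exp(lp) for label, lp in result.items()})
```

Because only differences from the maximum are exponentiated, the normalization stays finite even when the raw sequence log-probs are very negative.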
Pitfall: Communicating only equations or only code.
A purely mathematical derivation can miss software failure modes like broadcasting bugs, dtype mismatches, and off-by-one causal masks. A purely code-level answer can miss why the code is correct. The strongest answers alternate between invariant, formula, tensor shape, and test.
Connections
Interviewers may pivot from here into model serving, especially batching, latency, and memory tradeoffs for LLM inference. They may also connect to distributed training, including gradient accumulation, mixed precision, and checkpointing, or to evaluation infrastructure, where classifier calibration and log-prob scoring become product-facing reliability concerns.
Further reading
-
Deep Learning by Goodfellow, Bengio, and Courville — strong reference for backpropagation, optimization, and numerical foundations.
-
Attention Is All You Need — original Transformer paper; useful for understanding scaled dot-product attention and masking.
-
Proximal Policy Optimization Algorithms — canonical source for PPO-style ratios, clipping, and policy-gradient stability.
Practice questions
- Debug a GRPO training loop and explain ratios (Anthropic · Software Engineer · Technical Screen · medium)
- Implement and derive backprop from scratch (Anthropic · Software Engineer · Onsite · medium)
- Design an LLM-based binary classifier (Anthropic · Software Engineer · Technical Screen · medium)
- Implement and analyze custom attention (Anthropic · Software Engineer · Onsite · hard)
Related concepts
- Supervised ML Fundamentals, Evaluation And Feature Engineering (Machine Learning)
- ML System Design, Recommenders, Forecasting And Allocation (Machine Learning)
- Supervised ML, Imbalance, Overfitting, And Optimization (Machine Learning)
- ML Frameworks, Model Compilation, And Parallelism (ML System Design)
- ML Model Evaluation, Metrics, And Experimentation (ML System Design)
- RLHF And Preference Optimization Basics