Debug a GRPO training loop and explain ratios
Company: Anthropic
Role: Software Engineer
Category: Machine Learning
Difficulty: medium
Interview Round: Technical Screen
You are given a simplified implementation of a **GRPO (Group Relative Policy Optimization)** training step for an RLHF-style policy model. The training is meant to be **strictly on-policy** — rollouts are generated by the same policy that is being updated — but training is unstable, and you have been asked to walk through the loop, find the implementation bugs, and explain an anomaly in the importance-sampling ratio.
This is a discussion-and-debugging question: there is no single "right answer" to recite, but a strong response reasons crisply about the GRPO objective, the mechanics of an autoregressive policy-gradient loop, and the difference between behavior that is *expected by design* and behavior that is an *actual bug*.
### Constraints & Assumptions
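- The model is an **autoregressive LLM**; the policy $\pi_\theta(a \mid s)$ is the per-token next-token distribution, where the state $s$ is the prompt plus the tokens generated so far and the action $a$ is the next token.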
- GRPO is a critic-free PPO variant: the advantage baseline comes from a **group of completions** sampled for the same prompt, not from a learned value network.
- The loop is *intended* to be **strictly on-policy** (the policy that generates rollouts is the one being updated), so the candidate should treat "the ratio should be 1" as the stated expectation and reason about why reality differs.
- The stack is deliberately unspecified. Part of the exercise is asking which components are in play: a single forward path vs. separate inference/training engines, the number of update epochs per rollout batch, and the sampling settings.
### Clarifying Questions to Ask
- Is the reward **outcome-supervised** (one scalar per completion) or **process-supervised** (per-step rewards)? This changes how advantages are broadcast over tokens.
- How many **optimizer steps / minibatch epochs** are taken per rollout batch? One step vs. PPO-style multi-epoch fundamentally changes whether the ratio can stay at 1.
- Is generation done by the **same forward path** as scoring, or by a separate inference engine (e.g. a fast sampler) with log-probs recomputed in the trainer?
- What **sampling settings** are used at rollout time (temperature, top-p / top-k, repetition penalty, logit bias)? Are the same transforms applied when log-probs are recomputed?
- Is there a **KL penalty against a frozen reference model**, and is it folded into the reward or kept as a separate loss term?
- What is the **precision / attention backend** (bf16 vs fp32, fused vs eager) on each path?
### What a Strong Answer Covers
A strong response is graded on these *dimensions* (the bar is *did the candidate address it*, not whether they matched any particular wording):
- Articulates the **GRPO objective** and how its baseline differs from PPO's.
- Explains the **autoregressive log-prob / token-alignment** mechanics and where alignment errors can creep in.
- Reasons about **masking** — which tokens should count toward the loss, KL, and log-prob sums.
- Treats advantages with the correct **gradient semantics** and computes the baseline at the right **granularity**.
- Separates **"expected by design"** ratio deviations from **"actual staleness / mismatch"** bugs.
- For every claimed bug, gives a concrete **detection method** and a concrete **fix** — not just a name.
---
### 1. Walk through the end-to-end GRPO training flow
Explain one full training step, covering:
- **Sampling prompts** from the dataset.
- **Generating rollouts** (a group of completions per prompt).
- **Computing group-based advantages** — relative *within a group* of completions for the same prompt.
- **Computing the policy gradient loss**.
- **Updating the policy**.
```hint Where to start
Trace the data through one iteration: prompts in → a *group* of $G$ completions per prompt → a scalar reward per completion → an advantage per completion → a per-token loss → one optimizer step. As you go, label what is fixed (snapshotted at rollout time) vs. what moves during the update.
```
```hint The defining GRPO idea
GRPO drops the critic. Ask yourself: with no learned $V(s)$, where does the **baseline** for the advantage come from? Once you've answered that, push further — what extra statistic of the group does GRPO use beyond the baseline, and what would change if you removed it?
```
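The group-relative advantage step can be sketched in a few lines. This is a minimal illustration, not the implementation under review: the function name `group_advantages`, the use of the population standard deviation, and the `eps` value are all assumptions made for the example.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Advantage of each completion relative to its own group (one prompt).

    Hypothetical helper: the baseline is the group mean, and the result is
    scaled by the group standard deviation. `eps` (an assumed value) keeps
    the division finite when every reward in the group is identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the G completions
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, G = 4 completions, one outcome reward per completion.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_advantages(rewards)

# Degenerate group: identical rewards give zero advantage for every completion.
flat = group_advantages([2.0, 2.0, 2.0])
```

Under outcome supervision, each completion's scalar advantage would then be broadcast to all of its completion tokens. The statistics are computed per group, so pooling rewards across different prompts would be a different (and here unintended) baseline.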
### 2. Find three common, easy-to-miss bugs
Assume the instability comes from a few simple implementation mistakes. Describe **three common, easy-to-miss bugs** in a GRPO/PPO-like training loop that would cause incorrect learning. Candidate areas include:
- wrong log-prob computation,
- incorrect masking,
- advantage-normalization mistakes,
- mixing policies.
For **each** bug, explain:
- **how you would detect it**, and
- **how you would fix it**.
```hint The log-prob path
In an autoregressive model, which logits position produces the distribution that the token at position $t$ is scored against? Convince yourself the indexing lines up, then think about a cheap sanity check you could run on a freshly sampled batch — what should the ratio equal before you've taken any step?
```
```hint The other two areas
For masking: enumerate the kinds of positions in a padded prompt+completion sequence and ask which of them should actually contribute to the loss, KL, and log-prob sums. For the advantage: think about its role in the gradient (is it something you differentiate through, or a fixed scalar?) and over which set of completions its statistics should be computed. Probe each with a degenerate case to see where it breaks.
```
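The token-alignment and masking points can be made concrete with a small pure-Python sketch. The names `log_softmax` and `completion_logprobs`, the toy logits, and the `prompt_len` convention are assumptions for illustration; a real trainer would do this on batched tensors and would also need to exclude padding positions.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def completion_logprobs(logits, token_ids, prompt_len):
    """Per-token log-probs for completion tokens only (hypothetical helper).

    logits[t] parameterizes the distribution over the token at position
    t + 1, so the token at position t is scored against logits[t - 1].
    Prompt tokens (positions < prompt_len) are excluded from the result.
    Assumes prompt_len >= 1.
    """
    return [
        log_softmax(logits[t - 1])[token_ids[t]]
        for t in range(prompt_len, len(token_ids))
    ]


# Toy sequence: 2 prompt tokens followed by 2 completion tokens, vocab size 3.
logits = [
    [0.1, 0.2, 0.3],
    [2.0, 0.0, -1.0],
    [0.0, 1.0, 0.0],
    [0.5, 0.5, 0.5],  # predicts a token past the end; never used for scoring
]
token_ids = [0, 2, 0, 1]
prompt_len = 2

# Sanity check before any optimizer step: same weights, same inputs,
# so the ratio should be exactly 1 on every completion token.
logp_old = completion_logprobs(logits, token_ids, prompt_len)
logp_new = completion_logprobs(logits, token_ids, prompt_len)
ratios = [math.exp(n - o) for n, o in zip(logp_new, logp_old)]
```

An off-by-one version that indexes `logits[t]` instead of `logits[t - 1]` would still run without error, which is why a check against a known-good reference on a freshly sampled batch is worth having.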
### 3. Explain why the importance-sampling ratio deviates from 1
During debugging you notice that the importance-sampling ratio
$$
\text{ratio} = \exp(\log \pi_{\theta}(a \mid s) - \log \pi_{\text{old}}(a \mid s))
$$
is **not always 1**, even though you expected the method to be strictly on-policy.
Explain **why the ratio might deviate from 1** in practice. List the most likely causes, and for each, state **what to check in the training pipeline** to confirm it.
```hint Frame the invariant first
Write down the precise conditions under which `ratio` would be exactly 1 (think: which parameters, which inputs/masks, which forward mode produce `logp_new` and `logp_old`). A deviation means one of those conditions is violated — so your causes are just the distinct ways each condition can break. Sort them into "expected by design" vs. "actual staleness/mismatch."
```
```hint Two families to consider
One family is benign and follows from *how many gradient steps* you take per rollout batch — work out what happens to a snapshotted `logp_old` after the first inner update. The other family arises when the path that *generated* the tokens isn't bit-for-bit the path that *scored* them — brainstorm every way two forward passes on "the same" weights could still disagree (think about the engine, numerics, weight freshness, and any transforms applied only at sampling time).
```
```hint Turning causes into checks
For each cause you list, design a discriminating experiment: what would you recompute or compare, on how many sequences, to confirm or rule it out? Think about what the ratio should look like *before* vs. *after* the first optimizer step, and how the *magnitude* of the deviation helps you tell harmless numerical drift apart from a real mismatch.
```
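One of the mismatch causes, a transform applied only at sampling time, is easy to reproduce in isolation. The sketch below assumes a sampler that divides logits by a temperature of 0.7 while the trainer recomputes log-probs at temperature 1.0; the function `logprob` and all numbers are illustrative, not taken from any particular engine.

```python
import math


def logprob(logits, token, temperature=1.0):
    """Log-prob of `token` under softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(x - m) for x in scaled))
    return scaled[token] - lse


logits = [2.0, 1.0, 0.0]  # toy next-token logits, identical "weights" on both paths
token = 0

# Rollout path: the sampler applied temperature 0.7 (assumed setting).
logp_sampler = logprob(logits, token, temperature=0.7)

# Trainer path A: recomputes without the temperature transform.
ratio_mismatch = math.exp(logprob(logits, token, temperature=1.0) - logp_sampler)

# Trainer path B: applies the same transform as the sampler.
ratio_matched = math.exp(logprob(logits, token, temperature=0.7) - logp_sampler)
```

Here `ratio_matched` is exactly 1 while `ratio_mismatch` is well away from 1, even though no gradient step has been taken. A deviation of that size before the first optimizer step points to a pipeline mismatch rather than numerical noise.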
---
### Follow-up Questions
- How does the **clipped surrogate objective** prevent the (now non-trivial) ratio from destabilizing training, and what does the `min` / clip range actually buy you?
- If a group of completions all receive the **same reward**, what advantage do they get, and should that group be skipped or contribute zero gradient? What numerical safeguard prevents a divide-by-zero?
- In a real `generate`-then-`train` stack where the engine-vs-trainer log-prob mismatch is unavoidable, what is the principled correction — and why is "just pretend it's on-policy" wrong?
- How would your reasoning change for **process supervision** (per-step rewards) versus **outcome supervision** (one scalar per completion)?
- Where should the **KL-to-reference** term live — folded into the reward or as a separate loss — and what are the trade-offs?
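For the clipped-surrogate follow-up, a per-token sketch helps show which updates the `min` actually caps. The function name `clipped_surrogate` and the clip range of 0.2 are assumptions for illustration; the objective is written in its "to be maximized" form.

```python
def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """PPO-style clipped objective for one token (to be maximized).

    `clip_eps` is an assumed value. The advantage is treated as a fixed
    scalar, not something to differentiate through.
    """
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)


# Positive advantage, ratio already above the range: the objective is capped,
# so there is no incentive to push the ratio higher.
capped_up = clipped_surrogate(1.5, 1.0)

# Negative advantage, ratio already below the range: capped as well.
capped_down = clipped_surrogate(0.5, -1.0)

# Ratio moved in the "wrong" direction for its advantage: not clipped,
# so the full corrective signal is kept.
uncapped = clipped_surrogate(0.5, 1.0)
```

The asymmetry is the design point: taking the `min` makes the objective a pessimistic bound, limiting gains from moving further in the favored direction while leaving corrections untouched.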
Quick Answer: This question evaluates debugging and implementation knowledge for on-policy reinforcement learning, focusing on GRPO/PPO-style training loops, importance-sampling ratios, log-prob computations, masking, and advantage normalization.