
Debug a GRPO training loop and explain ratios

Last updated: Apr 28, 2026

Quick Overview

This question evaluates debugging and implementation knowledge for on-policy reinforcement learning, focusing on GRPO/PPO-style training loops, importance-sampling ratios, log-prob computations, masking, and advantage normalization.


Company: Anthropic

Role: Software Engineer

Category: Machine Learning

Difficulty: medium

Interview Round: Technical Screen

You are given a simplified implementation of a **GRPO (Group Relative Policy Optimization)** training step for an RLHF-style policy model. The training is meant to be **strictly on-policy** (rollouts are generated by the same policy that is being updated), but training is unstable, and you have been asked to walk through the loop, find the implementation bugs, and explain an anomaly in the importance-sampling ratio.

This is a discussion-and-debugging question: there is no single "right answer" to recite, but a strong response reasons crisply about the GRPO objective, the mechanics of an autoregressive policy-gradient loop, and the difference between behavior that is *expected by design* and behavior that is an *actual bug*.

### Constraints & Assumptions

- The model is an **autoregressive LLM**; the policy $\pi_\theta(a \mid s)$ is the per-token next-token distribution.
- GRPO is a critic-free PPO variant: the advantage baseline comes from a **group of completions** sampled for the same prompt, not from a learned value network.
- The loop is *intended* to be **strictly on-policy** (the policy that generates rollouts is the one being updated), so the candidate should treat "the ratio should be 1" as the stated expectation and reason about why reality differs.
- Assume a realistic modern stack is *possible* but not given; part of the exercise is asking which components are in play (single forward path vs. separate inference/training engines, number of update epochs per rollout batch, sampling settings).

### Clarifying Questions to Ask

- Is the reward **outcome-supervised** (one scalar per completion) or **process-supervised** (per-step rewards)? This changes how advantages are broadcast over tokens.
- How many **optimizer steps / minibatch epochs** are taken per rollout batch? One step vs. PPO-style multi-epoch fundamentally changes whether the ratio can stay at 1.
- Is generation done by the **same forward path** as scoring, or by a separate inference engine (e.g. a fast sampler) with log-probs recomputed in the trainer?
- What **sampling settings** are used at rollout time (temperature, top-p / top-k, repetition penalty, logit bias)? Are the same transforms applied when log-probs are recomputed?
- Is there a **KL penalty against a frozen reference model**, and is it folded into the reward or kept as a separate loss term?
- What is the **precision / attention backend** (bf16 vs fp32, fused vs eager) on each path?

### What a Strong Answer Covers

A strong response is graded on these *dimensions* (the bar is *did the candidate address it*, not whether they matched any particular wording):

- Articulates the **GRPO objective** and how its baseline differs from PPO's.
- Explains the **autoregressive log-prob / token-alignment** mechanics and where alignment errors can creep in.
- Reasons about **masking**: which tokens should count toward the loss, KL, and log-prob sums.
- Treats advantages with the correct **gradient semantics** and computes the baseline at the right **granularity**.
- Separates **"expected by design"** ratio deviations from **"actual staleness / mismatch"** bugs.
- For every claimed bug, gives a concrete **detection method** and a concrete **fix**, not just a name.

---

### 1. Walk through the end-to-end GRPO training flow

Explain one full training step, covering:

- **Sampling prompts** from the dataset.
- **Generating rollouts** (a group of completions per prompt).
- **Computing group-based advantages**, relative *within a group* of completions for the same prompt.
- **Computing the policy gradient loss**.
- **Updating the policy**.

```hint Where to start
Trace the data through one iteration: prompts in → a *group* of $G$ completions per prompt → a scalar reward per completion → an advantage per completion → a per-token loss → one optimizer step. As you go, label what is fixed (snapshotted at rollout time) vs. what moves during the update.
```

```hint The defining GRPO idea
GRPO drops the critic. Ask yourself: with no learned $V(s)$, where does the **baseline** for the advantage come from? Once you've answered that, push further: what extra statistic of the group does GRPO use beyond the baseline, and what would change if you removed it?
```

### 2. Find three common, easy-to-miss bugs

Training is unstable due to several straightforward implementation issues. Describe **three common, easy-to-miss bugs** in a GRPO/PPO-like training loop that would cause incorrect learning. Candidate areas include:

- wrong log-prob computation,
- incorrect masking,
- advantage-normalization mistakes,
- mixing policies.

For **each** bug, explain:

- **how you would detect it**, and
- **how you would fix it**.

```hint The log-prob path
In an autoregressive model, which logits position produces the distribution that the token at position $t$ is scored against? Convince yourself the indexing lines up, then think about a cheap sanity check you could run on a freshly sampled batch: what should the ratio equal before you've taken any step?
```

```hint The other two areas
For masking: enumerate the kinds of positions in a padded prompt+completion sequence and ask which of them should actually contribute to the loss, KL, and log-prob sums. For the advantage: think about its role in the gradient (is it something you differentiate through, or a fixed scalar?) and over which set of completions its statistics should be computed. Probe each with a degenerate case to see where it breaks.
```

### 3. Explain why the importance-sampling ratio deviates from 1

During debugging you notice that the importance-sampling ratio

$$
\text{ratio} = \exp(\log \pi_{\theta}(a \mid s) - \log \pi_{\text{old}}(a \mid s))
$$

is **not always 1**, even though you expected the method to be strictly on-policy.

Explain **why the ratio might deviate from 1** in practice. List the most likely causes, and for each, state **what to check in the training pipeline** to confirm it.

```hint Frame the invariant first
Write down the precise conditions under which `ratio` would be exactly 1 (think: which parameters, which inputs/masks, which forward mode produce `logp_new` and `logp_old`). A deviation means one of those conditions is violated, so your causes are just the distinct ways each condition can break. Sort them into "expected by design" vs. "actual staleness/mismatch."
```

```hint Two families to consider
One family is benign and follows from *how many gradient steps* you take per rollout batch: work out what happens to a snapshotted `logp_old` after the first inner update. The other family arises when the path that *generated* the tokens isn't bit-for-bit the path that *scored* them: brainstorm every way two forward passes on "the same" weights could still disagree (think about the engine, numerics, weight freshness, and any transforms applied only at sampling time).
```

```hint Turning causes into checks
For each cause you list, design a discriminating experiment: what would you recompute or compare, on how many sequences, to confirm or rule it out? Think about what the ratio should look like *before* vs. *after* the first optimizer step, and how the *magnitude* of the deviation helps you tell harmless numerical drift apart from a real mismatch.
```

---

### Follow-up Questions

- How does the **clipped surrogate objective** prevent the (now non-trivial) ratio from destabilizing training, and what does the `min` / clip range actually buy you?
- If a group of completions all receive the **same reward**, what advantage do they get, and should that group be skipped or contribute zero gradient? What numerical safeguard prevents a divide-by-zero?
- In a real `generate`-then-`train` stack where the engine-vs-trainer log-prob mismatch is unavoidable, what is the principled correction, and why is "just pretend it's on-policy" wrong?
- How would your reasoning change for **process supervision** (per-step rewards) versus **outcome supervision** (one scalar per completion)?
- Where should the **KL-to-reference** term live (folded into the reward or as a separate loss), and what are the trade-offs?

Feb 19, 2026, 12:00 AM

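The clarifying question about sampling settings can be made concrete with a toy example. The sketch below, in plain Python, shows how a temperature applied only at sampling time makes the ratio differ from 1 before any optimizer step. The logits, the temperature of 0.7, and the helper name `log_softmax` are illustrative assumptions, not taken from the interview code.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits / temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - lse for x in scaled]

# Hypothetical next-token logits and the token that was sampled.
logits = [2.0, 1.0, 0.0]
token = 0

# The sampler drew the token at temperature 0.7 and stored this log-prob.
logp_sampler = log_softmax(logits, temperature=0.7)[token]

# Trainer recomputes on the same weights but forgets the temperature.
logp_trainer_raw = log_softmax(logits, temperature=1.0)[token]
# Trainer recomputes with the same transform as the sampler.
logp_trainer_matched = log_softmax(logits, temperature=0.7)[token]

ratio_mismatched = math.exp(logp_trainer_raw - logp_sampler)
ratio_matched = math.exp(logp_trainer_matched - logp_sampler)
print(ratio_mismatched, ratio_matched)
```

With matched transforms the two log-probs are identical and the ratio is exactly 1. The mismatched ratio is off by more than ten percent here even though no parameter has changed, which is a pipeline mismatch rather than policy drift.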

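The group-based advantage step in the training flow can be sketched in a few lines. This is a minimal illustration assuming outcome supervision (one scalar reward per completion) and the common mean/std normalization within each prompt's group. The function name `group_advantages`, the `eps` value, and the reward numbers are hypothetical.

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """Advantages for ONE prompt's group of completions.

    The baseline is the group mean (no learned critic), and the result
    is scaled by the group's standard deviation. `eps` (an assumed
    value) guards against division by zero when all rewards are equal.
    """
    mean = statistics.fmean(rewards)
    # Population std is used here; some codebases use the sample std.
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Two prompts, G = 4 completions each (made-up rewards).
rewards_by_prompt = [
    [1.0, 0.0, 0.0, 1.0],  # mixed outcomes: informative group
    [0.0, 0.0, 0.0, 0.0],  # identical outcomes: no learning signal
]

# Statistics are computed per group, never across the whole batch.
advantages = [group_advantages(group) for group in rewards_by_prompt]
print(advantages)
```

The advantages are plain numbers, so nothing is differentiated through them. In an autograd framework the equivalent would be to compute them without gradient tracking and broadcast each completion's scalar over its completion tokens.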

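Two of the candidate bug areas, log-prob alignment and masking, can be illustrated together. The sketch below uses plain Python lists in place of tensors. `completion_logprobs`, `masked_mean`, `prompt_len`, and `pad_id` are hypothetical names, and the sketch assumes the pad id is distinct from the end-of-sequence id (a real loop would normally use an explicit attention mask).

```python
import math

def log_softmax(row):
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def completion_logprobs(logits, tokens, prompt_len, pad_id):
    """Per-token log-probs and a completion mask for positions 1..T-1.

    logits[t - 1] is the distribution the token at position t is scored
    against; using logits[t] instead is the classic off-by-one bug.
    """
    logps, mask = [], []
    for t in range(1, len(tokens)):
        logps.append(log_softmax(logits[t - 1])[tokens[t]])
        # Only completion tokens count: no prompt tokens, no padding.
        keep = t >= prompt_len and tokens[t] != pad_id
        mask.append(1.0 if keep else 0.0)
    return logps, mask

def masked_mean(values, mask):
    # Divide by the number of kept tokens, not the padded length.
    return sum(v * m for v, m in zip(values, mask)) / max(sum(mask), 1.0)

# Toy sequence: 2 prompt tokens, 1 completion token, 1 pad (vocab of 3).
tokens = [1, 2, 1, 0]
logits = [[0.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
logps, mask = completion_logprobs(logits, tokens, prompt_len=2, pad_id=0)
print(mask, masked_mean(logps, mask))
```

In this toy case only the single completion token contributes, and it is scored with the logits from the position before it. A cheap detection check along these lines is to compare the result against a version with shifted indexing or an all-ones mask and confirm the numbers change as expected.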

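A small diagnostic makes the ratio question measurable. The sketch below summarizes the per-token ratio over completion tokens only, so it can be logged once before the first optimizer step (where an on-policy loop with a single forward path should report exactly 1) and again afterwards. The function name `ratio_stats` and all numbers are illustrative assumptions.

```python
import math

def ratio_stats(logp_new, logp_old, mask):
    """Summarize exp(logp_new - logp_old) over unmasked tokens."""
    diffs = [n - o for n, o, m in zip(logp_new, logp_old, mask) if m]
    if not diffs:
        return {"n_tokens": 0, "mean_ratio": 1.0, "max_abs_log_ratio": 0.0}
    ratios = [math.exp(d) for d in diffs]
    return {
        "n_tokens": len(diffs),
        "mean_ratio": sum(ratios) / len(ratios),
        "max_abs_log_ratio": max(abs(d) for d in diffs),
    }

# Log-probs snapshotted at rollout time; the last position is padding.
logp_old = [-1.2, -0.7, -2.3, -5.0]
mask = [1, 1, 1, 0]

# Same weights, same forward path, no step taken yet.
before_step = ratio_stats(list(logp_old), logp_old, mask)

# After one inner update the policy has moved on the second token.
logp_after = [-1.2, -0.6, -2.3, -9.0]
after_step = ratio_stats(logp_after, logp_old, mask)
print(before_step, after_step)
```

A nonzero `max_abs_log_ratio` before any step points to a mismatch between the generating and scoring paths, while a deviation that appears only after the first inner update is expected by design. The padded position changes a lot in this example but is excluded by the mask, which is why an incorrect mask can also show up as a spurious ratio anomaly.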

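For the follow-up on the clipped surrogate objective, a per-token sketch shows what the `min` and the clip range do in each of the four sign cases. The function name `clipped_surrogate` and the clip range of 0.2 are assumptions chosen for illustration. The value returned is the objective to maximize, so a loss would be its negative.

```python
import math

def clipped_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO-style clipped objective for one token (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    # Taking the min keeps the more pessimistic of the two estimates.
    return min(ratio * advantage, clipped * advantage)

up = math.log(1.5)    # ratio = 1.5, above the clip range
down = math.log(0.5)  # ratio = 0.5, below the clip range

cases = {
    "ratio_up_adv_pos": clipped_surrogate(up, 0.0, 1.0),
    "ratio_up_adv_neg": clipped_surrogate(up, 0.0, -1.0),
    "ratio_down_adv_pos": clipped_surrogate(down, 0.0, 1.0),
    "ratio_down_adv_neg": clipped_surrogate(down, 0.0, -1.0),
}
print(cases)
```

When the ratio has already moved in the direction the advantage rewards (up with a positive advantage, down with a negative one), the clipped term is selected and is constant in the parameters, so that token stops contributing gradient. When the ratio has moved the wrong way, the unclipped term is kept and the gradient still pushes it back.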
