Debug a GRPO training loop and explain ratios

Q: Debug a GRPO training loop and explain ratios

This question evaluates debugging and implementation knowledge for on-policy reinforcement learning, focusing on GRPO/PPO-style training loops, importance-sampling ratios, log-prob computations, masking, and advantage normalization.

Q: What difficulty level is this interview question?

This is a medium difficulty Machine Learning question, commonly asked during Technical Screen rounds at Anthropic.

Q: What role is this question designed for?

This question is commonly asked for Software Engineer candidates at Anthropic during technical interviews.

Question

You are given a simplified implementation of a GRPO (Group Relative Policy Optimization) training step for an RLHF-style policy model. The training is supposed to be strictly on-policy, meaning rollouts are generated by the same policy that is being updated.

Tasks:

1. Walk through the end-to-end GRPO training flow: sampling prompts, generating rollouts, computing group-based advantages (each completion scored relative to the other completions for the same prompt), computing the policy-gradient loss, and updating the policy.
2. You find that training is unstable because of several straightforward implementation issues. Describe three common, easy-to-miss bugs in a GRPO/PPO-like training loop that would cause incorrect learning, such as wrong log-prob computation, incorrect masking, advantage normalization mistakes, or mixing policies. For each, explain how you would detect it and how to fix it.
3. During debugging you notice the importance-sampling ratio

$\text{ratio} = \exp(\log \pi_{\theta}(a|s) - \log \pi_{\text{old}}(a|s))$

is not always 1, even though you expected the method to be strictly on-policy. Explain why the ratio might deviate from 1 in practice. List the most likely causes and what to check in the training pipeline to confirm each cause.
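
As a starting point for the first and third tasks, the following framework-free Python sketch may help make the moving parts concrete. It is not the implementation the question refers to. All function names (`group_advantages`, `log_softmax`, `token_logprobs`, `token_ratios`), the toy logits, and the sampling temperature of 0.7 are assumptions chosen for illustration. It standardizes rewards within one prompt's group, computes per-token log-probs, and shows one plausible way the ratio can drift from 1 with identical weights: the sampler scores tokens at a different temperature than the trainer.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: standardize rewards within ONE prompt's group.

    eps keeps the division finite when every completion gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary distribution."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def token_logprobs(logits_per_step, token_ids):
    """Log-prob of each sampled token.

    logits_per_step[t] must be the distribution that PREDICTS token_ids[t];
    an off-by-one shift here is a classic silent bug.
    """
    return [log_softmax(l)[tok] for l, tok in zip(logits_per_step, token_ids)]


def token_ratios(new_logprobs, old_logprobs, mask):
    """Per-token importance ratios, kept only where mask == 1 (completion tokens)."""
    return [
        math.exp(n - o)
        for n, o, m in zip(new_logprobs, old_logprobs, mask)
        if m
    ]


# Toy data (hypothetical): 3 steps over a 3-token vocabulary.
logits = [[1.0, 0.0, 0.0], [2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
tokens = [1, 0, 2]
mask = [0, 1, 1]  # first position is a prompt token and must not contribute

# Case 1: sampler and trainer score tokens identically, so every ratio is 1.
old_lp = token_logprobs(logits, tokens)
new_lp = token_logprobs(logits, tokens)
on_policy_ratios = token_ratios(new_lp, old_lp, mask)

# Case 2: same weights, but the sampler recorded log-probs at temperature 0.7
# (an assumed value) while the trainer recomputes them at temperature 1.0.
sampler_logits = [[x / 0.7 for x in step] for step in logits]
sampler_lp = token_logprobs(sampler_logits, tokens)
mismatched_ratios = token_ratios(new_lp, sampler_lp, mask)

# Group of 4 completions for one prompt with binary rewards.
advantages = group_advantages([1.0, 0.0, 0.0, 1.0])

print("on-policy ratios: ", on_policy_ratios)
print("mismatched ratios:", mismatched_ratios)
print("advantages:       ", advantages)
```

If the ratios look like the second case on the very first gradient step, the likely places to check are whether sampling and scoring use the same temperature and truncation settings, whether dropout is disabled in both passes, whether the two passes run at the same numerical precision, and whether masking and token alignment match. A ratio that is 1 at the first step and drifts afterwards is expected whenever several optimizer steps are taken on the same batch of rollouts.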
