You are given a simplified implementation of a GRPO (Group Relative Policy Optimization) training step for an RLHF-style policy model. The training is supposed to be strictly on-policy, meaning rollouts are generated by the same policy that is being updated.
Tasks:
- Walk through the end-to-end GRPO training flow: sampling prompts, generating rollouts, computing group-based advantages (rewards normalized relative to the other completions in the same group), computing the policy-gradient loss, and updating the policy (see the first sketch after this list).
- You find that training is unstable due to several straightforward implementation issues. Describe three common, easy-to-miss bugs in a GRPO/PPO-like training loop that would cause incorrect learning (e.g., wrong log-prob computation, incorrect masking, advantage-normalization mistakes, or mixing policies). For each, explain how you would detect it and how you would fix it (one such bug is illustrated in the second sketch below).
- During debugging you notice that the importance-sampling ratio ratio = exp(log π_θ(a|s) − log π_old(a|s)) is not always 1, even though you expected the method to be strictly on-policy. Explain why the ratio can deviate from 1 in practice. List the most likely causes and what to check in the training pipeline to confirm each one (the third sketch below shows a quick diagnostic).
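
For the first task, a minimal, self-contained sketch of one GRPO step is given below, assuming a toy PyTorch policy rather than a real language model. `ToyPolicy`, `reward_fn`, and every hyperparameter are illustrative placeholders, not part of the given codebase; what matters is the shape of the step: sample a group of completions per prompt, normalize rewards within the group to get advantages, compute masked per-token log-probs, and take a policy-gradient step.

```python
# Hedged sketch of one GRPO training step on a toy policy (all names and
# hyperparameters are assumptions for illustration, not the exercise's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, HIDDEN, GROUP_SIZE, MAX_NEW = 32, 64, 4, 8

class ToyPolicy(nn.Module):
    """Tiny causal stand-in for an LM: embed tokens, run a GRU, predict next-token logits."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.rnn = nn.GRU(HIDDEN, HIDDEN, batch_first=True)
        self.head = nn.Linear(HIDDEN, VOCAB)

    def forward(self, tokens):                                   # tokens: (B, T)
        hidden, _ = self.rnn(self.embed(tokens))
        return self.head(hidden)                                 # logits: (B, T, VOCAB)

def reward_fn(completions):
    # Placeholder reward (assumption): fraction of even token ids in the completion.
    return (completions % 2 == 0).float().mean(dim=-1)

def grpo_step(policy, optimizer, prompt):
    prompt_len = prompt.shape[1]

    # 1) Rollouts: sample a *group* of completions from the current policy.
    tokens = prompt.repeat(GROUP_SIZE, 1)                        # (G, P)
    with torch.no_grad():
        for _ in range(MAX_NEW):
            next_logits = policy(tokens)[:, -1, :]
            next_tok = torch.multinomial(F.softmax(next_logits, dim=-1), 1)
            tokens = torch.cat([tokens, next_tok], dim=1)
    completions = tokens[:, prompt_len:]                         # (G, C)

    # 2) Group-relative advantages: normalize rewards within the group.
    rewards = reward_fn(completions)                             # (G,)
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)    # (G,)

    # 3) Per-token log-probs of the *sampled* tokens under the current policy.
    #    Logits at position t predict the token at position t + 1.
    logits = policy(tokens)[:, :-1, :]
    targets = tokens[:, 1:]
    logp = torch.log_softmax(logits, dim=-1).gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    comp_mask = torch.zeros_like(logp)
    comp_mask[:, prompt_len - 1:] = 1.0                          # only completion tokens count

    # 4) Policy-gradient loss: broadcast each sequence's advantage over its tokens.
    loss = -(adv.unsqueeze(1) * logp * comp_mask).sum() / comp_mask.sum()

    # 5) Update the policy.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), rewards.mean().item()

if __name__ == "__main__":
    torch.manual_seed(0)
    policy = ToyPolicy()
    opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
    prompt = torch.randint(0, VOCAB, (1, 4))                     # one toy "prompt"
    print(grpo_step(policy, opt, prompt))
```

Note that the same `policy` both generates the rollouts and receives the gradient, which is what "strictly on-policy" means in the statement above.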
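
For the second task, the sketch below isolates one easy-to-miss bug from the list in that item: computing per-token log-probs without shifting logits against labels, and forgetting to mask out prompt (or padding) tokens. The function names, shapes, and `prompt_len` are assumptions for illustration; the comparison at the end is one cheap way to detect the misalignment.

```python
# Hedged illustration of a log-prob alignment/masking bug (shapes are assumptions).
import torch

def per_token_logprobs(logits, tokens):
    """Correct alignment: logits at position t score the token at position t + 1."""
    logp = torch.log_softmax(logits[:, :-1, :], dim=-1)               # (B, T-1, V)
    return logp.gather(-1, tokens[:, 1:].unsqueeze(-1)).squeeze(-1)   # (B, T-1)

def misaligned_logprobs(logits, tokens):
    """Common bug: no shift, so each position scores the token it was conditioned on."""
    logp = torch.log_softmax(logits, dim=-1)
    return logp.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)

if __name__ == "__main__":
    torch.manual_seed(0)
    B, T, V, prompt_len = 2, 6, 10, 3
    logits = torch.randn(B, T, V)
    tokens = torch.randint(0, V, (B, T))

    good = per_token_logprobs(logits, tokens)
    bad = misaligned_logprobs(logits, tokens)[:, 1:]
    # Detection: the two disagree everywhere, yet the buggy version still trains silently.
    print("aligned vs. misaligned differ:", not torch.allclose(good, bad))

    # Masking: only completion tokens (after the prompt) should contribute to the loss.
    mask = torch.zeros(B, T - 1)
    mask[:, prompt_len - 1:] = 1.0
    print("masked mean NLL per sequence:", -(good * mask).sum(-1) / mask.sum(-1))
```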
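
For the third task, one practical diagnostic is to recompute the log-probs of freshly sampled tokens under the same weights and watch how far the ratio drifts from 1. The toy model and dropout rate below are assumptions chosen so that a single cause (train-mode dropout in the loss pass vs. eval-mode generation) becomes visible; precision or kernel differences, padding and batching changes, and weights updated between generation and scoring would surface through the same check.

```python
# Hedged diagnostic: compare "sampling-time" and "training-time" log-probs under
# identical weights (toy model and settings are assumptions for illustration).
import torch
import torch.nn as nn

def sequence_logprob(model, tokens, mask):
    """Sum of per-token log-probs of the given tokens under the model."""
    logits = model(tokens)                                            # (B, T, V)
    logp = torch.log_softmax(logits[:, :-1, :].float(), dim=-1)       # cast up; precision is another drift source
    tok_logp = logp.gather(-1, tokens[:, 1:].unsqueeze(-1)).squeeze(-1)
    return (tok_logp * mask).sum(-1)                                  # (B,)

if __name__ == "__main__":
    torch.manual_seed(0)
    B, T, V = 4, 12, 50
    model = nn.Sequential(nn.Embedding(V, 32), nn.Dropout(0.1), nn.Linear(32, V))
    tokens = torch.randint(0, V, (B, T))      # pretend these were just sampled from `model`
    mask = torch.ones(B, T - 1)

    # "Sampling-time" log-probs: eval mode, the way generation usually runs.
    model.eval()
    with torch.no_grad():
        logp_old = sequence_logprob(model, tokens, mask)

    # "Training-time" log-probs: identical weights, but train mode re-enables dropout.
    model.train()
    logp_new = sequence_logprob(model, tokens, mask)

    log_ratio = (logp_new - logp_old).detach()
    print("importance ratio per sequence:", log_ratio.exp())
    print("max |log ratio|:", log_ratio.abs().max().item())
    # With truly identical forward passes this would be exactly 1; any drift here
    # points at dropout, precision/kernels, padding differences, or stale weights.
```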