This question evaluates debugging and implementation knowledge for on-policy reinforcement learning, focusing on GRPO/PPO-style training loops, importance-sampling ratios, log-prob computations, masking, and advantage normalization.
You are given a simplified implementation of a GRPO (Group Relative Policy Optimization) training step for an RLHF-style policy model. The training is supposed to be strictly on-policy, meaning rollouts are generated by the same policy that is being updated.
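The original implementation is not reproduced here, so below is a minimal sketch of what such a GRPO step typically looks like, written in numpy for clarity. The function name `grpo_loss` and the exact tensor shapes are assumptions, not the question's actual code; the sketch shows the pieces the tasks refer to: per-token log-probs, the importance-sampling ratio, a padding mask, and group-relative advantage normalization.

```python
import numpy as np

def grpo_loss(logp_new, logp_old, rewards, mask, clip_eps=0.2):
    """One GRPO-style surrogate loss over a group of G rollouts.

    logp_new, logp_old, mask : (G, T) per-token log-probs and validity mask
    rewards                  : (G,) one scalar reward per rollout
    """
    # Group-relative advantage: normalize rewards within the rollout group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    adv = adv[:, None]  # broadcast one advantage per rollout to all tokens

    # Importance-sampling ratio; exactly 1 if logp_new == logp_old,
    # i.e. if training is truly on-policy.
    ratio = np.exp(logp_new - logp_old)

    # PPO-style clipped surrogate objective (negated, since we minimize).
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    per_token = -np.minimum(unclipped, clipped)

    # Average only over valid (non-padding) tokens.
    return (per_token * mask).sum() / mask.sum()
```

When `logp_new` and `logp_old` are identical, the ratio is exactly 1 everywhere and the clipping is inactive; the tasks below concern the case where this identity fails in practice.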
Tasks:
1. You observe that the importance-sampling ratio pi_theta(a_t | s_t) / pi_theta_old(a_t | s_t) is not always 1, even though you expected the method to be strictly on-policy. Explain why the ratio can deviate from 1 in practice. List the most likely causes, and for each one describe what to check in the training pipeline to confirm it.
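A useful first diagnostic for this task is to recompute per-token log-probs under the current training-mode forward pass and compare them to the values cached at rollout time: if training is truly on-policy, the maximum |ratio - 1| should be at floating-point precision. The helper below is a hypothetical sketch (the name `check_on_policy_ratio` and its signature are assumptions); its docstring lists the usual culprits such a check can distinguish.

```python
import numpy as np

def check_on_policy_ratio(logp_rollout, logp_recomputed, mask, atol=1e-5):
    """Measure how far per-token importance ratios stray from 1.

    Large deviations typically point to one of:
      (1) stale rollout log-probs: extra optimizer steps (minibatch
          epochs) between generation and the loss computation;
      (2) train/eval-mode mismatch, e.g. dropout active in one pass;
      (3) numerical differences between inference and training code
          paths (dtype, fused kernels, KV-cache vs. full forward);
      (4) token/mask misalignment, so the two log-prob tensors are
          compared at shifted positions.
    Returns the max masked deviation and the full ratio tensor.
    """
    ratio = np.exp(logp_recomputed - logp_rollout)
    deviation = np.abs(ratio - 1.0) * mask
    return deviation.max(), ratio
```

In practice one would log `deviation.max()` (and a histogram of the ratios) every step; a value that grows with the number of inner optimizer epochs points to cause (1), while a constant nonzero offset suggests causes (2) or (3).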