Debug a GRPO training loop and explain ratios | Anthropic Interview Question