You are given a pretrained image diffusion model that generates images conditioned on text prompts (e.g., a text-to-image model). You now want to fine-tune this model using reinforcement learning with a GRPO-style (Group-Relative Policy Optimization) objective to better match a scalar reward signal (such as a learned preference model, a CLIP-based score, or some task-specific reward).
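For concreteness, a CLIP-based score is typically the cosine similarity between CLIP image and text embeddings. The snippet below is a minimal sketch of such a reward using Hugging Face's `CLIPModel`; the checkpoint name and the `clip_reward` helper are illustrative choices, not part of the original setup.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Illustrative reward model: cosine similarity between CLIP image and text embeddings.
# The checkpoint name and this helper are assumptions for the sketch, not a fixed choice.
_clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_reward(images, prompts):
    """Return one scalar reward per (image, prompt) pair."""
    inputs = _proc(text=prompts, images=images, return_tensors="pt", padding=True)
    img_emb = _clip.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = _clip.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return (img_emb * txt_emb).sum(dim=-1)  # shape: (batch,)
```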
The interviewer asks:
"Describe how you would set up and implement a GRPO-style training loop to fine-tune a diffusion model. In particular:
- How do you define states, actions, and rewards in this RL setting?
- How do you sample trajectories and compute advantages for GRPO?
- What loss/objective do you optimize, and how does it relate to policy gradients?
- Give high-level pseudocode for one training iteration."
Assume:
Explain the design and reasoning step-by-step, being explicit about how GRPO differs from a basic REINFORCE-style policy gradient.
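For illustration only (not a model answer), the sketch below shows what one GRPO-style training iteration might look like when the reverse denoising chain is treated as the trajectory: sample a group of images per prompt, score them with the scalar reward, normalize rewards within each group to get critic-free advantages, and take a clipped policy-gradient step on the per-step denoising log-probabilities. The helpers `sample_with_logprobs`, `denoise_logprob`, and `reward_fn` are hypothetical placeholders standing in for the diffusion sampler, the per-step log-probability under the current parameters, and the reward model.

```python
import torch

def grpo_iteration(policy, old_policy, reward_fn, prompts, optimizer,
                   group_size=8, clip_eps=0.2):
    """One GRPO-style update (sketch). All helpers are hypothetical:
    - old_policy.sample_with_logprobs(prompt, n): runs the reverse diffusion chain
      n times and returns trajectories, decoded images, and per-step log-probs.
    - policy.denoise_logprob(trajs): rescores the stored denoising steps under the
      current parameters.
    """
    all_ratios, all_advantages = [], []

    for prompt in prompts:
        # 1. Trajectories: each reverse chain x_T -> ... -> x_0 is one trajectory;
        #    the "action" at step t is the sampled denoised latent x_{t-1}.
        with torch.no_grad():
            trajs, images, old_logprobs = old_policy.sample_with_logprobs(
                prompt, n=group_size)                           # old_logprobs: (G, T)
            rewards = reward_fn(images, [prompt] * group_size)  # (G,)

        # 2. Group-relative advantage: normalize rewards within the group of G
        #    samples for the same prompt (no learned value/critic network).
        adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # (G,)

        # 3. Rescore the stored denoising steps under the current policy.
        new_logprobs = policy.denoise_logprob(trajs)         # (G, T)
        ratio = torch.exp(new_logprobs - old_logprobs)       # per-step importance ratio

        all_ratios.append(ratio)
        all_advantages.append(adv.unsqueeze(1).expand_as(ratio))

    ratio = torch.cat(all_ratios)      # (B*G, T)
    adv = torch.cat(all_advantages)    # (B*G, T), same advantage at every step

    # 4. PPO-style clipped surrogate; with ratio == 1 and no clipping this reduces
    #    to REINFORCE with a group-mean baseline.
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    loss = -torch.min(unclipped, clipped).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Relative to basic REINFORCE, the distinguishing pieces in this sketch are step 2 (a group-relative baseline in place of a learned critic) and step 4 (a clipped importance-ratio surrogate, which permits a few reuse epochs on trajectories from a slightly stale policy). GRPO as originally described also adds a KL penalty toward the frozen pretrained model; that term is omitted here for brevity.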