You are given a pretrained image diffusion model that generates images conditioned on text prompts (e.g., a text-to-image model). You now want to fine-tune this model using reinforcement learning with a GRPO-style (Group-Relative Policy Optimization) objective to better match a scalar reward signal (such as a learned preference model, a CLIP-based score, or some task-specific reward).
The interviewer asks:
"Describe how you would set up and implement a GRPO-style training loop to fine-tune a diffusion model. In particular:
- How do you define states, actions, and rewards in this RL setting?
- How do you sample trajectories and compute advantages for GRPO?
- What loss/objective do you optimize, and how does it relate to policy gradients?
- Give high-level pseudocode for one training iteration."
Assume:
- You can sample multiple images per text prompt from the current policy (the diffusion model).
- You can compute a scalar reward for each generated image.
- You have access to the log-probability (or an approximation) of the sampled images under the diffusion model.
Explain the design and reasoning step-by-step, being explicit about how GRPO differs from a basic REINFORCE-style policy gradient.
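As a starting point for the GRPO versus REINFORCE comparison, here is a minimal sketch of the one piece the assumptions already pin down: turning the scalar rewards of several images sampled for the same prompt into per-sample weights. The function names, the `eps` constant, and the toy reward values are illustrative assumptions, not part of the question. The sketch assumes the commonly used GRPO normalization (subtract the group mean, divide by the group standard deviation). It leaves out the policy ratio, clipping, and any KL term.

```python
import numpy as np


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within each prompt's group of samples.

    rewards: array-like of shape (num_prompts, group_size), one row per prompt.
    Returns an array of the same shape with per-row mean ~0 and std ~1.
    `eps` (an assumed constant) guards against division by zero when all
    samples in a group receive the same reward.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)


def reinforce_weights(rewards):
    """Plain REINFORCE without a baseline weights each sample by its raw reward."""
    return np.asarray(rewards, dtype=np.float64)


# Toy rewards (made up for illustration): 2 prompts, 4 sampled images each.
rewards = np.array([
    [0.2, 0.4, 0.6, 0.8],
    [5.0, 5.0, 5.0, 9.0],
])

adv = group_relative_advantages(rewards)
raw = reinforce_weights(rewards)
```

In this toy example the second prompt's rewards sit on a much larger scale than the first's. The raw REINFORCE weights inherit that scale, while the group-relative advantages are centered and rescaled per prompt, so no learned value function is needed to supply a baseline.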