You are given a pretrained image diffusion model that generates images conditioned on text prompts (e.g., a text-to-image model). You now want to fine-tune this model using reinforcement learning with a GRPO-style (Group-Relative Policy Optimization) objective to better match a scalar reward signal (such as a learned preference model, a CLIP-based score, or some task-specific reward).
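For concreteness, a CLIP-based score is typically the cosine similarity between CLIP image and text embeddings. The snippet below is a minimal sketch of such a reward using Hugging Face's `CLIPModel`; the checkpoint name and the `clip_reward` helper are illustrative choices, not part of the original setup.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Illustrative reward model: cosine similarity between CLIP image and text embeddings.
# The checkpoint name and this helper are assumptions for the sketch, not a fixed choice.
_clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_reward(images, prompts):
    """Return one scalar reward per (image, prompt) pair."""
    inputs = _proc(text=prompts, images=images, return_tensors="pt", padding=True)
    img_emb = _clip.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = _clip.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return (img_emb * txt_emb).sum(dim=-1)  # shape: (batch,)
```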
The interviewer asks:
"Describe how you would set up and implement a GRPO-style training loop to fine-tune a diffusion model. In particular:
- How do you define states, actions, and rewards in this RL setting?
- How do you sample trajectories and compute advantages for GRPO?
- What loss/objective do you optimize, and how does it relate to policy gradients?
- Give high-level pseudocode for one training iteration."
Assume:
Explain the design and reasoning step-by-step, being explicit about how GRPO differs from a basic REINFORCE-style policy gradient.
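For illustration only (not a model answer), the sketch below shows what one GRPO-style training iteration might look like when the reverse denoising chain is treated as the trajectory: sample a group of images per prompt, score them with the scalar reward, normalize rewards within each group to get critic-free advantages, and take a clipped policy-gradient step on the per-step denoising log-probabilities. The helpers `sample_with_logprobs`, `denoise_logprob`, and `reward_fn` are hypothetical placeholders standing in for the diffusion sampler, the per-step log-probability under the current parameters, and the reward model.

```python
import torch

def grpo_iteration(policy, old_policy, reward_fn, prompts, optimizer,
                   group_size=8, clip_eps=0.2):
    """One GRPO-style update (sketch). All helpers are hypothetical:
    - old_policy.sample_with_logprobs(prompt, n): runs the reverse diffusion chain
      n times and returns trajectories, decoded images, and per-step log-probs.
    - policy.denoise_logprob(trajs): rescores the stored denoising steps under the
      current parameters.
    """
    all_ratios, all_advantages = [], []

    for prompt in prompts:
        # 1. Trajectories: each reverse chain x_T -> ... -> x_0 is one trajectory;
        #    the "action" at step t is the sampled denoised latent x_{t-1}.
        with torch.no_grad():
            trajs, images, old_logprobs = old_policy.sample_with_logprobs(
                prompt, n=group_size)                           # old_logprobs: (G, T)
            rewards = reward_fn(images, [prompt] * group_size)  # (G,)

        # 2. Group-relative advantage: normalize rewards within the group of G
        #    samples for the same prompt (no learned value/critic network).
        adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # (G,)

        # 3. Rescore the stored denoising steps under the current policy.
        new_logprobs = policy.denoise_logprob(trajs)         # (G, T)
        ratio = torch.exp(new_logprobs - old_logprobs)       # per-step importance ratio

        all_ratios.append(ratio)
        all_advantages.append(adv.unsqueeze(1).expand_as(ratio))

    ratio = torch.cat(all_ratios)      # (B*G, T)
    adv = torch.cat(all_advantages)    # (B*G, T), same advantage at every step

    # 4. PPO-style clipped surrogate; with ratio == 1 and no clipping this reduces
    #    to REINFORCE with a group-mean baseline.
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    loss = -torch.min(unclipped, clipped).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Relative to basic REINFORCE, the distinguishing pieces in this sketch are step 2 (a group-relative baseline in place of a learned critic) and step 4 (a clipped importance-ratio surrogate, which permits a few reuse epochs on trajectories from a slightly stale policy). GRPO as originally described also adds a KL penalty toward the frozen pretrained model; that term is omitted here for brevity.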