
What difficulty level is this interview question?

This is a medium difficulty Machine Learning question, commonly asked during Technical Screen rounds at Google.

What role is this question designed for?

This question is commonly asked for Machine Learning Engineer candidates at Google during technical interviews.

Explain GRPO-style training for diffusion models

Quick Overview

This question evaluates understanding of reinforcement learning applied to diffusion-based generative models, covering policy optimization, reward modeling, and likelihood-aware scoring.

You are given a pretrained text-to-image diffusion model, i.e. one that generates images conditioned on text prompts. You now want to fine-tune this model using reinforcement learning with a GRPO-style (Group Relative Policy Optimization) objective so that its outputs better match a scalar reward signal (such as a learned preference model, a CLIP-based score, or some task-specific reward).

The interviewer asks:

"Describe how you would set up and implement a GRPO-style training loop to fine-tune a diffusion model. In particular:

How do you define states, actions, and rewards in this RL setting?

How do you sample trajectories and compute advantages for GRPO?

What loss/objective do you optimize, and how does it relate to policy gradients?

Give high-level pseudocode for one training iteration."

Assume:

You can sample multiple images per text prompt from the current policy (the diffusion model).
You can compute a scalar reward for each generated image.
You have access to the log-probability (or an approximation) of the sampled images under the diffusion model.

Explain the design and reasoning step-by-step, being explicit about how GRPO differs from a basic REINFORCE-style policy gradient.
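
As a starting point for the comparison the question asks for, here is a minimal NumPy sketch of the two pieces that usually separate a GRPO-style update from basic REINFORCE: advantages normalized within each prompt's group of samples, and a PPO-style clipped probability ratio. It is a forward-only illustration under stated assumptions, not a reference implementation. The function names, the `(num_prompts, group_size)` array layout, and the `clip_eps=0.2` default are all hypothetical choices made for this example, and the KL penalty toward a reference model that GRPO-style objectives often include is left out for brevity.

```python
import numpy as np


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within each prompt's group of samples.

    rewards: array of shape (num_prompts, group_size), one row per prompt.
    Each reward is centred and scaled by its own group's mean and std,
    so no learned value function (critic) is required as a baseline.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)


def reinforce_loss(logp_new, rewards):
    """Basic REINFORCE surrogate: raw reward weights, no baseline, no clipping."""
    return -np.mean(np.asarray(rewards) * np.asarray(logp_new))


def grpo_clipped_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate driven by group-relative advantages.

    logp_new: log-probabilities under the policy being updated.
    logp_old: log-probabilities under the policy that drew the samples.
    For a diffusion model these would typically be per-denoising-step
    log-probabilities (or an approximation); that is an assumption here.
    """
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Take the pessimistic (smaller) objective, then negate to get a loss.
    return -np.mean(np.minimum(unclipped, clipped))


if __name__ == "__main__":
    # Two prompts, three sampled images each (made-up reward values).
    rewards = np.array([[1.0, 2.0, 3.0], [0.0, 0.0, 4.0]])
    adv = group_relative_advantages(rewards)
    logp = np.zeros_like(rewards)  # placeholder log-probabilities
    print(adv)
    print(grpo_clipped_loss(logp, logp, adv))
```

In an actual training loop the same arithmetic would be written in an autodiff framework so that gradients flow through `logp_new`. The sketch only shows how the group statistics stand in for a baseline and how clipping bounds the size of each policy update.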

