PracHub

GRPO Deep Dive: Critic-Free RL, Parallelism, MLA, and Reward Design for a Reasoning LLM

Last updated: Jul 1, 2026

Quick Overview

This question assesses understanding of the reinforcement learning algorithms used to post-train large reasoning language models, covering critic-free policy optimization (GRPO), distributed training parallelism, attention design (MLA), and reward design. It is commonly asked in machine learning engineering interviews to gauge practical, systems-level depth rather than textbook familiarity with RL theory, and it probes both conceptual grasp and applied reasoning about real training failure modes.

Company: Amazon

Role: Machine Learning Engineer

Category: Machine Learning

Difficulty: hard

Interview Round: Technical Screen

You are interviewing for an applied-science / ML-engineer role on a team that trains large **reasoning** language models with reinforcement learning. The interviewer runs a deep dive on **Group Relative Policy Optimization (GRPO)** and the training stack around it (large-scale parallelism, the attention design, and reward design) and probes for hands-on detail, not textbook summaries.

Walk through the algorithm and the system that surrounds it: explain GRPO and why it drops the critic, the parallelism strategies (including DualPipe and how nodes talk to each other), Multi-head Latent Attention (MLA), how you would design the reward, and the failure modes you would expect in practice and how you would fix them. Reason from the math and, wherever you can, from what actually happens during a training run.

### Constraints & Assumptions

- Post-training (RL) stage of a decoder-only transformer, on the order of tens of billions of parameters or more, possibly a Mixture-of-Experts (MoE) model. RL may start from a base or an SFT'd checkpoint.
- Target tasks are reasoning tasks with **verifiable** answers (math with a known result, code with unit tests), so a correctness signal is available without a human in the loop.
- Hardware is a multi-node GPU cluster: high-bandwidth NVLink/NVSwitch **within** a node, InfiniBand/RDMA **between** nodes.
- Rollouts (generation) and gradient updates may run in separate engines (e.g., a fast inference engine for sampling, a training engine for updates).
- "Better" means stable training, sample/compute efficiency, and final reasoning quality, not just a lower loss number.

### Clarifying Questions to Ask

- Are we starting RL from the base model or from an SFT checkpoint, and is the goal a research result or a model with production latency/throughput budgets too?
- Dense or MoE? Roughly how many parameters, and how many GPUs and nodes are available?
- What reward signal do we actually have: only verifiable correctness, or human preferences / a learned reward model as well?
- What is the rollout setup: in-training generation, or a separate inference engine with periodic weight sync? How stale can rollout policy weights get?
- What is the context length we need to support at train and inference time (this drives the attention/KV-cache decisions)?
- Are there constraints on what we can change: is the architecture fixed, or are attention design and parallelism on the table?

### Part 1: GRPO and why the critic is removed

Explain GRPO. What does it solve relative to PPO, and *why* does it remove the value/critic network? Give the advantage formulation and the objective, and lay out the pros and cons.

```hint Baseline without a value model
PPO subtracts a learned value baseline $V(s)$ to reduce variance. GRPO needs a baseline too, but it gets one *for free* by sampling a **group** of $G$ completions per prompt and comparing each completion against its own group. Write the advantage in terms of the group's reward statistics.
```

```hint What the critic costs you
A PPO critic is roughly a second model the size of the policy: extra memory, extra forward/backward, and its own training instability. Ask what per-token signal that critic is even estimating when the reward only arrives at the *end* of a long generation.
```

### Part 2: Parallelism and inter-node communication

Enumerate the parallelism strategies for training a model at this scale and map each to the hardware. Then explain **DualPipe** and how nodes actually communicate.

```hint The four axes
Separate *what* gets split: the batch (data parallel), a single layer's matmuls (tensor parallel), the stack of layers (pipeline parallel), and the experts of an MoE (expert parallel). Each implies a different collective (all-reduce, all-to-all, point-to-point) and a different sensitivity to link bandwidth.
```

```hint What DualPipe is buying
Pipeline parallelism wastes time in "bubbles," and MoE adds expensive all-to-all token routing. DualPipe attacks both by scheduling micro-batches from *both ends* of the pipeline and overlapping the dispatch/combine communication with compute. Think about what it costs in memory to run the pipeline in two directions.
```

### Part 3: Multi-head Latent Attention (MLA)

Describe MLA. What problem does it target, how does the low-rank KV compression work, and why is the RoPE component *decoupled*? Compare against MHA, MQA, and GQA.

```hint Follow the KV cache
The decode-time bottleneck is the KV cache: standard MHA stores full $K,V$ for every head at every position. MLA caches a single small **latent** vector per token and up-projects to per-head $K,V$ on the fly. Ask what dimension you are actually storing now.
```

```hint Why RoPE has to be split out
The up-projection matrices can be folded into the query/output projections to avoid recomputation, but RoPE is position-dependent and does **not** commute with that folding. That tension is exactly why MLA carries a small separate RoPE-bearing key alongside the compressed content key.
```

### Part 4: Reward design

How would you design the reward for RL training of a reasoning model? Argue rule-based vs. a learned reward model, and process vs. outcome rewards.

```hint Where reward hacking comes from
A neural reward model is itself a function the policy can over-optimize; at scale the policy will find its blind spots. Verifiable tasks let you sidestep that. Think about what you can check deterministically (final answer, test pass, output format) and combine those into the reward.
```

### Part 5: Training failure modes and fixes

What goes wrong in practice when training with GRPO, and how do you diagnose and fix it? Distinguish surface-level guesses from things you only learn by actually running the training.

```hint Watch the signals, not just the loss
Name the curves you would actually watch (reward, KL to reference, response length, entropy, fraction of "all-tie" groups, grad norm) and tie each pathology (policy collapse, reward hacking, verbosity blow-up, entropy collapse, dead groups) to the signal that catches it and the knob that fixes it.
```

### Follow-up Questions

- GRPO normalizes advantages by the group's standard deviation and divides the per-sample loss by response length. Both have been argued to *introduce* bias. Where does each bias come from, and how would you remove it while keeping the critic-free structure?
- With MLA, the up-projection can be absorbed into neighboring matrices at inference. Show concretely which matrices absorb into which, and why the decoupled RoPE key is the one piece that cannot be absorbed.
- You observe reward climbing steadily but held-out accuracy flat or dropping. Walk through your diagnosis: how do you tell reward hacking from a genuine train/eval gap, and what do you change first?
- For an MoE model, how does the choice of expert-parallel degree and node-limited routing interact with DualPipe's overlap, and what happens to your effective batch size and load balance if experts are imbalanced across nodes?
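
For Part 1, it can help to have the group-relative advantage and the clipped objective written out. The block below is a sketch of the commonly cited GRPO formulation, not something stated in the question itself. The notation is assumed here: $q$ is the prompt, $o_1,\dots,o_G$ are the $G$ sampled completions with scalar rewards $r_1,\dots,r_G$, $\epsilon$ is the clip range, $\beta$ is the KL weight, and $\pi_{\text{ref}}$ is the reference policy.

```latex
% Group-relative advantage: the same value is shared by every token t of completion o_i
\hat{A}_{i,t} = \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G)}

% Per-token importance ratio against the policy that generated the rollouts
\rho_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid q,\, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q,\, o_{i,<t})}

% Clipped surrogate with a per-token KL penalty; no value network appears anywhere
J_{\text{GRPO}}(\theta) = \mathbb{E}\left[
  \frac{1}{G}\sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|}
  \Big( \min\big(\rho_{i,t}\,\hat{A}_{i,t},\
        \operatorname{clip}(\rho_{i,t},\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_{i,t}\big)
        - \beta\, D_{\mathrm{KL}}\big[\pi_\theta \,\|\, \pi_{\text{ref}}\big] \Big)
\right]
```

The division by $\operatorname{std}(\cdot)$ and the $1/|o_i|$ factor are the two normalizations that the first follow-up question asks about.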
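
The "baseline without a value model" hint in Part 1 can be made concrete with a few lines of NumPy. This is a minimal sketch under stated assumptions: the function name `group_relative_advantages`, the `scale_by_std` switch, and the `eps` constant are illustrative choices, and rewards are assumed to be one scalar per completion.

```python
import numpy as np

def group_relative_advantages(rewards, scale_by_std=True, eps=1e-6):
    """Compute critic-free advantages from a (num_prompts, G) reward array.

    Each row holds the rewards of the G completions sampled for one prompt.
    The group mean plays the role of the baseline that PPO would get from V(s).
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    centered = rewards - rewards.mean(axis=1, keepdims=True)
    if not scale_by_std:
        # Mean-only baseline: drops the std division discussed in the follow-ups.
        return centered
    std = rewards.std(axis=1, keepdims=True)
    # eps (an assumed small constant) avoids dividing by zero when a group ties.
    return centered / (std + eps)

rewards = [
    [1.0, 0.0, 0.0, 1.0],  # mixed group: carries a learning signal
    [1.0, 1.0, 1.0, 1.0],  # all-tie group: every advantage is zero
]
adv = group_relative_advantages(rewards)
print(adv)
```

The second row shows why "dead groups" matter in Part 5: when every completion gets the same reward, the prompt contributes no gradient at all.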
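
For Part 2, some back-of-envelope arithmetic helps when mapping parallelism axes to hardware. The sketch below uses two standard textbook estimates: the idle fraction of a synchronous GPipe/1F1B-style pipeline, and the per-GPU traffic of a ring all-reduce. The function names are hypothetical, and the code deliberately does not model DualPipe's bidirectional schedule or expert parallelism. It only gives the baseline numbers that those techniques try to improve.

```python
def pipeline_bubble_fraction(num_stages, num_microbatches):
    """Idle fraction of a synchronous pipeline: (p - 1) / (m + p - 1)."""
    p, m = num_stages, num_microbatches
    return (p - 1) / (m + p - 1)

def ring_allreduce_bytes_per_gpu(payload_bytes, num_gpus):
    """Bytes each GPU sends in a ring all-reduce: 2 * (n - 1) / n * payload."""
    return 2 * (num_gpus - 1) / num_gpus * payload_bytes

def data_parallel_degree(world_size, tensor_parallel, pipeline_parallel):
    """Assumes the simple factorization world = DP * TP * PP (no expert axis)."""
    model_parallel = tensor_parallel * pipeline_parallel
    if world_size % model_parallel != 0:
        raise ValueError("world_size must be divisible by TP * PP")
    return world_size // model_parallel

# Illustrative numbers only: 64 GPUs, TP=8 inside a node, PP=4 across nodes.
dp = data_parallel_degree(world_size=64, tensor_parallel=8, pipeline_parallel=4)
print("data-parallel replicas:", dp)
print("bubble with 8 microbatches:", pipeline_bubble_fraction(4, 8))
print("bubble with 32 microbatches:", pipeline_bubble_fraction(4, 32))
print("all-reduce bytes per GPU:", ring_allreduce_bytes_per_gpu(8e9, 8))
```

More micro-batches shrink the bubble but raise activation memory, which is the trade-off the DualPipe hint points at.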
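
The "follow the KV cache" hint in Part 3 comes down to counting what is stored per token per layer. The sketch below compares the four attention variants. The function name and the example dimensions (128 heads of size 128, 8 KV groups, a 512-wide latent, a 64-wide decoupled RoPE key) are illustrative assumptions rather than values given in the question.

```python
def kv_cache_elems_per_token(kind, n_heads, head_dim,
                             n_kv_groups=None, latent_dim=None, rope_dim=None):
    """Number of cached elements per token per layer for one attention variant."""
    if kind == "mha":   # full K and V for every head
        return 2 * n_heads * head_dim
    if kind == "gqa":   # K and V shared within each group of query heads
        return 2 * n_kv_groups * head_dim
    if kind == "mqa":   # a single K and V shared by all heads
        return 2 * head_dim
    if kind == "mla":   # one compressed latent plus the decoupled RoPE key
        return latent_dim + rope_dim
    raise ValueError(f"unknown attention kind: {kind}")

# Hypothetical configuration, chosen only to make the ratios visible.
cfg = dict(n_heads=128, head_dim=128)
sizes = {
    "mha": kv_cache_elems_per_token("mha", **cfg),
    "gqa": kv_cache_elems_per_token("gqa", n_kv_groups=8, **cfg),
    "mqa": kv_cache_elems_per_token("mqa", **cfg),
    "mla": kv_cache_elems_per_token("mla", latent_dim=512, rope_dim=64, **cfg),
}
for kind, elems in sizes.items():
    print(f"{kind}: {elems} elements per token per layer")
```

Under these assumed dimensions MLA stores far less than MHA while still reconstructing distinct per-head keys and values, which MQA and GQA give up.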
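
The second follow-up question (which matrices absorb, and why the RoPE key cannot) can be checked numerically. The sketch below works on a single head with small random matrices. All names (`W_q`, `W_uk`, `c_kv`, `rope_matrix`, `merged_query_matrix`) are illustrative, and the setup is a simplified stand-in for MLA rather than a full implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent, d_head = 32, 8, 16

W_q = rng.standard_normal((d_head, d_model))    # query projection (one head)
W_uk = rng.standard_normal((d_head, d_latent))  # key up-projection (same head)
h_query = rng.standard_normal(d_model)          # hidden state of the query token
c_kv = rng.standard_normal(d_latent)            # cached latent of a past token

# Naive path: up-project the latent to a full key, then take the dot product.
score_naive = (W_q @ h_query) @ (W_uk @ c_kv)

# Absorbed path: fold W_uk into the query side once, then attend to the latent.
W_q_absorbed = W_uk.T @ W_q                     # shape (d_latent, d_model)
score_absorbed = (W_q_absorbed @ h_query) @ c_kv

def rope_matrix(pos, dim, base=10000.0):
    """Block-diagonal RoPE rotation for one position (dim must be even)."""
    R = np.zeros((dim, dim))
    for i in range(dim // 2):
        theta = pos * base ** (-2.0 * i / dim)
        c, s = np.cos(theta), np.sin(theta)
        R[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, -s], [s, c]]
    return R

def merged_query_matrix(offset):
    """Matrix that would have to be precomputed if RoPE sat on the content key."""
    return W_uk.T @ rope_matrix(offset, d_head).T @ W_q

# With RoPE applied to both sides, the score depends on the relative offset n - m.
m, n = 3, 7
score_rope = (rope_matrix(m, d_head) @ (W_q @ h_query)) @ \
             (rope_matrix(n, d_head) @ (W_uk @ c_kv))
score_rope_merged = (merged_query_matrix(n - m) @ h_query) @ c_kv

print(np.isclose(score_naive, score_absorbed))                       # absorption is exact
print(np.allclose(merged_query_matrix(1), merged_query_matrix(5)))   # offset-dependent
```

Without RoPE a single precomputed matrix suffices. With RoPE on the compressed key, the merged matrix changes with every relative offset, which is the tension the decoupled RoPE key resolves.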
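
For Part 4, a rule-based outcome reward on a verifiable task can be as small as the sketch below. Everything specific here is an assumption made for illustration: the `<answer>...</answer>` output format, the exact-string match, and the `format_bonus` weight. A real verifier would normalize answers (for example, equivalent fractions) or run unit tests instead.

```python
import re

# Hypothetical output format: the model must wrap its final answer in one tag pair.
ANSWER_PATTERN = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def rule_based_reward(completion, reference_answer, format_bonus=0.1):
    """Deterministic reward: small bonus for valid format, 1.0 for a correct answer.

    Completions with zero or multiple answer tags get nothing, so the policy
    cannot collect reward by emitting many candidate answers.
    """
    matches = ANSWER_PATTERN.findall(completion)
    if len(matches) != 1:
        return 0.0
    predicted = matches[0].strip()
    correct = predicted == reference_answer.strip()
    return format_bonus + (1.0 if correct else 0.0)

print(rule_based_reward("Reasoning... <answer> 42 </answer>", "42"))
print(rule_based_reward("Reasoning... <answer>41</answer>", "42"))
print(rule_based_reward("The answer is 42.", "42"))
print(rule_based_reward("<answer>41</answer><answer>42</answer>", "42"))
```

Because every branch is a deterministic check, there is no learned reward model for the policy to over-optimize, although the format rule itself can still be gamed if it is too loose.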
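
Several of the signals named in the Part 5 hint can be computed directly from a batch of rollouts. The sketch below covers mean reward, the fraction of "all-tie" groups, mean response length, and the fraction of responses that hit the length cap. The function name `rollout_health` and the dictionary keys are hypothetical. KL, entropy, and grad norm need model outputs and are left out.

```python
import numpy as np

def rollout_health(rewards, response_lengths, max_length):
    """Summarize one rollout batch; both inputs have shape (num_prompts, G)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    lengths = np.asarray(response_lengths)
    # A group whose rewards are all equal yields zero advantage for every sample.
    group_range = rewards.max(axis=1) - rewards.min(axis=1)
    return {
        "mean_reward": float(rewards.mean()),
        "all_tie_fraction": float((group_range == 0).mean()),
        "mean_length": float(lengths.mean()),
        "truncated_fraction": float((lengths >= max_length).mean()),
    }

# Illustrative batch: 4 prompts, G = 4 completions each, length cap of 512 tokens.
rewards = [
    [1, 0, 0, 1],
    [1, 1, 1, 1],   # too easy: all correct
    [0, 0, 0, 0],   # too hard: all wrong
    [0, 1, 0, 0],
]
lengths = [
    [100, 100, 100, 100],
    [100, 100, 100, 100],
    [512, 512, 100, 100],
    [100, 100, 100, 100],
]
stats = rollout_health(rewards, lengths, max_length=512)
print(stats)
```

A rising `all_tie_fraction` suggests the prompt mix is too easy or too hard for the current policy, and a rising `truncated_fraction` alongside growing `mean_length` is one way a verbosity blow-up shows up.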

Related Interview Questions

  • LLM Fundamentals: Tokenization Design and KL-Regularized SFT - Amazon (medium)
  • Predicting the Next Elevator Call Location - Amazon (medium)
  • Explain Transformer and MoE Fundamentals - Amazon (medium)
  • Explain Core ML Interview Concepts - Amazon (hard)
  • Evaluate NLP Classification Models - Amazon (easy)
Jun 21, 2026, 12:00 AM
