RL Questions (conceptual + practical)
You are training an RL agent for driving.
Part A — Policy optimization
- Explain the difference between PPO and GRPO (as used in modern RL/RLHF-style training).
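One way to ground the comparison: PPO estimates per-step advantages with a learned critic (typically via GAE), while GRPO drops the critic and instead normalizes rewards within a group of responses sampled for the same prompt. Below is a minimal sketch of the two advantage computations, assuming PyTorch; the function names are illustrative and done-flag handling is omitted.

```python
import torch

def ppo_advantages(rewards, values, gamma=0.99, lam=0.95):
    """PPO-style: advantages come from a learned critic via GAE.

    rewards: [T] per-step rewards for one trajectory.
    values:  [T + 1] critic estimates, including the bootstrap value
             for the final state. (Episode termination masking omitted.)
    """
    T = rewards.shape[0]
    adv = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        gae = delta + gamma * lam * gae
        adv[t] = gae
    return adv

def grpo_advantages(group_rewards, eps=1e-8):
    """GRPO-style: no critic. Sample a group of responses for the same
    prompt, score each with a scalar reward, and use the group-normalized
    reward as the advantage shared by every token of that response.

    group_rewards: [group_size] scalar rewards, one per sampled response.
    """
    return (group_rewards - group_rewards.mean()) / (group_rewards.std() + eps)
```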
Part B — Reward design choice
- In RL, when would you choose a heuristic reward vs. a learned reward (e.g., a reward model trained from preferences)? What are the tradeoffs?
Part C — Implement a speed-limit reward
You are given batched 2D trajectories sampled at 10 Hz:
- Input: traj of shape [batch, num_waypoint, 2], representing (x, y) positions over time.
- The environment has a speed limit.
- Design a reward (or penalty) that discourages speeding. Provide at least two variants (a sketch follows these questions):
  - (i) penalize time spent speeding
  - (ii) penalize how much the speed exceeds the limit
- Discuss failure cases if the speed limit is time-varying (e.g., first 2 seconds: 50 mph, next 3 seconds: 30 mph) but your reward implementation assumes a single constant limit.
- More generally: what happens if the speed limit changes but is not represented in the agent's observation/state? How would you fix the setup?
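A minimal sketch of both variants, assuming PyTorch, positions in meters, and a placeholder limit of roughly 30 mph; the function names and constants are illustrative, not part of the prompt. The last helper shows one fix for a time-varying limit: pass a per-step limit tensor instead of a scalar (and, correspondingly, expose the current limit in the agent's observation so the policy remains Markovian).

```python
import torch

SPEED_LIMIT_MPS = 13.4  # assumed placeholder limit in m/s (~30 mph)
DT = 0.1                # trajectories are sampled at 10 Hz

def speeds_from_traj(traj: torch.Tensor) -> torch.Tensor:
    """Finite-difference speed per step.

    traj: [batch, num_waypoint, 2] (x, y) positions, assumed in meters.
    Returns: [batch, num_waypoint - 1] speeds in m/s.
    """
    deltas = traj[:, 1:, :] - traj[:, :-1, :]      # [batch, T-1, 2]
    return deltas.norm(dim=-1) / DT                # [batch, T-1]

def speeding_time_penalty(traj, limit=SPEED_LIMIT_MPS):
    """Variant (i): penalize the fraction of time spent above the limit."""
    speed = speeds_from_traj(traj)
    return -(speed > limit).float().mean(dim=-1)   # [batch]

def speeding_magnitude_penalty(traj, limit=SPEED_LIMIT_MPS):
    """Variant (ii): penalize how far the speed exceeds the limit (hinge)."""
    speed = speeds_from_traj(traj)
    excess = torch.clamp(speed - limit, min=0.0)
    return -excess.mean(dim=-1)                    # [batch]

def speeding_magnitude_penalty_tv(traj, limit_per_step):
    """Time-varying limit: limit_per_step is [T-1] or [batch, T-1],
    broadcast against the per-step speeds instead of a single scalar."""
    speed = speeds_from_traj(traj)
    excess = torch.clamp(speed - limit_per_step, min=0.0)
    return -excess.mean(dim=-1)
```

Note the qualitative difference: variant (i) is a flat per-step penalty, so barely speeding and grossly speeding are punished equally, while variant (ii) scales with the violation and gives a smoother optimization signal.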