Design RL reward for speed limits


This is a Machine Learning interview question from Tesla for Machine Learning Engineer roles.


Question

RL Questions (conceptual + practical)

You are training an RL agent for driving.

Part A — Policy optimization

Explain the difference between PPO and GRPO (as used in modern RL/RLHF-style training).
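The core difference can be sketched in a few lines of NumPy, assuming a single-step (bandit-style) setting to keep it short. Both methods optimize the same clipped surrogate objective. PPO computes advantages against a learned critic V(s) (usually with GAE, omitted here). GRPO (Group Relative Policy Optimization) drops the critic: it samples a group of outputs for the same prompt or scenario and z-scores their rewards within that group. The function names are illustrative and do not come from any library.

```python
import numpy as np

def clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate objective shared by PPO and GRPO (to be maximized)."""
    ratio = np.exp(logp_new - logp_old)                  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return np.mean(np.minimum(unclipped, clipped))       # pessimistic (lower) bound

def ppo_advantages(returns, values):
    """PPO: the baseline comes from a learned critic V(s) (GAE omitted for brevity)."""
    return np.asarray(returns, dtype=float) - np.asarray(values, dtype=float)

def grpo_advantages(group_rewards, eps=1e-8):
    """GRPO: no critic; rewards of G samples for the same prompt are z-scored within the group."""
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```

For example, grpo_advantages([1.0, 2.0, 3.0]) returns roughly [-1.22, 0.0, 1.22]. Dropping the critic saves a second large network, but it requires several rollouts per prompt. GRPO also typically applies the KL penalty to a reference policy directly in the loss, rather than folding it into the reward.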

Part B — Reward design choice

In RL, when would you choose a heuristic reward vs a learned reward (e.g., reward model from preferences)? What are the tradeoffs?
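Here is a sketch of what each option looks like in code. The heuristic reward is a hand-written function of quantities the simulator can measure; its terms and weights below are hypothetical. A learned reward model is commonly fit to pairwise preferences with a Bradley-Terry loss, -log sigmoid(r_chosen - r_rejected), where r_chosen and r_rejected are the model's scalar scores for the preferred and the rejected trajectory.

```python
import numpy as np

def heuristic_reward(progress_m, collided, jerk, w_progress=1.0, w_collision=100.0, w_jerk=0.1):
    """Hand-designed reward: transparent and cheap, but every term and weight is a manual choice.
    The terms and weights here are hypothetical examples."""
    return w_progress * progress_m - w_collision * float(collided) - w_jerk * jerk ** 2

def bradley_terry_loss(r_chosen, r_rejected):
    """Pairwise preference loss for training a reward model:
    mean of -log sigmoid(r_chosen - r_rejected), computed stably as logaddexp(0, -margin)."""
    margin = np.asarray(r_chosen, dtype=float) - np.asarray(r_rejected, dtype=float)
    return np.mean(np.logaddexp(0.0, -margin))
```

The heuristic is easy to inspect and debug, but the policy can exploit gaps in whatever it forgot to specify. A learned reward can capture preferences that are hard to write down. However, it needs labeled comparisons, and the policy can exploit it on states far from its training data.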

Part C — Implement a speed-limit reward

You are given batched 2D trajectories sampled at 10 Hz:

Input: traj of shape [batch, num_waypoint, 2], representing (x, y) positions over time.
The environment has a speed limit.

Design a reward (or penalty) that discourages speeding. Provide at least two variants:
- (i) penalize time spent speeding
- (ii) penalize how much the speed exceeds the limit
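Both variants fit in a minimal NumPy sketch. It assumes positions are in meters and the limit is given in m/s; the question does not specify units, so a limit in mph would need converting (1 mph = 0.44704 m/s). At 10 Hz, consecutive waypoints are dt = 0.1 s apart, and finite differences turn num_waypoint positions into num_waypoint - 1 speeds. The function names are hypothetical.

```python
import numpy as np

DT = 0.1  # 10 Hz sampling -> 0.1 s between waypoints

def speeds_from_traj(traj, dt=DT):
    """traj: [batch, num_waypoint, 2] positions (assumed meters) -> [batch, num_waypoint - 1] speeds (m/s)."""
    deltas = np.diff(traj, axis=1)                 # displacement between consecutive waypoints
    return np.linalg.norm(deltas, axis=-1) / dt

def time_speeding_penalty(traj, speed_limit, dt=DT):
    """Variant (i): minus the number of seconds spent above the limit, per trajectory -> [batch]."""
    v = speeds_from_traj(traj, dt)
    return -(v > speed_limit).sum(axis=1) * dt

def excess_speed_penalty(traj, speed_limit, dt=DT, power=2):
    """Variant (ii): penalize how far speed exceeds the limit, integrated over time -> [batch].
    speed_limit may be a scalar or any array broadcastable to [batch, num_waypoint - 1]."""
    v = speeds_from_traj(traj, dt)
    excess = np.maximum(v - speed_limit, 0.0)      # zero whenever the car is at or under the limit
    return -(excess ** power).sum(axis=1) * dt
```

Variant (i) is a step function of speed: it treats 1 m/s over the limit the same as 10 m/s over and says nothing about how far to slow down. Variant (ii) grows with the size of the violation. With power=2, large violations cost disproportionately more; power=1 gives a linear penalty.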
Discuss failure cases if the speed limit is time-varying (e.g., first 2 seconds: 50 mph, next 3 seconds: 30 mph) but your reward implementation assumes a single constant limit.
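The failure is easy to demonstrate numerically. The sketch below assumes the limit profile is available as a per-step array aligned with the speed samples (segment i uses the limit at step i). The profile covers 20 steps at 50 mph, then 30 steps at 30 mph at 10 Hz. The sketch compares that per-step limit against either constant. A car holding 40 mph is legal in the first zone and 10 mph over in the second. A constant 50 mph limit reports no violation at all, while a constant 30 mph limit also penalizes the legal first 2 seconds.

```python
import numpy as np

MPH_TO_MPS = 0.44704  # exact conversion factor
DT = 0.1              # 10 Hz

def excess_speed_penalty(speeds, limits, dt=DT):
    """speeds: [batch, T] in m/s; limits: scalar or broadcastable to [batch, T] (may vary per step)."""
    excess = np.maximum(speeds - limits, 0.0)
    return -(excess ** 2).sum(axis=-1) * dt

# Per-step limit profile: first 2 s at 50 mph, next 3 s at 30 mph (T = 50 steps).
limits = np.concatenate([np.full(20, 50.0), np.full(30, 30.0)]) * MPH_TO_MPS

# One trajectory that holds 40 mph for all 5 s.
speeds = np.full((1, 50), 40.0 * MPH_TO_MPS)

correct = excess_speed_penalty(speeds, limits)                   # penalizes only the 30 mph zone
const_high = excess_speed_penalty(speeds, 50.0 * MPH_TO_MPS)     # misses the violation entirely
const_low = excess_speed_penalty(speeds, 30.0 * MPH_TO_MPS)      # also penalizes the legal 50 mph zone
```

With the constant-high limit, the agent learns that speeding through the 30 mph zone is free. With the constant-low limit, it learns to crawl where it should not. A per-step limit fixes the reward computation itself; whether the policy can act on it depends on what the policy observes.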
More generally: what happens if the speed limit changes but is not represented in the agent’s observation/state? How would you fix the setup?
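If the limit drives the reward but is absent from the observation, two states that look identical to the policy can receive different rewards. The problem then becomes partially observed. The agent can only learn a behavior averaged over the limits it has seen, and it cannot slow down ahead of a limit change. One common fix is to expose the current limit, plus some lookahead, in the observation. Here is a minimal sketch in which ego_state and the lookahead offsets are hypothetical placeholders, and the limit profile is assumed to be available (for example, from map data).

```python
import numpy as np

def build_observation(ego_state, limits, t, lookahead_steps=(0, 10, 30)):
    """Append the current and upcoming speed limits to the policy's observation.

    ego_state: 1-D array the policy already observes (hypothetical placeholder).
    limits: per-step speed-limit profile, shape [T] (assumed available, e.g. from map data).
    t: current step index; offsets of 10 and 30 steps are 1 s and 3 s ahead at 10 Hz.
    Indices past the end of the profile are clamped to the last known limit.
    """
    idx = np.minimum(t + np.asarray(lookahead_steps), len(limits) - 1)
    return np.concatenate([np.asarray(ego_state, dtype=float), limits[idx]])

# Same profile as the question: 2 s at 50 mph, then 3 s at 30 mph, sampled at 10 Hz.
limits = np.array([50.0] * 20 + [30.0] * 30)
obs = build_observation(np.array([1.0, 2.0]), limits, t=15)
# obs -> [1., 2., 50., 30., 30.]: the upcoming drop to 30 mph is visible 0.5 s before it happens.
```

The reward should then read the same per-step limit that the policy observes, which keeps the two consistent. Randomizing the limit profile across training episodes can also help keep the policy from memorizing a single schedule.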

