Design RL reward for speed limits
Company: Tesla
Role: Machine Learning Engineer
Category: Machine Learning
Difficulty: hard
Interview Round: Technical Screen
## RL for Autonomous Driving — Conceptual + Practical
You are training a reinforcement-learning agent to drive a vehicle. The interview moves from policy-optimization theory, to reward-design judgment, to a hands-on reward implementation and the subtle failure modes it exposes.
### Constraints & Assumptions
- Trajectories arrive as batched 2D position tensors `traj` of shape `[batch, num_waypoint, 2]`, where the last axis is the `(x, y)` position.
- Sampling frequency is **10 Hz**, so the time between consecutive waypoints is $\Delta t = 0.1\text{ s}$.
- Positions are in a consistent metric frame; speeds are derived from consecutive positions.
- The environment enforces a **speed limit** (in Part C.1 assume a single known constant; later parts relax this).
- For Part C, implement vectorized / batched logic — do not assume a Python loop over the batch.
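The speed derivation implied by these constraints can be sketched as a finite difference over consecutive waypoints. This is a minimal NumPy sketch, assuming positions are in meters; the function name `speeds_from_traj` and the constant `DT` are illustrative and not part of the original problem statement. Note that `num_waypoint` positions yield only `num_waypoint - 1` speed samples.

```python
import numpy as np

DT = 0.1  # seconds between consecutive waypoints at 10 Hz


def speeds_from_traj(traj: np.ndarray, dt: float = DT) -> np.ndarray:
    """Per-step speed, shape [batch, num_waypoint - 1] (hypothetical helper)."""
    # Displacement between consecutive waypoints: [batch, num_waypoint - 1, 2]
    deltas = np.diff(traj, axis=1)
    # Euclidean step length divided by the step duration
    return np.linalg.norm(deltas, axis=-1) / dt


# Usage: one trajectory advancing 1 m per step along x
traj = np.stack([np.arange(5.0), np.zeros(5)], axis=-1)[None]  # [1, 5, 2]
speeds = speeds_from_traj(traj)  # shape [1, 4]
```

With 1 m covered every 0.1 s, each of the four steps comes out at 10 m/s.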
### Clarifying Questions to Ask
- In what units are `(x, y)` expressed (meters vs feet), and in what units is the speed limit given (mph, m/s)? Is unit conversion my responsibility?
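- Should the reward be a single scalar per trajectory, or a per-step reward of shape `[batch, num_waypoint]` along the rollout? (Differencing positions yields only `num_waypoint - 1` speed samples, so a per-step reward needs one step padded or dropped.)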
- Is the speed limit a scalar shared across the batch, or per-trajectory / per-step?
- Should I assume the trajectory is the agent's actual rollout (so I am scoring it), or a candidate to be ranked against others?
- For Parts A and B, is the context closer to classic continuous-control RL (driving policy) or RLHF-style fine-tuning? (The interview spans both.)
### Part A — Policy optimization
Explain the difference between **PPO** and **GRPO** as used in modern RL / RLHF-style training. Cover what objective each optimizes, how each estimates the advantage / baseline, and when you would prefer one over the other.
```hint What distinguishes them
In RLHF-style training, both are clipped-ratio policy-gradient methods, usually with KL regularization toward a reference policy, so they share the same broad update structure. The difference lies in what each method requires at training time beyond the policy itself, and in how the advantage signal for a given context is constructed.
```
#### What This Part Should Cover
- PPO's clipped surrogate objective via the probability ratio $r_t(\theta)=\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\text{old}}(a_t\mid s_t)}$, and its reliance on a learned critic / GAE advantage.
- GRPO's group-relative advantage (sample $G$ outputs per context, normalize the reward against the group mean/std) and how this removes the critic.
- The practical tradeoff: critic cost/stability vs. the need for multiple samples per context and group-statistic noise.
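The two ingredients above can be sketched in a few lines. This is an illustrative NumPy sketch, not a reference implementation: the function names, the `eps` stabilizer, and the default `clip_eps=0.2` are assumptions, and the KL term is omitted for brevity. In PPO the `adv` argument would come from a learned critic (for example via GAE); in GRPO it comes from the group normalization shown here.

```python
import numpy as np


def grpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Group-relative advantage. rewards: [num_contexts, G], G samples per context."""
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    # Normalize each reward against its own group's statistics (no critic needed)
    return (rewards - mean) / (std + eps)


def clipped_surrogate(logp_new, logp_old, adv, clip_eps: float = 0.2) -> float:
    """Clipped-ratio surrogate objective (to be maximized)."""
    ratio = np.exp(logp_new - logp_old)  # r_t(theta)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    # Elementwise minimum gives the pessimistic bound
    return float(np.minimum(unclipped, clipped).mean())


# Usage: one context with G = 4 sampled outputs
rewards = np.array([[1.0, 2.0, 3.0, 4.0]])
adv = grpo_advantages(rewards)
# A ratio of 2 with positive advantage is clipped to 1 + clip_eps
objective = clipped_surrogate(np.log(np.array([2.0])), np.array([0.0]), np.array([1.0]))
```

In the usage example the advantages are centered within the group, and the clipped objective evaluates to about 1.2 rather than 2.0, which is the clipping at work.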
### Part B — Reward design choice
In RL, when would you choose a **heuristic (hand-engineered) reward** versus a **learned reward** (e.g., a reward model trained from human preferences)? Discuss the tradeoffs.
```hint Frame by objective specifiability
Split the objective into **hard constraints with crisp rules** (speeding, collisions) versus **fuzzy human preferences** (comfort, naturalness). The first favors heuristics; the second favors learned rewards. Then mention reward hacking on both sides.
```
#### What This Part Should Cover
- Heuristic reward: pros (clear semantics, debuggable, deterministic, good for hard constraints) and cons (incompleteness, brittle proxies, reward hacking).
- Learned reward: pros (captures fuzzy/qualitative preferences) and cons (data cost, reward-model overfitting/misgeneralization, adversarial exploitation).
- A synthesis: hard safety constraints as heuristics, nuanced preferences as a learned signal, usually combined with safety dominating.
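One way to make "safety dominating" concrete is to bound the learned signal and give the rule-based violation a weight larger than that bound's range. The sketch below is a hypothetical illustration: `combined_reward`, `safety_weight`, and the use of `tanh` to bound the reward-model score are assumptions, not something the question prescribes.

```python
import numpy as np


def combined_reward(violated: np.ndarray, learned_score: np.ndarray,
                    safety_weight: float = 10.0) -> np.ndarray:
    """violated: bool [batch] from a rule-based check (e.g. speeding, collision).
    learned_score: float [batch] from a hypothetical learned reward model."""
    # Bound the learned signal to [-1, 1] so it cannot offset a violation
    bounded = np.tanh(learned_score)
    # Any safety_weight > 2 ranks every violating rollout below every clean one
    return bounded - safety_weight * violated.astype(float)


# Usage: a clean rollout the reward model dislikes vs. a violating one it likes
violated = np.array([False, True])
learned_score = np.array([-5.0, 5.0])
r = combined_reward(violated, learned_score)
```

Because the learned term spans at most 2 units, the clean rollout still scores higher than the violating one regardless of the reward model's opinion.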
### Part C — Implement a speed-limit reward
You are given batched 2D trajectories `traj` of shape `[batch, num_waypoint, 2]` sampled at 10 Hz, and the environment has a speed limit.
**C.1** Design a reward (or penalty) that discourages speeding. Provide at least two variants:
- (i) penalize **time spent speeding**, and
- (ii) penalize **how much** the speed exceeds the limit.
**C.2** Discuss the failure cases if the true limit is **time-varying** (e.g., 50 mph for the first 2 s, then 30 mph for the next 3 s) but your reward implementation assumes a single constant limit.
**C.3** More generally: what happens if the speed limit changes but is **not represented in the agent's observation/state**? How would you fix the setup?
```hint Where to start (C.1)
Speed is not given directly — you will need to derive it from the position data. Think about what operation on consecutive positions gives you a velocity magnitude, and what role $\Delta t$ plays.
```
```hint The two variants (C.1)
For variant (i), ask: should a step contribute to the penalty based purely on *whether* the limit is exceeded, ignoring the magnitude of the excess? For variant (ii), ask: should a step contribute in proportion to *how far* over the limit the speed is? Consider what gradient shape each choice produces with respect to speed.
```
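A minimal vectorized sketch of the two C.1 variants follows, assuming metric positions and a limit `v_max` in m/s. The function names and the sign convention (rewards are non-positive) are illustrative choices, not part of the problem statement.

```python
import numpy as np

DT = 0.1  # seconds between waypoints at 10 Hz


def speeds_from_traj(traj, dt=DT):
    # Finite-difference speed, shape [batch, num_waypoint - 1]
    return np.linalg.norm(np.diff(traj, axis=1), axis=-1) / dt


def reward_time_speeding(traj, v_max, dt=DT):
    """Variant (i): minus the seconds spent above the limit, shape [batch]."""
    over = speeds_from_traj(traj, dt) > v_max
    return -dt * over.sum(axis=1)


def reward_excess_speed(traj, v_max, dt=DT):
    """Variant (ii): minus the time-integrated excess speed, shape [batch]."""
    excess = np.maximum(speeds_from_traj(traj, dt) - v_max, 0.0)  # hinge
    return -dt * excess.sum(axis=1)


# Usage: two straight-line trajectories at 10 m/s and 20 m/s, limit 15 m/s
x = np.array([10.0, 20.0])[:, None] * DT * np.arange(5)  # [2, 5]
traj = np.stack([x, np.zeros_like(x)], axis=-1)          # [2, 5, 2]
r_time = reward_time_speeding(traj, v_max=15.0)
r_excess = reward_excess_speed(traj, v_max=15.0)
```

The 10 m/s trajectory is unpenalized under both variants. The 20 m/s trajectory gets about -0.4 under variant (i), for 0.4 s over the limit, and about -2.0 under variant (ii), for 5 m/s of excess held over 0.4 s. Variant (i) is flat in speed everywhere except at the threshold, while variant (ii) grows linearly with the excess.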
```hint The "in general" trap (C.2 → C.3)
Recomputing a per-step limit from the trajectory at training time is an **oracle**: that information is not available to the policy online. Ask yourself what property of the agent's decision problem is broken when the correct action depends on a variable the agent cannot observe.
```
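The C.2 mismatch can be made concrete with the example schedule from the question (50 mph for 2 s, then 30 mph for 3 s). In this sketch the per-step limit `v_max_t` is assumed to be supplied by the environment or map, not inferred from the trajectory; the helper names are hypothetical, and mph is converted with the exact factor 0.44704 m/s per mph.

```python
import numpy as np

DT = 0.1              # seconds between waypoints at 10 Hz
MPH_TO_MPS = 0.44704  # exact conversion factor


def speeds_from_traj(traj, dt=DT):
    return np.linalg.norm(np.diff(traj, axis=1), axis=-1) / dt


def reward_time_speeding(traj, v_max, dt=DT):
    # v_max may be a scalar or an array broadcastable to [batch, num_waypoint - 1]
    over = speeds_from_traj(traj, dt) > v_max
    return -dt * over.sum(axis=1)


# True schedule: 20 steps (2 s) at 50 mph, then 30 steps (3 s) at 30 mph
limit_mph = np.concatenate([np.full(20, 50.0), np.full(30, 30.0)])
v_max_t = limit_mph * MPH_TO_MPS  # [50], one limit per step

# A vehicle holding 40 mph along x for 5 s (51 waypoints)
x = 40.0 * MPH_TO_MPS * DT * np.arange(51)
traj = np.stack([x, np.zeros(51)], axis=-1)[None]  # [1, 51, 2]

r_const = reward_time_speeding(traj, 50.0 * MPH_TO_MPS)  # constant-limit assumption
r_true = reward_time_speeding(traj, v_max_t)             # per-step limit
```

Under the constant 50 mph assumption the rollout looks clean, while against the true schedule it has spent 3 s speeding, so `r_true` is about -3.0. This is the under-penalization case; assuming a constant 30 mph instead would over-penalize the first 2 s.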
#### What This Part Should Cover
The points below span all of Part C.
- C.1: correct speed derivation from positions ($\Delta t = 0.1$ s), both penalty variants stated precisely with their differing gradient behavior, vectorized over the batch, and awareness of the boundary-oscillation / "slam-brakes-at-the-edge" shaping issue.
- C.2: recognition that a constant-limit assumption under-penalizes wherever the true limit drops below the assumed constant and over-penalizes wherever it rises above it, yielding an inconsistent training signal; the corrected $v_{\max}(t)$ formulation.
- C.3: naming the root cause as partial observability / non-stationarity from the agent's view (broken Markov property); fixes that restore it (put the limit in the observation, or give history + a recurrent/contextual policy, or train over randomized limits); and why "infer the limit post-hoc from the trajectory" is an oracle that masks the missing-input problem.
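The first C.3 fix, putting the limit in the observation, amounts to appending one channel to the agent's input. The sketch below is illustrative only: `augment_observation`, the observation layout `[batch, T, obs_dim]`, and the candidate limit values are assumptions. It also draws one limit per trajectory at random, as a simple form of training over randomized limits.

```python
import numpy as np


def augment_observation(obs: np.ndarray, v_max_t: np.ndarray) -> np.ndarray:
    """Append the current speed limit as an extra observation channel.

    obs:     [batch, T, obs_dim]
    v_max_t: [T] or [batch, T], limit in effect at each step (m/s)
    returns: [batch, T, obs_dim + 1]
    """
    limit = np.broadcast_to(np.asarray(v_max_t), obs.shape[:2])[..., None]
    return np.concatenate([obs, limit], axis=-1)


# Usage with placeholder observations and randomized per-trajectory limits
rng = np.random.default_rng(0)
batch, T, obs_dim = 4, 50, 6
obs = rng.normal(size=(batch, T, obs_dim))
# Hypothetical candidate limits in m/s, one draw per trajectory
limits = rng.choice(np.array([13.4, 17.9, 22.4]), size=(batch, 1))
v_max_t = np.broadcast_to(limits, (batch, T))
aug = augment_observation(obs, v_max_t)  # [4, 50, 7]
```

The same `v_max_t` that enters the reward is what the policy sees, so the correct action is again a function of the observed state.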
### Follow-up Questions
- If you must keep a constant-limit reward for legacy reasons, how would you detect at training time that the agent is being trained against a hidden, time-varying objective?
- How would domain randomization over speed limits during training interact with a recurrent policy that infers the limit from context?
- Variant (i) is a step function of speed: piecewise constant, with zero gradient almost everywhere. In a policy-gradient setting where the reward need not be differentiable in the actions, does that matter? When would the smoothness of variant (ii) actually help?
- How would you extend the reward to also discourage unsafe deceleration (e.g., harsh braking right at a limit boundary) without double-counting the speed penalty?
Quick Answer: This question evaluates reinforcement learning competencies—reward engineering for speed-constrained control, distinctions in policy optimization methods (PPO versus GRPO), and reasoning about time-varying constraints and partial observability using trajectory data.