
Design RL reward for speed limits

Last updated: Jun 21, 2026

Quick Overview

This question evaluates reinforcement learning competencies—reward engineering for speed-constrained control, distinctions in policy optimization methods (PPO versus GRPO), and reasoning about time-varying constraints and partial observability using trajectory data.


Company: Tesla

Role: Machine Learning Engineer

Category: Machine Learning

Difficulty: hard

Interview Round: Technical Screen

Hints

The hints below refer to Parts A, B, and C of the question.

  • Part A, what distinguishes them: Both are clipped-ratio, KL-controlled policy-gradient methods, so they share the same broad update structure. The difference lies in what each method requires at training time beyond the policy itself, and in how the advantage signal for a given context is constructed.
  • Part B, frame by objective specifiability: Split the objective into hard constraints with crisp rules (speeding, collisions) versus fuzzy human preferences (comfort, naturalness). The first favors heuristics; the second favors learned rewards. Then mention reward hacking on both sides.
  • Part C.1, where to start: Speed is not given directly; you will need to derive it from the position data. Think about what operation on consecutive positions gives you a velocity magnitude, and what role $\Delta t$ plays.
  • Part C.1, the two variants: For variant (i), ask whether a step should contribute to the penalty based purely on whether the limit is exceeded, ignoring magnitude. For variant (ii), ask whether a step should contribute in proportion to how far over the limit the speed is. Consider what differentiability and gradient shape each choice produces.
  • Parts C.2 and C.3, the "in general" trap: Recomputing a per-step limit from the trajectory at training time is an oracle: that information is not available to the policy online. Ask yourself what property of the agent's decision problem is broken when the correct action depends on a variable the agent cannot observe.

Feb 12, 2026, 12:00 AM

RL for Autonomous Driving — Conceptual + Practical

You are training a reinforcement-learning agent to drive a vehicle. The interview moves from policy-optimization theory, to reward-design judgment, to a hands-on reward implementation and the subtle failure modes it exposes.

Constraints & Assumptions

  • Trajectories arrive as batched 2D position tensors `traj` of shape `[batch, num_waypoint, 2]`, where the last axis is the (x, y) position.
  • Sampling frequency is 10 Hz, so the time between consecutive waypoints is $\Delta t = 0.1\text{ s}$.
  • Positions are in a consistent metric frame; speeds are derived from consecutive positions.
  • The environment enforces a speed limit (in Part C.1 assume a single known constant; later parts relax this).
  • For Part C, implement vectorized / batched logic — do not assume a Python loop over the batch.

Clarifying Questions to Ask

  • In what units are (x, y) expressed (meters vs feet), and in what units is the speed limit given (mph, m/s)? Is unit conversion my responsibility?
  • Should the reward be a single scalar per trajectory, or a per-step reward of shape [batch, num_waypoint] shaped along the rollout?
  • Is the speed limit a scalar shared across the batch, or per-trajectory / per-step?
  • Should I assume the trajectory is the agent's actual rollout (so I am scoring it), or a candidate to be ranked against others?
  • For Parts A and B, is the context closer to classic continuous-control RL (driving policy) or RLHF-style fine-tuning? (The interview spans both.)

Part A — Policy optimization

Explain the difference between PPO and GRPO as used in modern RL / RLHF-style training. Cover what objective each optimizes, how each estimates the advantage / baseline, and when you would prefer one over the other.

What This Part Should Cover

  • PPO's clipped surrogate objective via the probability ratio $r_t(\theta)=\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\text{old}}(a_t\mid s_t)}$, and its reliance on a learned critic / GAE advantage.
  • GRPO's group-relative advantage (sample $G$ outputs per context, normalize the reward against the group mean/std) and how this removes the critic.
  • The practical tradeoff: critic cost/stability vs. the need for multiple samples per context and group-statistic noise.
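
To make the contrast concrete, here is a minimal NumPy sketch of the two ingredients named in the list above: a group-relative advantage and a clipped surrogate. The function names, the `eps` stabilizer, and the `clip_eps=0.2` default are illustrative assumptions, not taken from the question; the KL term and the PPO critic are omitted.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """rewards: [num_contexts, G]. Normalize each reward against its own group's mean and std."""
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)  # no learned critic involved

def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """Clipped surrogate objective (to be maximized), averaged over all samples."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return np.minimum(unclipped, clipped).mean()

# Two contexts, G = 4 sampled outputs each (made-up rewards)
rewards = np.array([[1.0, 0.0, 0.0, 1.0],
                    [0.2, 0.4, 0.6, 0.8]])
adv = grpo_advantages(rewards)
ratio = np.full_like(rewards, 1.5)   # pretend the new policy upweights every sample
objective = clipped_surrogate(ratio, adv)
```

In PPO the `advantage` argument would instead come from a learned value function (for example via GAE), which is the extra model that the group statistics replace.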

Part B — Reward design choice

In RL, when would you choose a heuristic (hand-engineered) reward versus a learned reward (e.g., a reward model trained from human preferences)? Discuss the tradeoffs.

What This Part Should Cover

  • Heuristic reward: pros (clear semantics, debuggable, deterministic, good for hard constraints) and cons (incompleteness, brittle proxies, reward hacking).
  • Learned reward: pros (captures fuzzy/qualitative preferences) and cons (data cost, reward-model overfitting/misgeneralization, adversarial exploitation).
  • A synthesis: hard safety constraints as heuristics, nuanced preferences as a learned signal, usually combined with safety dominating.
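
One possible way to express "combined, with safety dominating" is sketched below. Everything here is a hypothetical illustration: the `tanh` squashing, the `safety_weight` value, and the function name are assumptions chosen so that the bounded learned term cannot outweigh a meaningful safety violation.

```python
import numpy as np

def combined_reward(safety_penalty, preference_score, safety_weight=10.0):
    """Hypothetical blend: squash the learned score into (-1, 1), subtract a weighted heuristic penalty."""
    return np.tanh(preference_score) - safety_weight * safety_penalty

safety_penalty = np.array([0.0, 0.4])     # e.g. seconds spent speeding (heuristic term)
preference_score = np.array([0.3, 5.0])   # e.g. raw output of a learned reward model
reward = combined_reward(safety_penalty, preference_score)
```

With these numbers the second trajectory scores far higher on the learned term but still ends up with the lower total reward, because its heuristic penalty dominates.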

Part C — Implement a speed-limit reward

You are given batched 2D trajectories `traj` of shape `[batch, num_waypoint, 2]` sampled at 10 Hz, and the environment has a speed limit.

C.1 Design a reward (or penalty) that discourages speeding. Provide at least two variants:

  • (i) penalize time spent speeding, and
  • (ii) penalize how much the speed exceeds the limit.
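
A minimal vectorized sketch of both variants follows, using NumPy in place of the tensor library. It assumes positions are in meters and the limit `v_max` is a single constant in m/s; the function names and the choice of one scalar penalty per trajectory are assumptions, since the question leaves them open.

```python
import numpy as np

DT = 0.1  # seconds between consecutive waypoints at 10 Hz

def speeds_from_traj(traj, dt=DT):
    """traj: [batch, num_waypoint, 2] -> speeds [batch, num_waypoint - 1]."""
    deltas = np.diff(traj, axis=1)               # displacement between consecutive waypoints
    return np.linalg.norm(deltas, axis=-1) / dt  # distance per step divided by step duration

def time_speeding_penalty(traj, v_max, dt=DT):
    """Variant (i): seconds spent above the limit, per trajectory."""
    over = speeds_from_traj(traj, dt) > v_max
    return over.sum(axis=1) * dt

def excess_speed_penalty(traj, v_max, dt=DT):
    """Variant (ii): time-integral of max(0, v - v_max), per trajectory."""
    excess = np.maximum(speeds_from_traj(traj, dt) - v_max, 0.0)
    return excess.sum(axis=1) * dt

# Two straight-line trajectories along x
x = np.arange(5, dtype=float)                      # waypoint index 0..4
slow = np.stack([1.0 * x, np.zeros(5)], axis=-1)   # 1 m per step, i.e. 10 m/s
fast = np.stack([2.0 * x, np.zeros(5)], axis=-1)   # 2 m per step, i.e. 20 m/s
traj = np.stack([slow, fast])                      # [2, 5, 2]

v_max = 15.0
reward_i = -time_speeding_penalty(traj, v_max)     # about [0.0, -0.4]
reward_ii = -excess_speed_penalty(traj, v_max)     # about [0.0, -2.0]
```

Variant (i) is a step function of speed, so it gives the same penalty at 15.1 m/s and at 40 m/s; variant (ii) grows linearly with the excess and is zero exactly at the limit. Note that `num_waypoint` positions yield only `num_waypoint - 1` speeds.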

C.2 Discuss the failure cases if the true limit is time-varying (e.g., 50 mph for the first 2 s, then 30 mph for the next 3 s) but your reward implementation assumes a single constant limit.
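
The 50 mph then 30 mph example can be worked through numerically. The sketch below assumes speeds are already available in m/s with shape `[batch, num_steps]`, and compares a stale constant limit against a per-step limit array; `limit_schedule` and `excess_penalty` are hypothetical helper names.

```python
import numpy as np

DT = 0.1
MPH_TO_MPS = 0.44704  # 1 mph = 0.44704 m/s

def limit_schedule(num_steps, dt=DT):
    """Per-step limit in m/s: 50 mph for the first 2 s, then 30 mph. Shape [num_steps]."""
    switch_step = int(round(2.0 / dt))
    return np.where(np.arange(num_steps) < switch_step, 50.0, 30.0) * MPH_TO_MPS

def excess_penalty(speeds, v_max, dt=DT):
    """speeds: [batch, num_steps]; v_max: scalar or [num_steps], broadcast over the batch."""
    return np.maximum(speeds - v_max, 0.0).sum(axis=1) * dt

num_steps = 50                                          # 5 s of driving at 10 Hz
speeds = np.full((1, num_steps), 45.0 * MPH_TO_MPS)     # constant 45 mph
stale = excess_penalty(speeds, 50.0 * MPH_TO_MPS)       # constant-limit assumption
correct = excess_penalty(speeds, limit_schedule(num_steps))
```

Under the constant 50 mph assumption the trajectory is never penalized, while the per-step limit penalizes 15 mph of excess for the last 3 s. The agent would be rewarded for exactly the behavior the environment forbids.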

C.3 More generally: what happens if the speed limit changes but is not represented in the agent's observation/state? How would you fix the setup?

What This Part Should Cover

This dimension spans all of Part C.

  • C.1: correct speed derivation from positions ($\Delta t = 0.1$ s), both penalty variants stated precisely with their differing gradient behavior, vectorized over the batch, and awareness of the boundary-oscillation / "slam-brakes-at-the-edge" shaping issue.
  • C.2: recognition that a constant assumption under-penalizes when the limit drops and over-penalizes when it rises, yielding an inconsistent training signal; the corrected $v_{\max}(t)$ formulation.
  • C.3: naming the root cause as partial observability / non-stationarity from the agent's view (broken Markov property); fixes that restore it (put the limit in the observation, or give history + a recurrent/contextual policy, or train over randomized limits); and why "infer the limit post-hoc from the trajectory" is an oracle that masks the missing-input problem.
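
Two of the fixes named in C.3, putting the limit in the observation and training over randomized limits, can be sketched together. The 6-dimensional ego state, the candidate limit values, and the function name are all hypothetical; the point is only that the limit becomes an input the policy can condition on.

```python
import numpy as np

def augment_observation(obs, v_max):
    """obs: [batch, obs_dim]; v_max: [batch] current limit in m/s. Returns [batch, obs_dim + 1]."""
    return np.concatenate([obs, v_max[:, None]], axis=-1)

rng = np.random.default_rng(0)
obs = rng.normal(size=(4, 6))                    # hypothetical 6-dim ego state per environment
v_max = rng.choice([13.4, 22.4, 29.1], size=4)   # randomized per-environment limits (illustrative)
obs_aug = augment_observation(obs, v_max)        # the policy now sees the active limit
```

The same `v_max` array that feeds the observation should also feed the reward, so the signal the agent is scored on matches what it can observe.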

Follow-up Questions

  • If you must keep a constant-limit reward for legacy reasons, how would you detect at training time that the agent is being trained against a hidden, time-varying objective?
  • How would domain randomization over speed limits during training interact with a recurrent policy that infers the limit from context?
  • Variant (i) is non-differentiable. In a policy-gradient setting where the reward need not be differentiable in the actions, does that matter? When would the smoothness of variant (ii) actually help?
  • How would you extend the reward to also discourage unsafe deceleration (e.g., harsh braking right at a limit boundary) without double-counting the speed penalty?
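
For the last follow-up, one option is a separate term on deceleration derived from the same positions. The sketch below assumes meters and seconds, and a hypothetical comfort threshold `max_decel` in m/s^2; it is one possible shaping term, not a prescribed answer.

```python
import numpy as np

DT = 0.1

def harsh_braking_penalty(traj, max_decel, dt=DT):
    """Penalize deceleration beyond max_decel (m/s^2). traj: [batch, num_waypoint, 2]."""
    speeds = np.linalg.norm(np.diff(traj, axis=1), axis=-1) / dt   # [batch, num_waypoint - 1]
    accel = np.diff(speeds, axis=1) / dt                           # [batch, num_waypoint - 2]
    excess_decel = np.maximum(-accel - max_decel, 0.0)             # only braking harder than the threshold
    return excess_decel.sum(axis=1) * dt

zeros = np.zeros(5)
brake = np.stack([np.array([0.0, 2.0, 4.0, 5.0, 6.0]), zeros], axis=-1)   # 20 m/s, then abruptly 10 m/s
cruise = np.stack([2.0 * np.arange(5, dtype=float), zeros], axis=-1)      # steady 20 m/s
traj = np.stack([brake, cruise])                                          # [2, 5, 2]
penalty = harsh_braking_penalty(traj, max_decel=3.0)
```

Because this term depends only on the change in speed and never on the limit itself, it does not count the speeding excess a second time; it only makes an abrupt drop at a limit boundary more expensive than a gradual one.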