Reinforcement Learning Reward Design for Control

What's being tested

Interviewers probe your ability to translate a real-world control constraint (e.g., speed limits) into an RL-compatible objective that yields safe, robust policies while remaining trainable and evaluable. They're checking for knowledge of reward engineering, constraint handling (soft penalties vs. formal constrained optimization), partial-observability remedies, and practical ML-engineering tradeoffs: sample efficiency, evaluation metrics, and safe deployment paths. You must argue choices (penalty magnitudes, episodic vs. instantaneous costs), pick an algorithmic family appropriate to the constraint type, and describe how you'd validate and monitor the policy in simulation and real runs.

Core knowledge

Markov Decision Process (MDP) vs POMDP: know that partial observability yields a POMDP; mitigate via state augmentation, stacked observations, or recurrent policies (e.g., RNN/LSTM policy network) to form a belief-like state for control decisions.
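Potential-based reward shaping: shaping of the form $R'(s,a,s') = R(s,a,s') + \gamma \Phi(s') - \Phi(s)$ leaves the optimal policy unchanged, so use it to accelerate learning without altering the objective. The guarantee covers only this form: a misleading $\Phi$ can still slow learning, and ad hoc shaping terms that are not potential differences can introduce unwanted optima.
Hard constraints vs soft penalties: a soft penalty only trades violations against reward. For a budgeted constraint, pose a constrained problem, maximize E[sum R] s.t. E[sum C] ≤ d, and solve it with Constrained Policy Optimization (CPO) or a Lagrangian method on L(π,λ) = E[sum R] - λ(E[sum C] - d), maximizing over π and minimizing over λ ≥ 0. These methods satisfy the constraint in expectation; a zero-tolerance constraint needs at least a high-cost terminal penalty and, preferably, a check outside the learned policy.
Instantaneous vs cumulative costs: a time-averaged constraint (e.g., expected time spent over the speed limit) needs a cumulative cost; a per-step penalty shapes behavior at each step but cannot express a budget over the episode. Choose based on how the system spec states the limit.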
Reward scale and normalization: unbalanced magnitudes cause optimization to ignore smaller components; normalize components (z-score or divide by expected magnitude) and treat reward weights as hyperparameters tuned with sensitivity sweeps.
Sparse catastrophic penalties: a single huge terminal penalty for violation often causes unstable exploration; prefer clipped continuous penalties plus an indicator for catastrophic failure to stabilize gradients.
Evaluation metrics for control under constraints: track (a) violation rate (% steps exceeding speed), (b) time-weighted violation (area over limit), (c) cumulative reward, (d) safety-critical percentiles (e.g., p99 worst-case), and (e) sample-efficiency (environment steps to target performance).
Algorithm choices & sample efficiency: on-policy methods like PPO are stable but sample-inefficient; constrained variants (CPO, Lagrangian PPO) are better suited when the constraint is a budget on expected cost; model-based or off-policy methods (e.g., SAC) can reduce environment interactions but require careful reward/constraint integration.
Off-policy/offline evaluation: use importance sampling / Weighted IS / Doubly Robust estimators to evaluate policies without deployment, but expect high variance on long horizons; prefer high-quality simulators for safe testing.
Sim-to-real considerations: use CARLA/Gazebo or a high-fidelity simulator with domain randomization to cover sensor/state distributional shift; instrument for online monitoring and rollback on violations.
Tip: train a separate constraint critic (a value estimate of future constraint cost) so you can estimate expected future violations and feed that estimate into the Lagrangian update.
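
The Lagrangian update mentioned above can be sketched in a few lines. This is a minimal illustration, not a full constrained-PPO implementation: the function names, the learning rate, and the 1/(1 + λ) rescaling of the combined advantage are assumptions chosen for the example, and the cost advantage is assumed to come from a constraint critic.

```python
def dual_ascent_step(lmbda, avg_episode_cost, budget, lr=0.05):
    """Projected gradient step on the multiplier.

    lambda grows while the measured cost exceeds the budget d and shrinks
    otherwise; the max() keeps it non-negative.
    """
    return max(0.0, lmbda + lr * (avg_episode_cost - budget))


def lagrangian_advantage(reward_adv, cost_adv, lmbda):
    """Advantage used in the policy update: reward advantage minus
    lambda-weighted cost advantage. Dividing by (1 + lambda) is an optional
    rescaling (an assumption here) that keeps magnitudes bounded as lambda grows.
    """
    return (reward_adv - lmbda * cost_adv) / (1.0 + lmbda)


if __name__ == "__main__":
    lmbda = 0.0
    # Hypothetical average episode costs from three training iterations.
    for cost in [12.0, 9.0, 4.0]:
        lmbda = dual_ascent_step(lmbda, cost, budget=5.0, lr=0.1)
        print(round(lmbda, 3))
```

With these placeholder costs the multiplier rises while the policy is over budget and starts to fall once the cost drops below it.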

Worked example — Design RL reward for speed limits

First 30 seconds: clarify whether the speed limit is a hard safety constraint (zero tolerance) or a soft operational constraint, the control frequency, which sensors measure speed, whether the environment is episodic, and the acceptable tradeoff between speed compliance and completion time.

Frame the solution around three pillars: (1) objective decomposition (progress vs. constraint cost vs. comfort/control cost), (2) constraint enforcement method (Lagrangian or constrained policy algorithm vs. shaped penalty), and (3) evaluation and deployment safety checks. Propose a reward: R = w_p * progress - w_v * max(0, v - v_limit) - w_u * ||u||^2, where v is the current speed and u is the control input, with a separate catastrophic penalty for sustained high-speed violations.

Discuss a concrete design decision: when violations must stay within a soft budget d, prefer a Lagrangian PPO style loop that updates λ by dual ascent to balance speed and safety, because a fixed hand-tuned weight is brittle. Close by saying you'd run sensitivity sweeps on w_v and λ (its initial value and learning rate when it is learned), compare PPO against a constrained variant on violation rate and task completion, and add a safety fallback (a simple handcrafted controller) that takes over during real-world rollouts if violations exceed thresholds.
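
The proposed reward translates directly into code. The sketch below is a per-step version under stated assumptions: the default weights are arbitrary placeholders to be tuned, `progress` is whatever per-step progress measure the task defines, and the catastrophic penalty for sustained violations is left out for brevity. Returning the overspeed separately makes it reusable as the constraint cost C in a Lagrangian setup.

```python
def speed_limit_reward(progress, v, v_limit, u, w_p=1.0, w_v=0.5, w_u=0.01):
    """Per-step reward R = w_p * progress - w_v * max(0, v - v_limit) - w_u * ||u||^2.

    progress : per-step task progress (e.g., metres advanced along the route)
    v        : current speed
    v_limit  : current speed limit, same units as v
    u        : iterable of control inputs (e.g., [throttle, steering])
    Returns (reward, overspeed); overspeed can double as the constraint cost.
    """
    overspeed = max(0.0, v - v_limit)
    control_cost = sum(x * x for x in u)  # squared L2 norm of the control
    reward = w_p * progress - w_v * overspeed - w_u * control_cost
    return reward, overspeed


if __name__ == "__main__":
    # 2 units over the limit with a modest control effort: reward is about -0.05.
    print(speed_limit_reward(progress=1.0, v=12.0, v_limit=10.0, u=[1.0, 2.0]))
    # Under the limit with zero control: only the progress term remains.
    print(speed_limit_reward(progress=1.0, v=8.0, v_limit=10.0, u=[0.0, 0.0]))
```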

A second angle — time-varying limits and partial observability

When the speed limit changes over time (e.g., variable signage or geo-fenced zones) or is not always observed, condition your policy on the time/context signal or on the inferred limit. If the limit is observable, include it as an explicit input feature and train a single policy conditioned on it: π(a|s,limit). If it is unobserved, treat it as a latent variable: use recurrence to infer it from sensory cues (sign detection, GNSS geofence) and maintain a constraint critic that predicts expected future violations given the inferred belief. For time-varying constraints, prefer a constrained formulation with a separate budget d_k per zone or time window k (E[sum of C_t over window k] ≤ d_k), or adapt the Lagrange multiplier online to respect shifting safety budgets. Monitor closely during distributional shift and fall back quickly to a rule-based controller.
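
One lightweight way to realize π(a|s,limit) with some memory is to stack recent observations and append the current limit as a feature. The class below is a hypothetical helper written for illustration, not part of any RL library. It assumes flat numeric observations and a limit value that is either observed or estimated upstream (for example by a sign detector).

```python
from collections import deque


class LimitConditionedObs:
    """Builds a policy input from the last k observations plus the speed limit."""

    def __init__(self, k, obs_dim):
        # Start with zero-filled frames so the output size is fixed from step one.
        self.frames = deque([[0.0] * obs_dim for _ in range(k)], maxlen=k)

    def step(self, obs, limit):
        """Push the newest observation and return [oldest ... newest, limit]."""
        self.frames.append([float(x) for x in obs])
        flat = [x for frame in self.frames for x in frame]
        return flat + [float(limit)]


if __name__ == "__main__":
    builder = LimitConditionedObs(k=3, obs_dim=2)
    print(builder.step([1.0, 2.0], limit=13.9))
    print(builder.step([1.5, 2.5], limit=8.3))  # limit changed between steps
```

Stacking is the simplest remedy. A recurrent policy serves the same purpose when the cues needed to infer the limit lie further back than k steps.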

Common pitfalls

Pitfall: Reward hacking via poorly scaled penalties. If you set the violation penalty too small, the agent will maximize progress at the cost of frequent violations; too large and the agent refuses to move. Always report a sweep and the Pareto frontier between reward and violations.
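
Extracting the Pareto frontier from a penalty sweep takes only a short helper. In this sketch each run is summarized as a tuple (w_v, mean return, violation rate); the numbers in the usage example are made-up placeholders to show the mechanics, not measured results.

```python
def pareto_front(results):
    """Keep the runs that no other run dominates.

    results: list of (w_v, mean_return, violation_rate) tuples.
    A run dominates another if its return is at least as high and its
    violation rate at least as low, with one of the two strictly better.
    """
    front = []
    for cand in results:
        dominated = any(
            other[1] >= cand[1]
            and other[2] <= cand[2]
            and (other[1] > cand[1] or other[2] < cand[2])
            for other in results
        )
        if not dominated:
            front.append(cand)
    # Sort by violation rate so the tradeoff curve reads left to right.
    return sorted(front, key=lambda r: r[2])


if __name__ == "__main__":
    # Placeholder sweep over the violation weight w_v (illustrative values only).
    sweep = [(0.1, 95.0, 0.20), (0.5, 90.0, 0.05), (1.0, 85.0, 0.06), (5.0, 40.0, 0.00)]
    print(pareto_front(sweep))
    # The w_v = 1.0 run is dropped: w_v = 0.5 has higher return and fewer violations.
```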

Pitfall: Confusing hard constraints with soft penalties. Promising “no violations” while using only a soft penalty is risky. Interviewers expect you to discuss algorithmic enforcement (CPO/Lagrangian) for budgeted constraints and verification-level checks for hard safety.
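
A learned policy can be wrapped in a simple runtime check that overrides its action when the next-step speed would exceed the limit. The sketch below is one possible form of such a check, under strong assumptions: one-dimensional kinematics (v_next = v + a * dt), a hypothetical safety margin, and a hypothetical braking bound. It limits a one-step prediction only and is not a formal verification of the policy.

```python
def shielded_accel(policy_accel, v, v_limit, dt, margin=0.5, max_brake=-3.0):
    """Return (accel_to_apply, overridden).

    Assumes v_next = v + accel * dt. The largest acceptable acceleration is
    the one that lands on (v_limit - margin) at the next step, floored at the
    vehicle's braking capability max_brake (a negative number).
    """
    target = v_limit - margin
    safe_accel = max(max_brake, (target - v) / dt)
    overridden = policy_accel > safe_accel
    return (safe_accel if overridden else policy_accel), overridden


if __name__ == "__main__":
    print(shielded_accel(0.5, v=9.0, v_limit=10.0, dt=0.5))   # policy action kept
    print(shielded_accel(2.0, v=9.0, v_limit=10.0, dt=0.5))   # clipped to stay under the margin
    print(shielded_accel(0.0, v=12.0, v_limit=10.0, dt=0.5))  # already over: brake at the bound
```

Logging the `overridden` flag gives a direct measure of how often the policy relies on the fallback, which is useful to track alongside the violation rate.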

Pitfall: Ignoring partial observability and evaluation variance. A policy that performed well in the simulator by relying on signals that are unobservable on the real system will fail at deployment; likewise, a high-variance off-policy estimate reported without variance reduction or a confidence interval can make a policy look safer than it is.
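
The variance problem is easy to see by computing ordinary and weighted importance sampling side by side, together with the effective sample size. This sketch assumes each logged trajectory already comes with its importance ratio (the product over steps of evaluation-policy probability divided by behavior-policy probability) and its return. The function name and inputs are illustrative.

```python
def off_policy_estimates(ratios, returns):
    """Return (ordinary_is, weighted_is, effective_sample_size).

    ratios  : per-trajectory importance ratios, one per logged trajectory
    returns : per-trajectory returns (or cumulative constraint costs)
    """
    n = len(ratios)
    total_w = sum(ratios)
    weighted_sum = sum(w * g for w, g in zip(ratios, returns))
    ordinary_is = weighted_sum / n        # unbiased, but high variance
    weighted_is = weighted_sum / total_w  # biased, usually lower variance
    ess = total_w ** 2 / sum(w * w for w in ratios)
    return ordinary_is, weighted_is, ess


if __name__ == "__main__":
    # Two trajectories; the second one dominates the weights.
    print(off_policy_estimates([1.0, 3.0], [2.0, 4.0]))  # (7.0, 3.5, 1.6)
```

An effective sample size far below the number of logged trajectories signals that a few trajectories dominate the estimate, so a safety claim based on it deserves little trust.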

Connections

Interviewers may pivot to safe RL literature (formal safety guarantees, reachability analysis), imitation learning/offline RL (when expert demonstrations exist to avoid unsafe exploration), or model-based control (MPC + learned models for constraint satisfaction).
