PPO, Bellman Equations, On-/Off-Policy Learning, and Transformer Basics
Context: You are interviewing for a machine learning role with emphasis on reinforcement learning and sequence modeling. Answer the following concisely but completely, using formulas where helpful.
Tasks
- PPO vs. vanilla policy gradients and TRPO
  - Explain the main advantages of Proximal Policy Optimization (PPO) over vanilla policy gradients and Trust Region Policy Optimization (TRPO).
- Bellman equations and actor–critic
  - Write and interpret the Bellman equation for value functions (for V and Q); reference forms are given after this list.
  - Explain how the Bellman equation is used in actor–critic methods.
- On-policy vs. off-policy
  - Contrast on-policy and off-policy learning in terms of data reuse, stability, and sample efficiency.
  - State where PPO fits and why.
- PPO objectives and stability components
  - Derive or describe the clipped surrogate objective in PPO (the objective is sketched after this list).
  - Explain the roles of clipping, the entropy bonus, advantage normalization, and GAE in training stability (a minimal code sketch follows the list).
- Transformer basics for sequence modeling and RL
  - Explain how self-attention works (queries, keys, values); a reference sketch follows the list.
  - Describe positional encodings (the sinusoidal form is given after the list).
  - Compare computational complexity with that of RNNs/CNNs.
  - Discuss when you might prefer Transformers in RL or sequence modeling.
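For the Bellman-equation item, one standard way to write the expectation form for a fixed policy π (notation assumed here for illustration: transition kernel P, reward r(s, a), discount factor γ) is:

```latex
% Bellman expectation equations for a fixed policy \pi, discount \gamma,
% transition kernel P and reward r(s,a) (notation assumed for illustration)
V^{\pi}(s)   = \mathbb{E}_{a \sim \pi(\cdot \mid s),\; s' \sim P(\cdot \mid s, a)}
               \big[\, r(s, a) + \gamma\, V^{\pi}(s') \,\big]

Q^{\pi}(s, a) = \mathbb{E}_{s' \sim P(\cdot \mid s, a)}
               \big[\, r(s, a) + \gamma\, \mathbb{E}_{a' \sim \pi(\cdot \mid s')}\, Q^{\pi}(s', a') \,\big]
```

Both equations express the same recursion: the value of a state (or state–action pair) under π equals the expected immediate reward plus the discounted value of the next state. Actor–critic methods fit the critic to this recursion and use it to score the actor's actions.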
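For the clipped-surrogate item, the objective introduced by PPO (Schulman et al., 2017), written in terms of the probability ratio between the current policy and the rollout (old) policy, is:

```latex
% Probability ratio between the current policy and the (old) rollout policy
r_t(\theta) = \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}

% Clipped surrogate objective (maximized); \epsilon is the clip range,
% \hat{A}_t an advantage estimate (e.g. from GAE)
L^{\mathrm{CLIP}}(\theta) =
  \hat{\mathbb{E}}_t \Big[ \min\!\big( r_t(\theta)\, \hat{A}_t,\;
  \operatorname{clip}\!\big( r_t(\theta),\, 1-\epsilon,\, 1+\epsilon \big)\, \hat{A}_t \big) \Big]
```

The same ratio is also the importance weight relevant to the on-/off-policy discussion: PPO is on-policy in that data comes from π_{θ_old}, but the clipped ratio lets it take several gradient epochs on each batch before discarding it.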
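For the stability-components item, a minimal NumPy sketch of how GAE, advantage normalization, ratio clipping, and an entropy bonus typically fit together (the function names compute_gae and ppo_loss_terms, the argument layout, and the Monte Carlo entropy stand-in are illustrative assumptions, not a reference implementation):

```python
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.

    rewards, dones: arrays of length T; values: array of length T + 1,
    where the last entry is the bootstrap value of the final state.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        # TD residual: one-step Bellman error of the current value estimate
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        # Exponentially weighted sum of future residuals (lambda-weighted)
        gae = delta + gamma * lam * nonterminal * gae
        advantages[t] = gae
    return advantages

def ppo_loss_terms(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate (to be maximized) plus a crude entropy term."""
    # Advantage normalization: zero mean, unit variance within the batch
    adv = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    ratio = np.exp(logp_new - logp_old)               # pi_theta / pi_theta_old
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    # Taking the minimum removes the incentive to push the ratio far from 1
    surrogate = np.minimum(unclipped, clipped).mean()
    # Entropy bonus: usually computed analytically from the policy
    # distribution; -logp_new.mean() is a Monte Carlo stand-in here.
    entropy_bonus = -logp_new.mean()
    return surrogate, entropy_bonus
```

In a full training loop these terms would be combined with a value-function loss (for example, surrogate + c_ent * entropy_bonus - c_vf * value_loss) and maximized for a few epochs per rollout batch.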
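For the self-attention item, a minimal single-head scaled dot-product attention sketch in NumPy (the projection matrices W_q, W_k, W_v are assumed parameters; batching, masking, and multiple heads are omitted):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention.

    X: (seq_len, d_model); W_q, W_k: (d_model, d_k); W_v: (d_model, d_v).
    """
    Q = X @ W_q                          # queries: what each position looks for
    K = X @ W_k                          # keys: what each position offers
    V = X @ W_v                          # values: the content that gets mixed
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len) similarity scores
    weights = softmax(scores, axis=-1)   # each row is a distribution over positions
    return weights @ V                   # weighted sum of values per position
```

The (seq_len × seq_len) score matrix is where the O(n²·d) per-layer cost in sequence length comes from, compared with roughly O(n·d²) for a recurrent layer that processes the sequence step by step.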
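For the positional-encoding item, the sinusoidal encodings from the original Transformer paper (Vaswani et al., 2017), added to the token embeddings so the model can distinguish positions, are:

```latex
% Sinusoidal positional encodings; pos is the position, i the dimension index,
% d_model the embedding width
PE_{(pos,\, 2i)}   = \sin\!\big( pos / 10000^{\,2i / d_{\mathrm{model}}} \big)
PE_{(pos,\, 2i+1)} = \cos\!\big( pos / 10000^{\,2i / d_{\mathrm{model}}} \big)
```

Learned positional embeddings are a common alternative; either way, the encoding restores the order information that pure self-attention, being permutation-invariant, would otherwise discard.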