Explain RL policy types and modern policy gradients
Company: TikTok
Role: Software Engineer
Category: Machine Learning
Difficulty: medium
Interview Round: Technical Screen
## Machine Learning Fundamentals (RL + Attention)
### Part A — Reinforcement Learning
1. Define **on-policy** vs **off-policy** learning.
- What makes an algorithm on-policy/off-policy?
- Give examples of each.
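A strong answer can be grounded in the classic SARSA vs. Q-learning contrast. The sketch below is illustrative (names and hyperparameters are my own choices, not from the question): the only difference between the two updates is whose action the TD target uses.

```python
# Minimal sketch, assuming tabular value estimates stored in a dict Q.
# SARSA is on-policy: its target uses the action the behavior policy
# actually took next. Q-learning is off-policy: its target evaluates
# the greedy policy, regardless of the action actually taken.

def sarsa_update(Q, s, a, r, s2, a2, alpha=0.1, gamma=0.99):
    """On-policy TD update: target bootstraps from (s2, a2),
    the next state-action pair the acting policy produced."""
    target = r + gamma * Q[(s2, a2)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def q_learning_update(Q, s, a, r, s2, actions, alpha=0.1, gamma=0.99):
    """Off-policy TD update: target maxes over actions in s2,
    evaluating the greedy policy while any policy collects data."""
    target = r + gamma * max(Q[(s2, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```

Other examples to mention: REINFORCE, A2C, TRPO, and PPO are on-policy; DQN and DDPG are off-policy (replay buffers contain data from stale policies).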
2. Explain **TRPO** (Trust Region Policy Optimization).
- What problem is it trying to solve compared to vanilla policy gradient?
- What is the role of a KL-divergence “trust region” constraint?
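The constraint the sub-question refers to is usually written as a KL-constrained surrogate maximization (the standard TRPO formulation):

```latex
\max_{\theta}\;
\mathbb{E}_{s,a \sim \pi_{\theta_{\text{old}}}}
\left[
  \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\,
  A^{\pi_{\theta_{\text{old}}}}(s, a)
\right]
\quad \text{s.t.} \quad
\mathbb{E}_{s}\!\left[
  D_{\mathrm{KL}}\!\left(
    \pi_{\theta_{\text{old}}}(\cdot \mid s) \,\|\, \pi_{\theta}(\cdot \mid s)
  \right)
\right] \le \delta
```

The constraint keeps each policy update inside a region where the surrogate objective (built from old-policy data) remains a trustworthy approximation, preventing the destructively large steps vanilla policy gradient can take.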
3. Explain **PPO** (Proximal Policy Optimization).
- How does PPO approximate TRPO’s trust-region behavior?
- What is the clipped surrogate objective trying to prevent?
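The clipped surrogate can be stated in a few lines. This is a minimal sketch for a single sample (the `epsilon=0.2` default is the commonly cited value, not something fixed by the question):

```python
# PPO's clipped surrogate for one (state, action) sample.
# ratio = pi_new(a|s) / pi_old(a|s); advantage is the estimated A(s, a).

def ppo_clipped_objective(ratio, advantage, epsilon=0.2):
    """Taking the min of the unclipped and clipped terms removes any
    incentive to push ratio outside [1 - eps, 1 + eps] in the direction
    the advantage favors, approximating TRPO's trust region with a
    simple first-order objective instead of a hard KL constraint."""
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1 + epsilon), 1 - epsilon) * advantage
    return min(unclipped, clipped)
```

Note the asymmetry: clipping caps the *gain* from moving the ratio far from 1, but the `min` still lets the objective get worse, so the bound is pessimistic rather than symmetric.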
4. Explain **GAE** (Generalized Advantage Estimation).
- What is the definition of the advantage function?
- How does GAE trade off bias vs variance, and what does the \(\lambda\) parameter do?
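The GAE recursion is short enough to write out directly. A minimal sketch, assuming a completed episode (no truncation handling) and illustrative hyperparameter defaults:

```python
# GAE(gamma, lambda): advantage estimates as exponentially weighted
# sums of TD residuals delta_t = r_t + gamma*V(s_{t+1}) - V(s_t).

def gae(rewards, values, gamma=0.99, lam=0.95):
    """values has len(rewards) + 1 entries (bootstrap value last).
    lam=0 reduces to the one-step TD residual (low variance, biased
    by value-function error); lam=1 reduces to the Monte Carlo return
    minus V(s_t) (unbiased given the trajectory, high variance)."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

So \(\lambda\) interpolates between those two extremes, trading bias against variance.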
### Part B — Attention Mechanisms
1. Explain **linear attention**.
- Why is it called “linear” (in what variable does complexity become linear)?
- What approximation or structural change is made vs. standard softmax attention?
- What information can be lost compared with exact softmax attention?
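One common kernelized formulation makes the "linear in sequence length" claim concrete. This sketch assumes the feature map \(\phi(x) = \mathrm{elu}(x) + 1\) (one popular choice among several); the key structural change is replacing softmax with a factorizable kernel so the key-value summary is computed once:

```python
import numpy as np

def phi(x):
    # elu(x) + 1: a positive feature map, one common choice
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """O(n * d^2) instead of softmax attention's O(n^2 * d):
    (phi(K)^T V) is a (d, d) summary computed once, so cost is
    linear in sequence length n. What is lost: the sharp, data-
    dependent weighting softmax gives; the kernel only approximates
    that distribution."""
    Qf, Kf = phi(Q), phi(K)            # (n, d) feature-mapped queries/keys
    KV = Kf.T @ V                      # (d, d) summary of all keys/values
    Z = Qf @ Kf.sum(axis=0)            # (n,) per-query normalizer
    return (Qf @ KV) / Z[:, None]
```

Because the features are positive, each output row is still a convex combination of value rows; what changes is how peaked that combination can be.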
2. Explain **Grouped-Query Attention (GQA)**.
- What is grouped/shared between heads?
- Why does it help inference efficiency and KV-cache memory?
- What are typical quality/accuracy trade-offs?
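The sharing pattern is easiest to see in shapes. A minimal sketch (no masking, batching, or scaling subtleties; all dimensions illustrative): `n_heads` query heads reuse only `n_kv_heads` key/value heads, which is exactly why the KV cache shrinks by `n_heads / n_kv_heads`.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gqa(q, k, v, n_heads, n_kv_heads):
    """q: (n_heads, n, d); k, v: (n_kv_heads, n, d).
    Each group of n_heads // n_kv_heads query heads shares one KV
    head, so only n_kv_heads K/V tensors need caching at inference.
    n_kv_heads = n_heads recovers MHA; n_kv_heads = 1 recovers MQA."""
    group = n_heads // n_kv_heads
    d = q.shape[-1]
    out = np.empty_like(q)
    for h in range(n_heads):
        kv = h // group                        # shared KV head for this query head
        scores = q[h] @ k[kv].T / np.sqrt(d)   # (n, n)
        out[h] = softmax(scores) @ v[kv]
    return out
```

The quality trade-off follows from the same sharing: distinct heads can no longer attend with fully independent key/value projections, which typically costs a small amount of accuracy relative to full multi-head attention while sitting well above multi-query attention.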
Quick Answer: This question evaluates understanding of reinforcement-learning policy types and modern policy-gradient methods (TRPO, PPO, GAE), alongside attention-mechanism variants (linear attention, Grouped-Query Attention). Strong answers address algorithmic trade-offs, training stability, bias–variance dynamics, and inference efficiency.