Explain RL policy types and modern policy gradients

This is a Machine Learning interview question from TikTok for Software Engineer roles.

Question

Machine Learning Fundamentals (RL + Attention)

Part A — Reinforcement Learning

Define on-policy vs off-policy learning.
- What makes an algorithm on-policy/off-policy?
- Give examples of each.
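
One way to make the on-policy/off-policy distinction concrete is to compare the bootstrap targets of SARSA (on-policy) and Q-learning (off-policy). The sketch below is illustrative only; the function names and the list-based `q_next` representation are assumptions, not part of the original question.

```python
def sarsa_target(reward, gamma, q_next, next_action):
    # On-policy: bootstrap from the action the behavior policy
    # actually took in the next state.
    return reward + gamma * q_next[next_action]


def q_learning_target(reward, gamma, q_next):
    # Off-policy: bootstrap from the greedy action, regardless of
    # which action the behavior policy went on to take.
    return reward + gamma * max(q_next)


# Hypothetical action values for the next state (3 actions).
q_next = [1.0, 5.0, 2.0]
on_policy = sarsa_target(reward=0.0, gamma=0.9, q_next=q_next, next_action=2)
off_policy = q_learning_target(reward=0.0, gamma=0.9, q_next=q_next)
```

The two targets differ whenever the behavior policy explores, which is why Q-learning can learn from data generated by another policy while SARSA evaluates the policy that is actually collecting the data.
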
Explain TRPO (Trust Region Policy Optimization).
- What problem is it trying to solve compared to vanilla policy gradient?
- What is the role of a KL-divergence “trust region” constraint?
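
As a rough illustration of the trust-region idea, the sketch below computes the KL divergence between an old and a new action distribution in a single state and checks it against a threshold. It assumes discrete actions; `delta = 0.01` and the helper names are hypothetical, and TRPO itself enforces the constraint on the average KL over sampled states using a constrained optimization step rather than a simple check like this.

```python
import math


def kl_categorical(p_old, p_new):
    # KL(p_old || p_new) for two discrete distributions over the same
    # actions. Assumes p_new is nonzero wherever p_old is nonzero.
    return sum(p * math.log(p / q) for p, q in zip(p_old, p_new) if p > 0)


def within_trust_region(p_old, p_new, delta=0.01):
    # True if the policy change in this state respects the KL budget.
    return kl_categorical(p_old, p_new) <= delta


small_step = within_trust_region([0.5, 0.5], [0.52, 0.48])
large_step = within_trust_region([0.5, 0.5], [0.9, 0.1])
```

The first update barely moves the distribution and stays inside the region; the second changes the policy sharply in one step and would be rejected.
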
Explain PPO (Proximal Policy Optimization).
- How does PPO approximate TRPO’s trust-region behavior?
- What is the clipped surrogate objective trying to prevent?
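
The clipped surrogate can be sketched per sample as follows, where `ratio` is the new-to-old action probability ratio and `advantage` is the advantage estimate. The default `eps=0.2` is a commonly used value and an assumption here; a real implementation would average this quantity over a batch and maximize it.

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    # min(r * A, clip(r, 1 - eps, 1 + eps) * A) for a single sample.
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# Positive advantage: the gain from raising the ratio is capped at 1 + eps.
capped_gain = ppo_clip_objective(ratio=1.5, advantage=1.0)
# Negative advantage: the gain from lowering the ratio is capped at 1 - eps.
capped_loss = ppo_clip_objective(ratio=0.5, advantage=-1.0)
```

Taking the minimum makes the objective a pessimistic bound: once the ratio leaves the `[1 - eps, 1 + eps]` band in the direction that would improve the objective, the gradient for that sample vanishes, so there is no incentive to push the policy further from the old one.
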
Explain GAE (Generalized Advantage Estimation).
- What is the definition of the advantage function?
- How does GAE trade off bias vs variance, and what does the $\lambda$ parameter do?
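
A minimal sketch of GAE, assuming a single trajectory segment with no episode boundary inside it and a `values` list that carries one extra bootstrap entry for the state after the last step. It uses the TD residual $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ and accumulates $\hat{A}_t = \sum_{l \ge 0} (\gamma\lambda)^l \delta_{t+l}$ backwards in time. The function name and defaults are assumptions.

```python
def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    # values must have len(rewards) + 1 entries; the last one is the
    # bootstrap value of the state reached after the final reward
    # (use 0.0 there if the segment ends in a terminal state).
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


rewards = [1.0, 1.0, 1.0]
values = [0.0, 0.0, 0.0, 0.0]
one_step = compute_gae(rewards, values, gamma=1.0, lam=0.0)
monte_carlo = compute_gae(rewards, values, gamma=1.0, lam=1.0)
```

With $\lambda = 0$ the estimate reduces to the one-step TD residual (lower variance, more bias from the value function); with $\lambda = 1$ it becomes the discounted return minus the baseline (less bias, higher variance). Intermediate values interpolate between the two.
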

Part B — Attention Mechanisms

Explain linear attention.
- Why is it called “linear” (in what variable does complexity become linear)?
- What approximation or structural change is made vs. standard softmax attention?
- What information can be lost compared with exact softmax attention?
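
One common formulation of linear attention replaces the softmax kernel with a feature map $\phi$ so that the weights factor as $\phi(q_i)^\top \phi(k_j)$. The non-causal sketch below assumes $\phi(x) = \mathrm{elu}(x) + 1$, which is one choice among several, and uses plain Python lists for clarity rather than speed. It sums $\phi(k_j) v_j^\top$ once and reuses that summary for every query, so the cost grows linearly in sequence length.

```python
import math


def phi(x):
    # elu(x) + 1: a positive feature map (x + 1 if x > 0, else exp(x)).
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]


def linear_attention(Q, K, V):
    d, dv = len(K[0]), len(V[0])
    S = [[0.0] * dv for _ in range(d)]  # sum_j phi(k_j) v_j^T
    z = [0.0] * d                       # sum_j phi(k_j)
    for k, v in zip(K, V):              # one pass over keys/values
        fk = phi(k)
        for a in range(d):
            z[a] += fk[a]
            for b in range(dv):
                S[a][b] += fk[a] * v[b]
    out = []
    for q in Q:                         # one pass over queries
        fq = phi(q)
        denom = sum(fq[a] * z[a] for a in range(d))
        out.append([sum(fq[a] * S[a][b] for a in range(d)) / denom
                    for b in range(dv)])
    return out


def quadratic_reference(Q, K, V):
    # Same kernel, computed the O(n^2) way with explicit pairwise weights.
    out = []
    for q in Q:
        fq = phi(q)
        w = [sum(a * b for a, b in zip(fq, phi(k))) for k in K]
        total = sum(w)
        out.append([sum(wj * v[b] for wj, v in zip(w, V)) / total
                    for b in range(len(V[0]))])
    return out
```

Both functions return the same values; the difference is that `linear_attention` never materializes the n-by-n weight matrix. The trade-off is that the kernel $\phi(q)^\top \phi(k)$ is not the softmax kernel, so the sharply peaked attention patterns that softmax can express are generally harder to reproduce.
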
Explain Grouped-Query Attention (GQA).
- What is grouped/shared between heads?
- Why does it help inference efficiency and KV-cache memory?
- What are typical quality/accuracy trade-offs?
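
The KV-cache effect of GQA can be estimated with simple arithmetic. The sketch below assumes a hypothetical model with 32 layers, 32 query heads, head dimension 128, a 4096-token context, and 2-byte (fp16/bf16) cache entries; none of these numbers come from the original question. It also shows how query heads map onto shared KV heads.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    # Leading factor 2 accounts for storing both keys and values.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


def kv_head_for_query_head(q_head, num_q_heads, num_kv_heads):
    # Consecutive query heads share one KV head within each group.
    assert num_q_heads % num_kv_heads == 0
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size


mha_bytes = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
gqa_bytes = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=4096)
```

Under these assumptions, going from 32 KV heads (standard multi-head attention) to 8 KV heads cuts the cache from 2 GiB to 512 MiB per sequence. Setting `num_kv_heads=1` gives the multi-query attention extreme, where all query heads share a single KV head.
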
