Explain RL policy types and modern policy gradients
Company: TikTok
Role: Software Engineer
Category: Machine Learning
Difficulty: medium
Interview Round: Technical Screen
## Machine Learning Fundamentals: RL Policy Methods & Efficient Attention
This is a conceptual machine-learning interview spanning two areas: modern policy-gradient reinforcement learning and efficient attention mechanisms. The interviewer is probing both your mental model of *why* each technique exists and your ability to state the key equations precisely. Answer each Part in order; expect rapid follow-ups asking you to justify a trade-off or write a formula.
### Constraints & Assumptions
- This is a whiteboard/verbal discussion — no coding is required for these Parts.
- You should be able to write the core objective/estimator for each method (e.g. the PPO clipped objective, the GAE sum) and explain every symbol.
- Assume a standard discounted MDP with discount factor $\gamma \in [0,1)$, states $s$, actions $a$, reward $r_t$, a stochastic policy $\pi_\theta(a\mid s)$, and (where needed) a learned value function $V(s)$.
- For the attention Part, assume a sequence of length $n$ with per-head dimension $d$, and standard scaled dot-product attention as the baseline.
### Clarifying Questions to Ask
- Should I focus on the standard/original formulations of each method, or on specific production variants the team uses?
- For the RL methods, do you want the math (objectives, KL constraint, GAE sum) or a higher-level intuition pass first?
- For attention, are we discussing these in the context of LLM inference/serving, or training-time compute?
- How much depth on implementation pitfalls (e.g. conjugate gradient for TRPO, KV-cache layout for GQA) versus conceptual definitions?
### Part A — Reinforcement Learning: Policy Methods
Answer the following on policy-gradient RL.
1. **On-policy vs off-policy.** Define the distinction precisely. What property of an algorithm makes it on-policy or off-policy, and give two examples of each.
2. **TRPO (Trust Region Policy Optimization).** What problem does it solve relative to vanilla policy gradient, and what is the role of the KL-divergence "trust region" constraint? State the constrained objective.
3. **PPO (Proximal Policy Optimization).** How does PPO approximate TRPO's trust-region behavior with first-order optimization? Write the clipped surrogate objective and explain what the clipping prevents.
4. **GAE (Generalized Advantage Estimation).** Define the advantage function, then write the GAE estimator. Explain how the $\lambda$ parameter trades off bias against variance.
```hint Organizing principle
Frame all four around one theme: how to take a *useful but bounded* policy-improvement step. On/off-policy is about where the data comes from; TRPO/PPO are about how big a step is safe; GAE is about how accurately you estimate the direction (the advantage) of that step.
```
```hint Key quantities to define
Most of these hang off two objects: the importance-sampling ratio $r_t(\theta)=\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_\text{old}}(a_t\mid s_t)}$ and the advantage $A(s,a)=Q(s,a)-V(s)$. TRPO and PPO both maximize $\mathbb{E}[r_t \hat A_t]$; they differ only in how they constrain the step.
```
```hint The TD residual links GAE to the rest
GAE is built from the one-step TD error $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$. Think about what an exponentially-weighted sum of future $\delta$'s gives you, and what happens at the two extremes of the weighting parameter.
```
#### What This Part Should Cover
- A crisp behavior-policy-vs-target-policy definition of on/off-policy, with sample-efficiency vs stability framing and correct examples (e.g. PPO/A2C on-policy; DQN/SAC off-policy).
- TRPO stated as constrained optimization: maximize the importance-weighted surrogate advantage subject to a bound on expected KL between old and new policy; awareness of the second-order machinery (Fisher-vector products / conjugate gradient).
- The PPO clipped objective written correctly, with the case analysis for $\hat A_t>0$ vs $\hat A_t<0$ and why $\min$ + clip yields a pessimistic (lower-bound) surrogate.
- The advantage definition plus the GAE geometric sum, and a correct $\lambda=0$ (low-variance, biased TD) vs $\lambda\to1$ (low-bias, high-variance Monte Carlo) characterization.
### Part B — Attention Mechanisms
Answer the following on efficient attention.
1. **Linear attention.** Why is it called "linear" — in which variable does complexity become linear? What approximation or structural change is made versus standard softmax attention, and what information can be lost relative to exact softmax?
2. **Group-Query Attention (GQA).** What is grouped/shared between heads? Why does this help inference efficiency and KV-cache memory? What are the typical quality/accuracy trade-offs, and how does GQA relate to multi-head and multi-query attention?
```hint Where the quadratic cost comes from
Standard attention is $\text{softmax}(QK^\top/\sqrt{d})V$ with $Q,K,V\in\mathbb{R}^{n\times d}$. Locate which intermediate matrix is $n\times n$ — that term is the entire reason cost is quadratic in sequence length.
```
```hint Linear attention: reorder the matmuls
If you could remove the softmax nonlinearity and write the scores as $\phi(Q)\phi(K)^\top$ for some feature map $\phi$, associativity lets you compute $\phi(K)^\top V$ *first*. Think about what that buys you and what property of softmax (normalization / competition between keys) you give up.
```
```hint GQA: what scales the KV cache
During autoregressive decoding you cache one K and one V per token *per KV head*. So ask: which knob — number of query heads or number of key/value heads — actually drives cache size and memory bandwidth? GQA sits between two familiar extremes.
```
#### What This Part Should Cover
- Correct identification that linear attention removes the explicit $n\times n$ matrix (via a kernel feature map and matmul re-association), making cost roughly linear in $n$; honest accounting of the lost exact-softmax normalization / key-competition / selectivity.
- Awareness that exact softmax (e.g. FlashAttention) is often still preferred at moderate context lengths because it preserves behavior while improving constants.
- GQA explained as sharing K/V across groups of query heads ($H_{kv}<H_q$), with the KV-cache-size argument tied to the number of KV heads, and the capacity-vs-efficiency trade-off.
- Placement of GQA on the spectrum: multi-head (max capacity, max KV cost) → GQA (middle ground) → multi-query (one KV head, cheapest, more potential quality loss).
### Follow-up Questions
- PPO and TRPO are usually run on-policy. What concretely breaks if you reuse a large replay buffer of old trajectories to maximize the PPO clipped objective, and how does the clip range $\epsilon$ interact with how stale that data is?
- GAE's $\lambda$ and the discount $\gamma$ both down-weight the future. How are their roles different, and why do practitioners tune them separately?
- Linear attention promises $O(n)$ scaling but is not dominant in frontier LLMs. Give two concrete reasons exact softmax attention (with FlashAttention-style kernels) still wins in practice.
- You must cut KV-cache memory by 4x on a serving model with 32 query heads. Compare moving to GQA with 8 KV heads versus multi-query (1 KV head) versus a linear-attention layer — what would you measure to choose?
Quick Answer: This question evaluates understanding of reinforcement learning policy types and modern policy-gradient techniques (TRPO, PPO, GAE) alongside attention mechanism variants (linear attention, Group‑Query Attention), testing competency in algorithmic trade-offs, stability, bias–variance dynamics, and efficiency considerations.