PracHub
QuestionsCoachesLearningGuidesInterview Prep
|Home/Machine Learning/TikTok

Explain RL policy types and modern policy gradients

Last updated: Jun 21, 2026

Quick Overview

This question evaluates understanding of reinforcement learning policy types and modern policy-gradient techniques (TRPO, PPO, GAE) alongside attention mechanism variants (linear attention, Group‑Query Attention), testing competency in algorithmic trade-offs, stability, bias–variance dynamics, and efficiency considerations.

  • medium
  • TikTok
  • Machine Learning
  • Software Engineer

Explain RL policy types and modern policy gradients

Company: TikTok

Role: Software Engineer

Category: Machine Learning

Difficulty: medium

Interview Round: Technical Screen

## Machine Learning Fundamentals: RL Policy Methods & Efficient Attention This is a conceptual machine-learning interview spanning two areas: modern policy-gradient reinforcement learning and efficient attention mechanisms. The interviewer is probing both your mental model of *why* each technique exists and your ability to state the key equations precisely. Answer each Part in order; expect rapid follow-ups asking you to justify a trade-off or write a formula. ### Constraints & Assumptions - This is a whiteboard/verbal discussion — no coding is required for these Parts. - You should be able to write the core objective/estimator for each method (e.g. the PPO clipped objective, the GAE sum) and explain every symbol. - Assume a standard discounted MDP with discount factor $\gamma \in [0,1)$, states $s$, actions $a$, reward $r_t$, a stochastic policy $\pi_\theta(a\mid s)$, and (where needed) a learned value function $V(s)$. - For the attention Part, assume a sequence of length $n$ with per-head dimension $d$, and standard scaled dot-product attention as the baseline. ### Clarifying Questions to Ask - Should I focus on the standard/original formulations of each method, or on specific production variants the team uses? - For the RL methods, do you want the math (objectives, KL constraint, GAE sum) or a higher-level intuition pass first? - For attention, are we discussing these in the context of LLM inference/serving, or training-time compute? - How much depth on implementation pitfalls (e.g. conjugate gradient for TRPO, KV-cache layout for GQA) versus conceptual definitions? ### Part A — Reinforcement Learning: Policy Methods Answer the following on policy-gradient RL. 1. **On-policy vs off-policy.** Define the distinction precisely. What property of an algorithm makes it on-policy or off-policy, and give two examples of each. 2. **TRPO (Trust Region Policy Optimization).** What problem does it solve relative to vanilla policy gradient, and what is the role of the KL-divergence "trust region" constraint? State the constrained objective. 3. **PPO (Proximal Policy Optimization).** How does PPO approximate TRPO's trust-region behavior with first-order optimization? Write the clipped surrogate objective and explain what the clipping prevents. 4. **GAE (Generalized Advantage Estimation).** Define the advantage function, then write the GAE estimator. Explain how the $\lambda$ parameter trades off bias against variance. ```hint Organizing principle Frame all four around one theme: how to take a *useful but bounded* policy-improvement step. On/off-policy is about where the data comes from; TRPO/PPO are about how big a step is safe; GAE is about how accurately you estimate the direction (the advantage) of that step. ``` ```hint Key quantities to define Most of these hang off two objects: the importance-sampling ratio $r_t(\theta)=\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_\text{old}}(a_t\mid s_t)}$ and the advantage $A(s,a)=Q(s,a)-V(s)$. TRPO and PPO both maximize $\mathbb{E}[r_t \hat A_t]$; they differ only in how they constrain the step. ``` ```hint The TD residual links GAE to the rest GAE is built from the one-step TD error $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$. Think about what an exponentially-weighted sum of future $\delta$'s gives you, and what happens at the two extremes of the weighting parameter. ``` #### What This Part Should Cover - A crisp behavior-policy-vs-target-policy definition of on/off-policy, with sample-efficiency vs stability framing and correct examples (e.g. PPO/A2C on-policy; DQN/SAC off-policy). - TRPO stated as constrained optimization: maximize the importance-weighted surrogate advantage subject to a bound on expected KL between old and new policy; awareness of the second-order machinery (Fisher-vector products / conjugate gradient). - The PPO clipped objective written correctly, with the case analysis for $\hat A_t>0$ vs $\hat A_t<0$ and why $\min$ + clip yields a pessimistic (lower-bound) surrogate. - The advantage definition plus the GAE geometric sum, and a correct $\lambda=0$ (low-variance, biased TD) vs $\lambda\to1$ (low-bias, high-variance Monte Carlo) characterization. ### Part B — Attention Mechanisms Answer the following on efficient attention. 1. **Linear attention.** Why is it called "linear" — in which variable does complexity become linear? What approximation or structural change is made versus standard softmax attention, and what information can be lost relative to exact softmax? 2. **Group-Query Attention (GQA).** What is grouped/shared between heads? Why does this help inference efficiency and KV-cache memory? What are the typical quality/accuracy trade-offs, and how does GQA relate to multi-head and multi-query attention? ```hint Where the quadratic cost comes from Standard attention is $\text{softmax}(QK^\top/\sqrt{d})V$ with $Q,K,V\in\mathbb{R}^{n\times d}$. Locate which intermediate matrix is $n\times n$ — that term is the entire reason cost is quadratic in sequence length. ``` ```hint Linear attention: reorder the matmuls If you could remove the softmax nonlinearity and write the scores as $\phi(Q)\phi(K)^\top$ for some feature map $\phi$, associativity lets you compute $\phi(K)^\top V$ *first*. Think about what that buys you and what property of softmax (normalization / competition between keys) you give up. ``` ```hint GQA: what scales the KV cache During autoregressive decoding you cache one K and one V per token *per KV head*. So ask: which knob — number of query heads or number of key/value heads — actually drives cache size and memory bandwidth? GQA sits between two familiar extremes. ``` #### What This Part Should Cover - Correct identification that linear attention removes the explicit $n\times n$ matrix (via a kernel feature map and matmul re-association), making cost roughly linear in $n$; honest accounting of the lost exact-softmax normalization / key-competition / selectivity. - Awareness that exact softmax (e.g. FlashAttention) is often still preferred at moderate context lengths because it preserves behavior while improving constants. - GQA explained as sharing K/V across groups of query heads ($H_{kv}<H_q$), with the KV-cache-size argument tied to the number of KV heads, and the capacity-vs-efficiency trade-off. - Placement of GQA on the spectrum: multi-head (max capacity, max KV cost) → GQA (middle ground) → multi-query (one KV head, cheapest, more potential quality loss). ### Follow-up Questions - PPO and TRPO are usually run on-policy. What concretely breaks if you reuse a large replay buffer of old trajectories to maximize the PPO clipped objective, and how does the clip range $\epsilon$ interact with how stale that data is? - GAE's $\lambda$ and the discount $\gamma$ both down-weight the future. How are their roles different, and why do practitioners tune them separately? - Linear attention promises $O(n)$ scaling but is not dominant in frontier LLMs. Give two concrete reasons exact softmax attention (with FlashAttention-style kernels) still wins in practice. - You must cut KV-cache memory by 4x on a serving model with 32 query heads. Compare moving to GQA with 8 KV heads versus multi-query (1 KV head) versus a linear-attention layer — what would you measure to choose?

Quick Answer: This question evaluates understanding of reinforcement learning policy types and modern policy-gradient techniques (TRPO, PPO, GAE) alongside attention mechanism variants (linear attention, Group‑Query Attention), testing competency in algorithmic trade-offs, stability, bias–variance dynamics, and efficiency considerations.

Related Interview Questions

  • Design multimodal deployment under compute limits - TikTok (easy)
  • Write self-attention and cross-entropy pseudocode - TikTok (medium)
  • Explain overfitting, dropout, normalization, RL post-training - TikTok (medium)
  • Answer ML fundamentals and diagnostics questions - TikTok (hard)
  • Implement AUC-ROC, softmax, and logistic regression - TikTok (medium)
|Home/Machine Learning/TikTok

Explain RL policy types and modern policy gradients

TikTok logo
TikTok
Jan 11, 2026, 12:00 AM
mediumSoftware EngineerTechnical ScreenMachine Learning
4
0
Loading...

Machine Learning Fundamentals: RL Policy Methods & Efficient Attention

This is a conceptual machine-learning interview spanning two areas: modern policy-gradient reinforcement learning and efficient attention mechanisms. The interviewer is probing both your mental model of why each technique exists and your ability to state the key equations precisely. Answer each Part in order; expect rapid follow-ups asking you to justify a trade-off or write a formula.

Constraints & Assumptions

  • This is a whiteboard/verbal discussion — no coding is required for these Parts.
  • You should be able to write the core objective/estimator for each method (e.g. the PPO clipped objective, the GAE sum) and explain every symbol.
  • Assume a standard discounted MDP with discount factor γ∈[0,1)\gamma \in [0,1)γ∈[0,1) , states sss , actions aaa , reward rtr_trt​ , a stochastic policy πθ(a∣s)\pi_\theta(a\mid s)πθ​(a∣s) , and (where needed) a learned value function V(s)V(s)V(s) .
  • For the attention Part, assume a sequence of length nnn with per-head dimension ddd , and standard scaled dot-product attention as the baseline.

Clarifying Questions to Ask

  • Should I focus on the standard/original formulations of each method, or on specific production variants the team uses?
  • For the RL methods, do you want the math (objectives, KL constraint, GAE sum) or a higher-level intuition pass first?
  • For attention, are we discussing these in the context of LLM inference/serving, or training-time compute?
  • How much depth on implementation pitfalls (e.g. conjugate gradient for TRPO, KV-cache layout for GQA) versus conceptual definitions?

Part A — Reinforcement Learning: Policy Methods

Answer the following on policy-gradient RL.

  1. On-policy vs off-policy. Define the distinction precisely. What property of an algorithm makes it on-policy or off-policy, and give two examples of each.
  2. TRPO (Trust Region Policy Optimization). What problem does it solve relative to vanilla policy gradient, and what is the role of the KL-divergence "trust region" constraint? State the constrained objective.
  3. PPO (Proximal Policy Optimization). How does PPO approximate TRPO's trust-region behavior with first-order optimization? Write the clipped surrogate objective and explain what the clipping prevents.
  4. GAE (Generalized Advantage Estimation). Define the advantage function, then write the GAE estimator. Explain how the λ\lambdaλ parameter trades off bias against variance.

What This Part Should Cover

  • A crisp behavior-policy-vs-target-policy definition of on/off-policy, with sample-efficiency vs stability framing and correct examples (e.g. PPO/A2C on-policy; DQN/SAC off-policy).
  • TRPO stated as constrained optimization: maximize the importance-weighted surrogate advantage subject to a bound on expected KL between old and new policy; awareness of the second-order machinery (Fisher-vector products / conjugate gradient).
  • The PPO clipped objective written correctly, with the case analysis for A^t>0\hat A_t>0A^t​>0 vs A^t<0\hat A_t<0A^t​<0 and why min⁡\minmin + clip yields a pessimistic (lower-bound) surrogate.
  • The advantage definition plus the GAE geometric sum, and a correct λ=0\lambda=0λ=0 (low-variance, biased TD) vs λ→1\lambda\to1λ→1 (low-bias, high-variance Monte Carlo) characterization.

Part B — Attention Mechanisms

Answer the following on efficient attention.

  1. Linear attention. Why is it called "linear" — in which variable does complexity become linear? What approximation or structural change is made versus standard softmax attention, and what information can be lost relative to exact softmax?
  2. Group-Query Attention (GQA). What is grouped/shared between heads? Why does this help inference efficiency and KV-cache memory? What are the typical quality/accuracy trade-offs, and how does GQA relate to multi-head and multi-query attention?

What This Part Should Cover

  • Correct identification that linear attention removes the explicit n×nn\times nn×n matrix (via a kernel feature map and matmul re-association), making cost roughly linear in nnn ; honest accounting of the lost exact-softmax normalization / key-competition / selectivity.
  • Awareness that exact softmax (e.g. FlashAttention) is often still preferred at moderate context lengths because it preserves behavior while improving constants.
  • GQA explained as sharing K/V across groups of query heads ( Hkv<HqH_{kv}<H_qHkv​<Hq​ ), with the KV-cache-size argument tied to the number of KV heads, and the capacity-vs-efficiency trade-off.
  • Placement of GQA on the spectrum: multi-head (max capacity, max KV cost) → GQA (middle ground) → multi-query (one KV head, cheapest, more potential quality loss).

Follow-up Questions

  • PPO and TRPO are usually run on-policy. What concretely breaks if you reuse a large replay buffer of old trajectories to maximize the PPO clipped objective, and how does the clip range ϵ\epsilonϵ interact with how stale that data is?
  • GAE's λ\lambdaλ and the discount γ\gammaγ both down-weight the future. How are their roles different, and why do practitioners tune them separately?
  • Linear attention promises O(n)O(n)O(n) scaling but is not dominant in frontier LLMs. Give two concrete reasons exact softmax attention (with FlashAttention-style kernels) still wins in practice.
  • You must cut KV-cache memory by 4x on a serving model with 32 query heads. Compare moving to GQA with 8 KV heads versus multi-query (1 KV head) versus a linear-attention layer — what would you measure to choose?
Loading comments...

Browse More Questions

More Machine Learning•More TikTok•More Software Engineer•TikTok Software Engineer•TikTok Machine Learning•Software Engineer Machine Learning

Write your answer

Your first approved answer each day earns 20 XP.

Sign in to write your answer.
PracHub

Master your tech interviews with 8,000+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • AI Coding Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.