This question evaluates understanding of differentiable routing in Mixture-of-Experts (MoE) architectures in machine learning, covering gradient estimation methods (such as the straight-through estimator (STE), Gumbel-Softmax, steep relaxations, and REINFORCE), auxiliary load-balancing losses, and strategies to avoid expert collapse.
You are working with an MoE layer that routes each token to k experts (often k ∈ {1, 2}). The current router makes hard, non-differentiable decisions (e.g., argmax over the logits), which prevents the router from being trained end-to-end via gradient descent.
Let the router produce logits z ∈ R^E for E experts per token. Hard routing uses g = one_hot(argmax(z)) (or top-k), and the layer output is y = Σ_j g_j · Expert_j(x).
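For concreteness, here is a minimal PyTorch sketch of the hard router described above; the class name `HardTop1Router` and the two-layer expert MLPs are illustrative assumptions, not part of the question:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HardTop1Router(nn.Module):
    """Hard (non-differentiable) top-1 routing: gradients never reach the router weights."""
    def __init__(self, d_model: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                       # x: (tokens, d_model)
        z = self.router(x)                      # logits, shape (tokens, E)
        idx = z.argmax(dim=-1)                  # argmax has no gradient
        g = F.one_hot(idx, z.size(-1)).to(x.dtype)   # hard gate, g_j in {0, 1}
        # y = sum_j g_j * Expert_j(x); because g is piecewise constant in z,
        # dL/d(router weights) is zero almost everywhere.
        # (All experts are evaluated densely here purely for illustration.)
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)  # (tokens, E, d_model)
        return (g.unsqueeze(-1) * expert_out).sum(dim=1)
```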
Propose modifications that make the router parameters learnable via gradient descent. Compare at least two of the following approaches (you may cover more):
- A steep (low-temperature) softmax relaxation of the gate, i.e. dense soft routing with a temperature that sharpens over training
- The straight-through estimator (STE): hard routing in the forward pass, surrogate soft gradients in the backward pass
- Gumbel-Softmax (Concrete) relaxation of the categorical routing choice, optionally in its straight-through variant
- REINFORCE (score-function) gradients through the discrete routing decision
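As a concrete point of comparison among the options above, here is a hedged sketch of straight-through Gumbel-Softmax routing (hard one-hot gate in the forward pass, soft gradients in the backward pass); the function name and temperature value are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def st_gumbel_route(z: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Straight-through Gumbel-Softmax gate for top-1 routing.

    Forward pass: a hard one-hot gate, so only the selected expert needs to run.
    Backward pass: gradients flow through the soft relaxation, giving the
    router logits z a (biased, lower-variance) learning signal.
    """
    # Sample Gumbel(0, 1) noise; the small constants guard against log(0).
    gumbel = -torch.log(-torch.log(torch.rand_like(z) + 1e-10) + 1e-10)
    soft = F.softmax((z + gumbel) / tau, dim=-1)                   # differentiable relaxation
    hard = F.one_hot(soft.argmax(dim=-1), z.size(-1)).to(z.dtype)  # non-differentiable one-hot
    # Straight-through trick: hard values in the forward pass, soft gradients backward.
    return hard + soft - soft.detach()

# Usage inside the MoE layer, replacing the hard one-hot gate:
#   g = st_gumbel_route(z, tau=1.0)
#   y = (g.unsqueeze(-1) * expert_out).sum(dim=1)
```

PyTorch ships an equivalent built-in, `torch.nn.functional.gumbel_softmax(z, tau=tau, hard=True)`. Lowering `tau` over training makes the relaxation steeper and the gradients closer to the hard decision, at the cost of higher gradient variance.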
Your comparison should address how each approach delivers gradients to the router parameters, the bias and variance of the resulting gradient estimates, and whether the forward pass can remain sparse (only k experts evaluated per token).
Additionally, specify an auxiliary load-balancing loss you would add and the strategies you would use to avoid expert collapse.
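For the load-balancing part, a minimal sketch of one widely used formulation, the Switch Transformer-style auxiliary loss (E times the dot product of the per-expert dispatch fraction and the mean router probability); the function name is an assumption:

```python
import torch
import torch.nn.functional as F

def load_balance_loss(router_logits: torch.Tensor, expert_index: torch.Tensor) -> torch.Tensor:
    """Switch Transformer-style auxiliary load-balancing loss.

    router_logits: (tokens, E) raw router outputs.
    expert_index:  (tokens,)   the expert actually chosen for each token.
    The loss is minimized when tokens are spread uniformly over experts,
    discouraging expert collapse.
    """
    num_experts = router_logits.size(-1)
    probs = F.softmax(router_logits, dim=-1)                   # (tokens, E)
    # f_j: fraction of tokens dispatched to expert j (a non-differentiable count).
    dispatch = F.one_hot(expert_index, num_experts).to(probs.dtype)
    f = dispatch.mean(dim=0)                                   # (E,)
    # p_j: mean routing probability assigned to expert j (differentiable).
    p = probs.mean(dim=0)                                      # (E,)
    return num_experts * torch.sum(f * p)
```

In the Switch Transformer paper this loss is scaled by a small coefficient (α ≈ 0.01) and added to the task loss; because p is differentiable, the router is nudged toward a uniform assignment, which is one standard defense against expert collapse.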