# Backpropagation Through Time (BPTT) - Complete Walkthrough

## Given Values

**Inputs**

- $x_1 = 1.0$
- $x_2 = 2.0$
- Initial hidden state: $h_0 = 0$

**Weights**

- $W_x = 0.5$
- $W_h = 0.1$
- $b = 0.0$
- $W_y = 2.0$

**Target Outputs**

- $y_1^{target} = 1.0$
- $y_2^{target} = 2.0$

**Loss Function**

$$L = \frac{1}{2}(y_1 - y_1^{target})^2 + \frac{1}{2}(y_2 - y_2^{target})^2$$

---

## 🟦 1. FORWARD PASS

### Step 1: t = 1

$$h_1 = \tanh(W_x x_1 + W_h h_0 + b) = \tanh(0.5 \cdot 1 + 0.1 \cdot 0 + 0) = \tanh(0.5) \approx 0.4621$$

$$y_1 = W_y h_1 = 2.0 \cdot 0.4621 = 0.9242$$

### Step 2: t = 2

$$h_2 = \tanh(W_x x_2 + W_h h_1 + b) = \tanh(0.5 \cdot 2 + 0.1 \cdot 0.4621 + 0) = \tanh(1.0462) \approx 0.7803$$

$$y_2 = W_y h_2 = 2.0 \cdot 0.7803 = 1.5607$$

### Loss Calculation

$$L_1 = \frac{1}{2}(0.9242 - 1)^2 = 0.0029$$

$$L_2 = \frac{1}{2}(1.5607 - 2)^2 = 0.0965$$

$$L = L_1 + L_2 = 0.0994$$

---

## 🟥 2. BACKWARD PASS (BPTT)

We compute gradients from t = 2 backward to t = 1.

> ⚠️ Remember: the gradient for $h_1$ has TWO paths:
> 1. The loss at t = 1 (through $y_1$)
> 2. Through $h_2$ (because $h_1$ affects $h_2$)

### 🔻 Step 1: Gradients at t = 2

**dL/dy₂**

$$\frac{\partial L}{\partial y_2} = y_2 - y_2^{target} = 1.5607 - 2 = -0.4393$$

Since $\frac{\partial y_2}{\partial h_2} = W_y = 2$:

$$\frac{\partial L}{\partial h_2} = -0.4393 \cdot 2 = -0.8787$$

**Backprop through tanh**

With the pre-activation $a_2 = W_x x_2 + W_h h_1 + b$:

$$\tanh'(a_2) = 1 - h_2^2 = 1 - (0.7803)^2 = 1 - 0.6089 = 0.3911$$

So:

$$\frac{\partial L}{\partial a_2} = \frac{\partial L}{\partial h_2} \cdot \tanh'(a_2) = -0.8787 \cdot 0.3911 = -0.3437$$

**Gradient w.r.t. W_x**

$$\frac{\partial L}{\partial W_x}\bigg|_{t=2} = \frac{\partial L}{\partial a_2} \cdot x_2 = -0.3437 \cdot 2 = -0.6873$$

**Gradient w.r.t. W_h**

$$\frac{\partial L}{\partial W_h}\bigg|_{t=2} = \frac{\partial L}{\partial a_2} \cdot h_1 = -0.3437 \cdot 0.4621 = -0.1588$$

### 🔻 Step 2: Gradient flows back to h₁

$h_1$ affects both the loss at t = 1 and $h_2$. From t = 2:

$$\frac{\partial L}{\partial h_1}\bigg|_{\text{via } t=2} = \frac{\partial L}{\partial a_2} \cdot W_h = -0.3437 \cdot 0.1 = -0.0344$$

### 🔻 Step 3: Gradients at t = 1

**dL/dy₁**

$$\frac{\partial L}{\partial y_1} = y_1 - y_1^{target} = 0.9242 - 1 = -0.0758$$

$$\frac{\partial L}{\partial h_1}\bigg|_{\text{via } y_1} = -0.0758 \cdot 2 = -0.1516$$

**TOTAL gradient w.r.t. h₁**

$$\frac{\partial L}{\partial h_1} = -0.1516 + (-0.0344) = -0.1860$$

**Backprop through tanh for t = 1**

$$\tanh'(a_1) = 1 - h_1^2 = 1 - (0.4621)^2 = 1 - 0.2135 = 0.7865$$

$$\frac{\partial L}{\partial a_1} = -0.1860 \cdot 0.7865 = -0.1463$$

**Gradient for W_x at t = 1**

$$\frac{\partial L}{\partial W_x}\bigg|_{t=1} = \frac{\partial L}{\partial a_1} \cdot x_1 = -0.1463 \cdot 1 = -0.1463$$

**Gradient for W_h at t = 1**

$$\frac{\partial L}{\partial W_h}\bigg|_{t=1} = \frac{\partial L}{\partial a_1} \cdot h_0 = 0$$

(because $h_0 = 0$)

---

## 🟩 3. Combine Gradients Across Timesteps

$$\frac{\partial L}{\partial W_x} = -0.1463 + (-0.6873) = -0.8336$$

$$\frac{\partial L}{\partial W_h} = -0.1588 + 0 = -0.1588$$

$$\frac{\partial L}{\partial W_y} = h_1 \cdot \frac{\partial L}{\partial y_1} + h_2 \cdot \frac{\partial L}{\partial y_2} = 0.4621 \cdot (-0.0758) + 0.7803 \cdot (-0.4393) = -0.0350 - 0.3428 = -0.3778$$

(The bias follows the same pattern: since $\frac{\partial a_t}{\partial b} = 1$, we get $\frac{\partial L}{\partial b} = \frac{\partial L}{\partial a_1} + \frac{\partial L}{\partial a_2} = -0.1463 - 0.3437 = -0.4900$.)

---

## 🎯 Final Gradients

| Parameter | Gradient |
|-----------|----------|
| $\frac{\partial L}{\partial W_x}$ | -0.8336 |
| $\frac{\partial L}{\partial W_h}$ | -0.1588 |
| $\frac{\partial L}{\partial W_y}$ | -0.3778 |

These would be used to update the weights:

$$W \leftarrow W - \eta \frac{\partial L}{\partial W}$$

where $\eta$ is the learning rate.
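
---

## Verifying the Arithmetic in Code

As a sanity check on the hand calculation, here is a minimal NumPy sketch (assuming NumPy is available) that reproduces both passes. The variable names mirror the symbols in the walkthrough; the script is an illustration of the same arithmetic, not a general BPTT implementation.

```python
# Minimal sketch reproducing the walkthrough's forward pass and BPTT
# gradients; variable names mirror the symbols used in the text.
import numpy as np

# Given values
x = [1.0, 2.0]                     # inputs x_1, x_2
t = [1.0, 2.0]                     # targets y_1^target, y_2^target
W_x, W_h, b, W_y = 0.5, 0.1, 0.0, 2.0
h0 = 0.0

# ---- Forward pass ----
a1 = W_x * x[0] + W_h * h0 + b     # pre-activation at t=1
h1 = np.tanh(a1)                   # ~0.4621
y1 = W_y * h1                      # ~0.9242
a2 = W_x * x[1] + W_h * h1 + b     # pre-activation at t=2
h2 = np.tanh(a2)                   # ~0.7803
y2 = W_y * h2                      # ~1.5607
L = 0.5 * (y1 - t[0])**2 + 0.5 * (y2 - t[1])**2   # ~0.0994

# ---- Backward pass (BPTT) ----
dy2 = y2 - t[1]                    # dL/dy2
da2 = dy2 * W_y * (1 - h2**2)      # dL/da2; tanh'(a2) = 1 - h2^2
dy1 = y1 - t[0]                    # dL/dy1
dh1 = dy1 * W_y + da2 * W_h        # two paths into h1: via y1 and via h2
da1 = dh1 * (1 - h1**2)            # dL/da1

dW_x = da1 * x[0] + da2 * x[1]     # ~ -0.8336
dW_h = da1 * h0 + da2 * h1         # ~ -0.1588
dW_y = dy1 * h1 + dy2 * h2         # ~ -0.3778
db = da1 + da2                     # ~ -0.4900

print(f"L       = {L:.4f}")
print(f"dL/dW_x = {dW_x:.4f}")
print(f"dL/dW_h = {dW_h:.4f}")
print(f"dL/dW_y = {dW_y:.4f}")
print(f"dL/db   = {db:.4f}")
```

The printed values should match the Final Gradients table to within rounding of the intermediate quantities; perturbing a weight by a small $\epsilon$ and re-running the forward pass (a finite-difference check) is a quick way to catch sign errors in derivations like this.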