Consider a neural network block whose output is produced by multiplying together a sequence of trainable weight matrices and then applying the resulting product to an input.
Let the trainable matrices be $W_1, W_2, \dots, W_{i-1}$. Define the cumulative product

$$C_i = W_1 W_2 \cdots W_{i-1}.$$
Given an input vector or mini-batch $X$, the forward pass is

$$Z_i = C_i X = W_1 W_2 \cdots W_{i-1} X.$$
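A minimal sketch of this forward pass (assuming NumPy arrays; the names `forward` and `Ws` are illustrative, not from the statement): by associativity the product can be evaluated right to left, so $C_i$ is never materialized explicitly.

```python
import numpy as np

def forward(Ws, X):
    """Compute Z_i = W_1 W_2 ... W_{i-1} X.

    Ws : list of weight matrices [W_1, ..., W_{i-1}] with compatible
         inner dimensions.
    X  : input vector or mini-batch (one example per column).
    """
    Z = X
    # Multiply right to left so each step is a (weight @ activation)
    # product; the full cumulative product C_i is never formed, which
    # is cheaper whenever X has fewer columns than the weights are wide.
    for W in reversed(Ws):
        Z = W @ Z
    return Z
```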
Assume there is a scalar loss function $L$, and that the upstream gradient

$$G = \frac{\partial L}{\partial Z_i}$$

is provided by the loss function or by later layers.
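For concreteness (a shape convention assumed here, not fixed by the statement): if each $W_k \in \mathbb{R}^{d_k \times d_{k+1}}$ and $X \in \mathbb{R}^{d_i \times m}$ for a mini-batch of $m$ columns, then

$$Z_i \in \mathbb{R}^{d_1 \times m}, \qquad G = \frac{\partial L}{\partial Z_i} \in \mathbb{R}^{d_1 \times m},$$

so $G$ always has the same shape as $Z_i$.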
Derive the backward pass for this block. Specifically:
- Express the gradient with respect to each individual matrix $W_j$, for every $1 \le j < i$.
- Show how the multivariate chain rule applies to the matrix product (a worked sketch follows this list).
- Ensure the resulting gradient $\frac{\partial L}{\partial W_j}$ has the same shape as $W_j$.
- Describe an efficient implementation that avoids recomputing the same prefix and suffix matrix products repeatedly (a code sketch also follows the list).
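One way the chain-rule step can be organized (a sketch; the prefix/suffix symbols $P_j$ and $S_j$ are introduced here for convenience and are not part of the statement): isolate $W_j$ by writing

$$Z_i = P_j \, W_j \, S_j X, \qquad P_j = W_1 \cdots W_{j-1}, \quad S_j = W_{j+1} \cdots W_{i-1},$$

with the conventions $P_1 = I$ and $S_{i-1} = I$. For a linear map $W \mapsto A W B$ with upstream gradient $G$, the multivariate chain rule gives $\partial L / \partial W = A^\top G B^\top$, so

$$\frac{\partial L}{\partial W_j} = P_j^\top \, G \, (S_j X)^\top = P_j^\top \, G \, X^\top S_j^\top.$$

Shape check: with $W_k \in \mathbb{R}^{d_k \times d_{k+1}}$ and $X \in \mathbb{R}^{d_i \times m}$, we have $P_j^\top G \in \mathbb{R}^{d_j \times m}$ and $(S_j X)^\top \in \mathbb{R}^{m \times d_{j+1}}$, so the product has shape $d_j \times d_{j+1}$, matching $W_j$.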
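A minimal NumPy sketch of the efficient scheme under the same assumptions (the names `backward`, `Ws`, `A`, and `Gbar` are illustrative, not from the original): one right-to-left sweep caches every suffix product $S_j X$, and one left-to-right sweep carries $P_j^\top G$, so each weight matrix enters only a constant number of products instead of being re-multiplied for every $j$.

```python
import numpy as np

def backward(Ws, X, G):
    """Gradients dL/dW_j for Z = W_1 W_2 ... W_n X, given upstream G = dL/dZ.

    Ws is the list [W_1, ..., W_n]. Suffix activations and prefix-transported
    gradients are each computed once and reused, rather than reforming the
    same prefix and suffix products for every weight matrix.
    """
    n = len(Ws)
    # Right-to-left sweep: cache A[j], the input Ws[j] sees in the forward
    # pass, i.e. A[j] = Ws[j+1] @ ... @ Ws[n-1] @ X (just X for the last).
    A = [None] * n
    act = X
    for j in range(n - 1, -1, -1):
        A[j] = act
        act = Ws[j] @ act
    # Left-to-right sweep: Gbar is the gradient at the output of Ws[j],
    # i.e. Gbar = (Ws[0] @ ... @ Ws[j-1]).T @ G, starting from Gbar = G.
    grads = [None] * n
    Gbar = G
    for j in range(n):
        grads[j] = Gbar @ A[j].T   # dL/dW_j = P_j^T G (S_j X)^T
        Gbar = Ws[j].T @ Gbar      # transport the gradient past Ws[j]
    return grads

# Shape sanity check on random matrices (illustrative).
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((4, 3)), rng.standard_normal((3, 5))]
X = rng.standard_normal((5, 2))
G = rng.standard_normal((4, 2))
grads = backward(Ws, X, G)
assert all(g.shape == W.shape for g, W in zip(grads, Ws))
```

A finite-difference comparison on small random matrices is a quick way to validate a sketch like this end to end.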